From doug at ewellic.org Sun Feb 1 17:18:18 2015
From: doug at ewellic.org (Doug Ewell)
Date: Sun, 1 Feb 2015 16:18:18 -0700
Subject: N'Ko - which character? 02BC vs. 2019
In-Reply-To:
References:
Message-ID: <34B5AEA2EAC449CA9DDFE10107DAD3A6@DougEwell>

Markus Scherer wrote:

> Dear Unicoders, which is the proper second character in "N'Ko"?
> See below for details.
>
> ---------- Forwarded message ----------
> From: Doug Ewell

For the record, I did not ask on ietf-languages for any re-evaluation of
the apostrophe character used in the name N'Ko. My question, and that of
the group, was about the apostrophes used in the names of Khoisan and
Bantu languages.

--
Doug Ewell | Thornton, CO, USA | http://ewellic.org

From chris.fynn at gmail.com Mon Feb 2 00:12:55 2015
From: chris.fynn at gmail.com (Christopher Fynn)
Date: Mon, 2 Feb 2015 12:12:55 +0600
Subject: N'Ko - which character? 02BC vs. 2019
In-Reply-To: <34B5AEA2EAC449CA9DDFE10107DAD3A6@DougEwell>
References: <34B5AEA2EAC449CA9DDFE10107DAD3A6@DougEwell>
Message-ID:

When used as characters that are part of a word, especially at the
beginning or end of a word, ASCII apostrophes and both right and left
quotation marks easily get changed to something else by the auto-quote
features of word processors.

From Andrew.Glass at microsoft.com Mon Feb 2 12:14:31 2015
From: Andrew.Glass at microsoft.com (Andrew Glass (WINDOWS))
Date: Mon, 2 Feb 2015 18:14:31 +0000
Subject: N'Ko - which character? 02BC vs. 2019
In-Reply-To:
References: <34B5AEA2EAC449CA9DDFE10107DAD3A6@DougEwell>
Message-ID:

For what it's worth, the N'ko Institute of America uses U+2019. But that
is probably a reflection of the font situation and the fact that U+2019 is
often more accessible in word processors.
http://nkoinstitute.com/the-n-character/

_______________________________________________
Unicode mailing list
Unicode at unicode.org
http://unicode.org/mailman/listinfo/unicode

From everson at evertype.com Mon Feb 2 12:36:58 2015
From: everson at evertype.com (Michael Everson)
Date: Mon, 2 Feb 2015 18:36:58 +0000
Subject: N'Ko - which character? 02BC vs. 2019
In-Reply-To:
References:
Message-ID:

On 31 Jan 2015, at 22:04, Markus Scherer wrote:

> Dear Unicoders, which is the proper second character in "N'Ko"?
> See below for details.

U+2019. It is not a letter in N’Ko. Moreover, the reference fonts for
N’Ko didn’t even have U+02BC. For N’Ko, this is not arguable.

I would like to point out (perhaps again) that in my Hawaiian, Samoan,
and Tongan editions of Alice’s Adventures in Wonderland, and in the
forthcoming Hawaiian Hobbit, U+02BB has been drawn 133% taller, but of
the same width, as U+2018. I believe this really must be considered good
practice. In these novels, with “quotation marks ‘and nested quotation
marks’,” making this distinction is really rather essential.

Michael Everson * http://www.evertype.com/

From verdy_p at wanadoo.fr Mon Feb 2 12:54:55 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 2 Feb 2015 19:54:55 +0100
Subject: N'Ko - which character? 02BC vs. 2019
In-Reply-To:
References: <34B5AEA2EAC449CA9DDFE10107DAD3A6@DougEwell>
Message-ID:

On this page the N'ko Institute hesitates and uses U+2018 (‘) in English,
i.e. the reverse direction.

This has the advantage that it is used immediately after the letter N/n,
and if it ever appears at the end of a word, it won't be matched as one of
a pair of single quotation marks. (U+2018 is punctuation only at the start
of a line, or after whitespace or punctuation. U+2019 after a letter is
not always a quotation mark: even when it is followed by whitespace or
punctuation, it may be an orthographic apostrophe.)

-------------- next part --------------
An HTML attachment was scrubbed...

From verdy_p at wanadoo.fr Mon Feb 2 12:55:17 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 2 Feb 2015 19:55:17 +0100
Subject: N'Ko - which character? 02BC vs. 2019
In-Reply-To:
References: <34B5AEA2EAC449CA9DDFE10107DAD3A6@DougEwell>
Message-ID:

The link did not come through: http://nkoinstitute.com/nko-alphabet/

-------------- next part --------------
An HTML attachment was scrubbed...
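
For readers following the thread, here is a minimal sketch (not taken from any of the posts) of how the candidate characters differ in the Unicode Character Database, using Python's standard `unicodedata` module. The code points are the ones named in the thread, plus U+0027 for the ASCII apostrophe; the helper name `describe` is made up for illustration.

```python
import unicodedata

# Code points discussed in the thread; U+0027 is the ASCII apostrophe.
CANDIDATES = ["\u0027", "\u2018", "\u2019", "\u02BB", "\u02BC"]


def describe(ch):
    """Return (code point label, character name, general category)."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


if __name__ == "__main__":
    for ch in CANDIDATES:
        code, name, category = describe(ch)
        # Lm = modifier letter; Pi/Pf/Po = initial/final/other punctuation
        print(f"{code}  {category}  {name}")
```

The split between modifier letters (U+02BB, U+02BC) and punctuation (U+0027, U+2018, U+2019) is the property that word-boundary and auto-quote logic would typically key on, which is presumably why the choice matters for a name like N'Ko.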
From kent.karlsson14 at telia.com Mon Feb 2 17:31:11 2015
From: kent.karlsson14 at telia.com (Kent Karlsson)
Date: Tue, 03 Feb 2015 00:31:11 +0100
Subject: N'Ko - which character? 02BC vs. 2019
In-Reply-To:
Message-ID:

On 2015-02-02 19:36, "Michael Everson" wrote:

> Hawaiian Hobbit, U+02BB has been drawn 133% taller, but of the same width, as
> U+2018. I believe this really must be considered good practice. In these

I think you mean 33 % taller, i.e. height 133 % relative to its "normal"
height. 133 % taller would be more than double its normal height, making
it about as tall as an uppercase letter... That would be excessive...

/Kent K

From everson at evertype.com Mon Feb 2 18:00:17 2015
From: everson at evertype.com (Michael Everson)
Date: Tue, 3 Feb 2015 00:00:17 +0000
Subject: N'Ko - which character? 02BC vs. 2019
In-Reply-To:
References:
Message-ID: <17FCA56A-E578-4031-A885-8FB5AD8A853D@evertype.com>

On 2 Feb 2015, at 23:31, Kent Karlsson wrote:

>> Hawaiian Hobbit, U+02BB has been drawn 133% taller, but of the same width, as
>> U+2018. I believe this really must be considered good practice. In these
>
> I think you mean 33 % taller, i.e. height 133 % relative to its "normal"
> height. 133 % taller would be more than double its normal height, making
> it about as tall as an uppercase letter... That would be excessive…

Yes, that’s right. I just type “133” into the font editor.

Michael Everson * http://www.evertype.com/

From ishida at w3.org Wed Feb 4 06:40:01 2015
From: ishida at w3.org (Richard Ishida)
Date: Wed, 04 Feb 2015 12:40:01 +0000
Subject: Bopomofo light tone mark on the Web
Message-ID: <54D21321.8030904@w3.org>

At the W3C we are trying to understand how to handle the bopomofo in
phonetic annotations (for the CSS Ruby spec).

Please see a write-up of the background and some relevant questions at
http://rishida.net/scripts/bopomofo/ontheweb

A key question relates to the light tone.

The light tone comes out of most IMEs after the bopomofo letters, and is
displayed there by, for example, Keynote's phonetic guide function. In
pretty much all the vertical bopomofo we have seen, however, and in pretty
much all dictionaries we have seen (horizontally or vertically set), the
light tone is displayed before the bopomofo letters.

Note that modern dictionaries appear to be actually moving the character
code into first position in the syllable to achieve this.

We'd like to know:

1. is anyone aware of any ruling about where the light tone should appear
and/or be stored in the text stream?

2. does it (really) matter if text sometimes contains the light tone
character before the syllable and sometimes trailing, depending on where
people prefer to put it?

(Obviously, there's a theoretical issue for sorting and searching if it is
sometimes in one place and sometimes in another, but it may be that both
places are actually viable positions.)

3. is there any font/rendering software out there that makes the light
tone appear at the start of a syllable, when the character is actually at
the end of the syllable?

cheers,
ri

From verdy_p at wanadoo.fr Wed Feb 4 19:45:48 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 5 Feb 2015 02:45:48 +0100
Subject: Bopomofo light tone mark on the Web
In-Reply-To: <54D21321.8030904@w3.org>
References: <54D21321.8030904@w3.org>
Message-ID:

Does it really matter, given that the sign is written orthogonally to the
direction of writing of the bopomofo line? Does it have to be a combining
character when it could be a standard spacing character on that line, so
that users can place it before or after? (For collation it would be a
problem only at the tertiary level; it can be ignorable at the first and
second levels.)

Wouldn't the common middle dot be usable?
Or could a variant be encoded after the specific Bopomofo light tone
spacing mark, to indicate its preferred placement ("above" or "below",
probably with "above" being the default) in the vertical writing style
(this variant being ignored for the horizontal writing style, for example
in an IME)?

-------------- next part --------------
An HTML attachment was scrubbed...

From verdy_p at wanadoo.fr Wed Feb 4 19:52:48 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 5 Feb 2015 02:52:48 +0100
Subject: Bopomofo light tone mark on the Web
In-Reply-To:
References: <54D21321.8030904@w3.org>
Message-ID:

An alternative could be to encode TWO separate tone marks:

- one for the usual mode, where it appears to the right (horizontal
  writing) or at the top (vertical writing) and is then a combining
  character.

- one for the alternate "phonetic" mode, where it is forced always to
  appear before the letters: it would be a spacing mark, encoded in
  reading and typing order (but for this specific usage the common middle
  dot would be enough, and would work in both writing directions).

-------------- next part --------------
An HTML attachment was scrubbed...
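
As an illustration of the sorting and searching concern in question 2 above, here is a minimal sketch of normalizing the light tone to one agreed position before comparing strings. It rests on assumptions not stated in the thread: that the light tone is stored as U+02D9 DOT ABOVE, and that the caller has already split the text into single syllables. The function name `normalize_light_tone` is hypothetical.

```python
LIGHT_TONE = "\u02D9"  # DOT ABOVE, assumed here to be the stored light tone mark


def normalize_light_tone(syllable, position="trailing"):
    """Move a leading or trailing light tone mark to one agreed position.

    `syllable` is assumed to be a single bopomofo syllable that the caller
    has already segmented; segmentation is out of scope for this sketch.
    """
    if position not in ("leading", "trailing"):
        raise ValueError("position must be 'leading' or 'trailing'")
    if syllable.startswith(LIGHT_TONE):
        letters = syllable[len(LIGHT_TONE):]
    elif syllable.endswith(LIGHT_TONE):
        letters = syllable[:-len(LIGHT_TONE)]
    else:
        return syllable  # no light tone mark at either edge: nothing to move
    if position == "leading":
        return LIGHT_TONE + letters
    return letters + LIGHT_TONE
```

With both stored orders mapped to the same form, a search index or sort key no longer depends on where the writer chose to put the mark; how the mark is then displayed is a separate rendering question.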
From jf at colson.eu Fri Feb 6 07:30:32 2015
From: jf at colson.eu (Jean-François Colson)
Date: Fri, 06 Feb 2015 14:30:32 +0100
Subject: Wrong plane numbers
Message-ID: <54D4C1F8.9050908@colson.eu>

In the file NamesList.txt, I see:

@@ 2FF80 Unassigned 2FFFF
@@ 3FF80 Unassigned 3FFFF
@@ 4FF80 Unassigned 4FFFF
@@ 5FF80 Unassigned 5FFFF
@@ 6FF80 Unassigned 6FFFF
@@ 7FF80 Unassigned 7FFFF
@@ 8FF80 Unassigned 8FFFF
@@ 9FF80 Unassigned 9FFFF
@@ AFF80 Unassigned AFFFF
@@ BFF80 Unassigned BFFFF
@@ CFF80 Unassigned CFFFF
@@ DFF80 Unassigned DFFFF
@@ EFF80 Unassigned EFFFF
@@ FFF80 Supplementary Private Use Area-A FFFFF
@@ 10FF80 Supplementary Private Use Area-B 10FFFF

Shouldn’t

2FF80 3FF80 4FF80 5FF80 6FF80 7FF80 8FF80 9FF80 AFF80 BFF80 CFF80
DFF80 EFF80 FFF80 10FF80

become

20000 30000 40000 50000 60000 70000 80000 90000 A0000 B0000 C0000
D0000 E01F0 F0000 100000

?

From jf at colson.eu Fri Feb 6 07:33:36 2015
From: jf at colson.eu (Jean-François Colson)
Date: Fri, 06 Feb 2015 14:33:36 +0100
Subject: Wrong plane numbers
In-Reply-To: <54D4C1F8.9050908@colson.eu>
References: <54D4C1F8.9050908@colson.eu>
Message-ID: <54D4C2B0.8090907@colson.eu>

On 06/02/15 14:30, Jean-François Colson wrote:

> In the file NamesList.txt, I see:
>
> @@ 2FF80 Unassigned 2FFFF
> @@ 3FF80 Unassigned 3FFFF
> @@ 4FF80 Unassigned 4FFFF
> @@ 5FF80 Unassigned 5FFFF
> @@ 6FF80 Unassigned 6FFFF
> @@ 7FF80 Unassigned 7FFFF
> @@ 8FF80 Unassigned 8FFFF
> @@ 9FF80 Unassigned 9FFFF
> @@ AFF80 Unassigned AFFFF
> @@ BFF80 Unassigned BFFFF
> @@ CFF80 Unassigned CFFFF
> @@ DFF80 Unassigned DFFFF
> @@ EFF80 Unassigned EFFFF
> @@ FFF80 Supplementary Private Use Area-A FFFFF
> @@ 10FF80 Supplementary Private Use Area-B 10FFFF
>
>
> Shouldn’t
>
> 2FF80 3FF80 4FF80 5FF80 6FF80 7FF80 8FF80 9FF80 AFF80 BFF80 CFF80
> DFF80 EFF80 FFF80 10FF80
>
> become
>
> 20000 30000 40000 50000 60000 70000 80000 90000 A0000 B0000 C0000
> D0000 E01F0 F0000 100000
>

Of course I meant 2FA1E, not 20000…

> ?
>

From markus.icu at gmail.com Fri Feb 6 09:06:14 2015
From: markus.icu at gmail.com (Markus Scherer)
Date: Fri, 6 Feb 2015 07:06:14 -0800
Subject: Wrong plane numbers
In-Reply-To: <54D4C2B0.8090907@colson.eu>
References: <54D4C1F8.9050908@colson.eu> <54D4C2B0.8090907@colson.eu>
Message-ID:

These are not block boundaries. These lines are for book chart
production, where we don't need to print every unassigned code point.

markus

-------------- next part --------------
An HTML attachment was scrubbed...

From jf at colson.eu Fri Feb 6 09:15:57 2015
From: jf at colson.eu (Jean-François Colson)
Date: Fri, 06 Feb 2015 16:15:57 +0100
Subject: Wrong plane numbers
In-Reply-To:
References: <54D4C1F8.9050908@colson.eu> <54D4C2B0.8090907@colson.eu>
Message-ID: <54D4DAAD.8060407@colson.eu>

On 06/02/15 16:06, Markus Scherer wrote:

>
> These are not block boundaries. These lines are for book chart
> production, where we don't need to print every unassigned code point.
> markus
>

OK. But what about

@@ FFF80 Supplementary Private Use Area-A FFFFF

?

The Supplementary Private Use Area-A doesn’t begin at FFF80: it begins at
F0000. It doesn’t end at FFFFF: it ends at FFFFD.

In

@@ 1E00 Latin Extended Additional 1EFF

1E00 and 1EFF are the limits of the block “Latin Extended Additional”.

Why isn’t it so with

@@ FFF80 Supplementary Private Use Area-A FFFFF

?

From kenwhistler at att.net Fri Feb 6 09:50:21 2015
From: kenwhistler at att.net (Ken Whistler)
Date: Fri, 06 Feb 2015 07:50:21 -0800
Subject: Wrong plane numbers
In-Reply-To: <54D4DAAD.8060407@colson.eu>
References: <54D4C1F8.9050908@colson.eu> <54D4C2B0.8090907@colson.eu> <54D4DAAD.8060407@colson.eu>
Message-ID: <54D4E2BD.6080305@att.net>

Markus has already explained this. But the following explanation fills
out some details.

These @@ lines are conveniences for chart production.
They are headers read by the unibook chart layout tool, which help guide
where chart layout for a block starts and stops.

The @@ lines are *NOT* block boundary definitions. So please do not try
to interpret them as such.

The normative definitions of block boundaries can be found in:

http://www.unicode.org/Public/UCD/latest/ucd/Blocks.txt

Incidentally, the block for the Supplementary Private Use Area-A *does*
end at FFFFF, not at FFFFD, as demonstrated by the *normative* block
definition from Blocks.txt:

F0000..FFFFF; Supplementary Private Use Area-A

The syntax used in the NamesList.txt file to drive chart production is
fully described in:

http://www.unicode.org/Public/UCD/latest/ucd/NamesList.html

Much of the content of NamesList.txt is indirectly normative, of course,
because it is used to generate the code charts for versions of the
Unicode Standard, but much of the content of the file is just markup that
assists in the layout of the charts and/or various informative,
annotational material. It is *NOT* safe or recommended to attempt to
reverse engineer content out of the bare text file, nor to try to infer
content implications for the standard by extracting it from the bare text
file.

Also, the unibook tool is regularly used by proposal writers to do chart
layout for encoding proposals, where block definitions obviously do not
even exist yet. The @@ header lines are used there, too, to specify
ranges used in the charts for the proposals.

By the way, there is over fifteen years of development history here for
the interaction of syntax in NamesList.txt and the ongoing maintenance
of the unibook chart production tool. The mismatch between @@ blockheader
ranges and normative block definitions has been noted (and explained)
a number of times now.

--Ken

From jf at colson.eu Fri Feb 6 10:21:15 2015
From: jf at colson.eu (Jean-François Colson)
Date: Fri, 06 Feb 2015 17:21:15 +0100
Subject: Not so wrong plane numbers
In-Reply-To: <54D4E2BD.6080305@att.net>
References: <54D4C1F8.9050908@colson.eu> <54D4C2B0.8090907@colson.eu> <54D4DAAD.8060407@colson.eu> <54D4E2BD.6080305@att.net>
Message-ID: <54D4E9FB.4020706@colson.eu>

On 06/02/15 16:50, Ken Whistler wrote:

> By the way, there is over fifteen years of development history here for
> the interaction of syntax in NamesList.txt and the ongoing maintenance
> of the unibook chart production tool. The mismatch between @@ blockheader
> ranges and normative block definitions has been noted (and explained)
> a number of times now.

OK. Sorry for the noise…

From jf at colson.eu Fri Feb 6 10:22:31 2015
From: jf at colson.eu (Jean-François Colson)
Date: Fri, 06 Feb 2015 17:22:31 +0100
Subject: Not so wrong plane numbers
In-Reply-To: <54D4E2BD.6080305@att.net>
References: <54D4C1F8.9050908@colson.eu> <54D4C2B0.8090907@colson.eu> <54D4DAAD.8060407@colson.eu> <54D4E2BD.6080305@att.net>
Message-ID: <54D4EA47.2060302@colson.eu>

On 06/02/15 16:50, Ken Whistler wrote:

> By the way, there is over fifteen years of development history here for
> the interaction of syntax in NamesList.txt and the ongoing maintenance
> of the unibook chart production tool. The mismatch between @@ blockheader
> ranges and normative block definitions has been noted (and explained)
> a number of times now.

OK. Sorry for the noise…

-------------- next part --------------
An HTML attachment was scrubbed...

From alfred_z at web.de Sun Feb 8 14:15:38 2015
From: alfred_z at web.de (Alfred Zett)
Date: Sun, 08 Feb 2015 21:15:38 +0100
Subject: Unicode block for programming related symbols and codepoints?
Message-ID: <54D7C3EA.6080000@web.de>

Hello everyone,

is there such a Unicode block for programming related codepoints?

Conventional search engines as well as Wolfram Alpha can't answer that,
with the former one leading to all the programming problems that occur...

If such a block doesn't exist, I'd like to make a proposal - if possible -
to add one with at least the following codepoints/characters:

- Indentation codepoint, with no fixed defined graphical representation.
For indentation based programming languages.
Because:
-- specific clients may want to show it differently (for example as
arrows, lines etc., using another color):
--- browsers could let the web page creator decide the visual
representation (character and size) via CSS
--- the same with editors, independent from the actual font
--- in case of visual impairment, the user could even change the acoustic
representation if the editor allows it
-- unlike a space symbol, it wouldn't need more than one character per
indentation
-- unlike tabs or space, it wouldn't be whitespace
-- unlike normal arrow characters, one could customize the length in an
editor and wouldn't have to insert extra spaces for a better visual
imagery

- A codepoint for string literal quotes, that would spare one the
escaping.

- A statement separator symbol.

- Other ideas?

You may now think, this is highly specific and you are right.
However, so are EMOJI signs, in particular those like PINE DECORATION.

These days, there are a lot of tools to create small embedded scripting
languages and DSLs, which are used in-program in special editors. And
there are a lot of people using them.
Exactly these could really profit from such a code block instead of using
conventional ASCII subset characters.
Also, there is a lot of potential with really good text editors and IDEs
where semantics may matter a lot.

Excuse my English, I hope this was understandable.

Best regards,

A. Z.

From olopierpa at gmail.com Sun Feb 8 15:32:03 2015
From: olopierpa at gmail.com (Pierpaolo Bernardi)
Date: Sun, 8 Feb 2015 22:32:03 +0100
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To: <54D7C3EA.6080000@web.de>
References: <54D7C3EA.6080000@web.de>
Message-ID:

On Sun, Feb 8, 2015 at 9:15 PM, Alfred Zett wrote:

> Hello everyone,
>
> is there such a Unicode block for programming related codepoints?
>
> Conventional search engines as well as Wolfram Alpha can't answer that, with
> the former one leading to all the programming problems that occur...
>
> If such a block doesn't exist, I'd like to make a proposal - if possible -
> to add one with at least the following codepoints/characters:

Once upon a time I would have said that this is out of scope for
Unicode. But now anything goes, so who knows.

> - Indentation codepoint, with no fixed defined graphical representation. For
> indentation based programming languages.
> Because:
> -- specific clients may want to show it differently (for example as arrows,
> lines etc., using another color):
> --- browsers could let the web page creator decide the visual
> representation (character and size) via CSS
> --- the same with editors, independent from the actual font
> --- in case of visual impairment, the user could even change the acoustic
> representation if the editor allows it
> -- unlike a space symbol, it wouldn't need more than one character per
> indentation
> -- unlike tabs or space, it wouldn't be whitespace
> -- unlike normal arrow characters, one could customize the length in an
> editor and wouldn't have to insert extra spaces for a better visual imagery

A tab is exactly what you described.

> - A codepoint for string literal quotes, that would spare one the escaping.

How would this work exactly?

> - A statement separator symbol.

What's wrong with ; , . : # % ^ & and the hundreds of other punctuation
symbols?

Cheers
P.

From jf at colson.eu Sun Feb 8 15:51:58 2015
From: jf at colson.eu (Jean-François Colson)
Date: Sun, 08 Feb 2015 22:51:58 +0100
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To: <54D7C3EA.6080000@web.de>
References: <54D7C3EA.6080000@web.de>
Message-ID: <54D7DA7E.6010009@colson.eu>

On 08/02/15 21:15, Alfred Zett wrote:

> Hello everyone,
>
> is there such a Unicode block for programming related codepoints?
>
> Conventional search engines as well as Wolfram Alpha can't answer
> that, with the former one leading to all the programming problems that
> occur...
>
> If such a block doesn't exist, I'd like to make a proposal - if
> possible - to add one with at least the following codepoints/characters:
>
> - Indentation codepoint, with no fixed defined graphical
> representation. For indentation based programming languages.

That wouldn’t be compatible with existing languages, and future languages
might use any existing character.

> Because:
> -- specific clients may want to show it differently (for example as
> arrows, lines etc., using another color):

Can’t good editors display tabs in a different color when required?

> --- browsers could let the web page creator decide the visual
> representation (character and size) via CSS
> --- the same with editors, independent from the actual font
> --- in case of visual impairment, the user could even change the
> acoustic representation if the editor allows it
> -- unlike a space symbol, it wouldn't need more than one character per
> indentation
> -- unlike tabs or space, it wouldn't be whitespace
> -- unlike normal arrow characters, one could customize the length in
> an editor and wouldn't have to insert extra spaces for a better visual
> imagery
>
> - A codepoint for string literal quotes, that would spare one the
> escaping.

I rarely escape quotes.
In a text, I use ’ (U+2019) as an apostrophe and ?????? as quotes, so I
don’t need to escape them.
When I use PHP to generate some HTML code, I try to alternate single and
double quotes as much as possible. That way I rarely need to escape them.

> - A statement separator symbol.

To replace the semicolon in C and the languages based on its syntax?

> - Other ideas?

Aren’t you trying to reinvent APL?

>
> You may now think, this is highly specific and you are right.
> However, so are EMOJI signs, in particular those like PINE DECORATION.
>
> These days, there are a lot of tools to create small embedded
> scripting languages and DSLs, which are used in-program in special
> editors. And there are a lot of people using them.
> Exactly these could really profit from such a code block instead of
> using conventional ASCII subset characters.
> Also, there is a lot of potential with really good text editors and
> IDEs where semantics may matter a lot.
>
> Excuse my English, I hope this was understandable.
>
> Best regards,
>
> A. Z.

-------------- next part --------------
An HTML attachment was scrubbed...

From jf at colson.eu Sun Feb 8 16:02:05 2015
From: jf at colson.eu (Jean-François Colson)
Date: Sun, 08 Feb 2015 23:02:05 +0100
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To:
References: <54D7C3EA.6080000@web.de>
Message-ID: <54D7DCDD.9060003@colson.eu>

On 08/02/15 22:32, Pierpaolo Bernardi wrote:

> On Sun, Feb 8, 2015 at 9:15 PM, Alfred Zett wrote:
> […]
> > -- unlike tabs or space, it wouldn't be whitespace
> […]
>
> A tab is exactly what you described.

Not exactly: a tab IS whitespace.
It may sometimes be displayed in a different color or with a special
symbol on request if the editor allows it, but in most cases it is
whitespace.

From alfred_z at web.de Sun Feb 8 16:07:52 2015
From: alfred_z at web.de (Alfred Zett)
Date: Sun, 08 Feb 2015 23:07:52 +0100
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To: <54D7D37B.8090900@colson.eu>
References: <54D7C3EA.6080000@web.de> <54D7D37B.8090900@colson.eu>
Message-ID: <54D7DE38.7090300@web.de>

Hi Jean-Francois Colson,

I hope this doesn't mess up the mailing list.

>>
>> - Indentation codepoint, with no fixed defined graphical
>> representation. For indentation based programming languages.
>
> That wouldn’t be compatible with existing languages, and future
> languages might use any existing character.

This was for new languages. Creators of future languages mostly orient
themselves on whatever is available and makes sense, so I may as well make
this proposal, so that they don't have to choose the half-assed
workarounds they use now.
Also, as long as there is stuff like
https://github.com/sferik/active_emoji
it still makes more sense.
>> Because:
>> -- specific clients may want to show it different (for example as
>> arrows, lines etc., using another color):
>
> Can’t good editors display tabs in a different color when required ?

Not as reliably and customizably as with a special codepoint. For example

>
>> --- browsers could let the web page creator let decide the visual
>> representation (character and size) via CSS

can't be done, and on-the-fly copy and paste conversion with JavaScript is horrid and broken for security reasons.
But it's an issue even in good editors as well. You need a lexing plugin that may work or not. And the size and other factors are still fixed. After all, tabs have whitespace semantics and may appear anywhere in the text.

>> --- the same with editors, independent from the actual font
>> --- in case of visual impairment, the user could even change the
>> accoustical representation if the editor allows it
>> -- unlike a space symbol, it wouldn't need more than one character
>> per indentation
>> -- unlike tabs or space, it wouldn't be whitespace
>> -- unlike normal arrow characters, one could customize the length in
>> an editor and wouldn't have to insert extra spaces for a better
>> visual imagery
>>
>> - A codepoint for string literal quotes, that would spare one the
>> escaping.
>
> I rarely escape quotes.
> In a text, I use ’ (U+2019) as an apostrophe and ?????? as quotes, so
> I don’t need to escape them.
> When I use PHP to generate some HTML code, I try to alternate simple
> and double quotes as much as possible. That way I rarely need to
> escape them.

OK, but that's just your scenario. With a language design from the past. With probably an editor from the past that allows non-Unicode encodings. In a better world, inserting code points manually would be a last resort.

Imagine someone wants to make his text look as if it were written with a typewriter. Or something else.

>
>> - A statement separator symbol.
>
> To replace the semicolon in C and the languages based on its syntax?
Again, for future uses. To be honest, this might sound questionable, but this could blur the line between visual line breaks and visual characters like semicolons.
Line-break ended comments are separator ended comments.
Of course, that's the least required of the three proposed characters, but I thought that, for the sake of completeness, it shouldn't be missing.

Come to think of it, two sets of opening and closing block symbols couldn't hurt either. And a continue-after-linebreak symbol as well.

>
>> - Other ideas?
>
> Aren’t you trying to reinvent APL?
>

No. APL puts a lot of characters into your code that look alien and annoying to anyone except mathematicians and that are hard to input. In particular from the context.

My proposal on the other hand - if implemented right - introduces some really intuitive looking and easy to input characters, because a bold arrow at the left doesn't need further explanation and your IDE of the future can easily place them when pressing tab in the right position.

From alfred_z at web.de Sun Feb 8 16:27:46 2015
From: alfred_z at web.de (Alfred Zett)
Date: Sun, 08 Feb 2015 23:27:46 +0100
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To:
References: <54D7C3EA.6080000@web.de>
Message-ID: <54D7E2E2.6080705@web.de>

Hi Pierpaolo Bernardi,

given that you included my address as well as the Unicode list address, I'm doing the same.

> On Sun, Feb 8, 2015 at 9:15 PM, Alfred Zett wrote:
>> Hello everyone,
>>
>> is there such a unicode block for programming related codepoints?
>>
>> Conventional search engines as well as wolfram alpha can't answer that, with
>> the former one leading to all the programming problems that occur...
>>
>> If such a block doesn't exist, I'd like to make a proposal - if possible -
>> to add one with at least the following codepoints/characters:
> Once upon a time I would have said that this is out of scope for
> Unicode. But now anything goes, so who knows.
That was exactly my thought, so I figured it couldn't harm to have these comfy special characters in there :) >> - Indentation codepoint, with no fixed defined graphical representation. For >> indentation based programming languages. >> Because: >> -- specific clients may want to show it different (for example as arrows, >> lines etc., using another color): >> --- browsers could let the web page creator let decide the visual >> representation (character and size) via CSS >> --- the same with editors, independent from the actual font >> --- in case of visual impairment, the user could even change the accoustical >> representation if the editor allows it >> -- unlike a space symbol, it wouldn't need more than one character per >> indentation >> -- unlike tabs or space, it wouldn't be whitespace >> -- unlike normal arrow characters, one could customize the length in an >> editor and wouldn't have to insert extra spaces for a better visual imagery > a Tab is exactly what you described. No. It's only half of what I described. It's still a typographical character that implies whitespace and may appear everywhere in the text. Custom size behavior (but not too custom) is the only similarity to that indentation character. > >> - A codepoint for string literal quotes, that would spare one the escaping. > How would this work exactly? Imagine you type " in your IDE, but because your IDE does know that this new programming language requires this special character as literal token, it replaces it with a special looking quotation mark. Now you are free to type any type of quotation mark until you hit ESC or something which places a closing special quotation mark and your caret right to it. Of course, IDEs could render this without special marks and a different background colour instead; or whatever float the IDE creators boat. >> - A statement separator symbol. > What's wrong with ; , . : # % ^ & and other hundreds of punctuation symbols? 
Nothing, they are just semantically not as nice and customizable.

Best regards
A.Z.

From shervinafshar at gmail.com Sun Feb 8 16:36:14 2015
From: shervinafshar at gmail.com (Shervin Afshar)
Date: Sun, 8 Feb 2015 14:36:14 -0800
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To: <54D7C3EA.6080000@web.de>
References: <54D7C3EA.6080000@web.de>
Message-ID:

All of the requirements mentioned here can be (and are) implemented in higher levels of software (like IDEs). IMO, there isn't any need for adding new characters to Unicode to address these issues.

Additionally, people tend to forget that simply because Unicode is doing emoji out of compatibility (or other) requirements, it does not mean that "now anything goes". I refer folks to TR51[1] (specifically sections 1.3, 8, and Annex C).

[1]: http://www.unicode.org/reports/tr51

Shervin

On Sun, Feb 8, 2015 at 12:15 PM, Alfred Zett wrote:

> Hello everyone,
>
> is there such a unicode block for programming related codepoints?
>
> Conventional search engines as well as wolfram alpha can't answer that,
> with the former one leading to all the programming problems that occur...
>
> If such a block doesn't exist, I'd like to make a proposal - if possible -
> to add one with at least the following codepoints/characters:
>
> - Indentation codepoint, with no fixed defined graphical representation.
> For indentation based programming languages.
> Because: > -- specific clients may want to show it different (for example as arrows, > lines etc., using another color): > --- browsers could let the web page creator let decide the visual > representation (character and size) via CSS > --- the same with editors, independent from the actual font > --- in case of visual impairment, the user could even change the > accoustical representation if the editor allows it > -- unlike a space symbol, it wouldn't need more than one character per > indentation > -- unlike tabs or space, it wouldn't be whitespace > -- unlike normal arrow characters, one could customize the length in an > editor and wouldn't have to insert extra spaces for a better visual imagery > > - A codepoint for string literal quotes, that would spare one the escaping. > - A statement separator symbol. > - Other ideas? > > You may now think, this is highly specific and you are right. > However, so are EMOJI signs, in particular those like PINE DECORATION. > > These days, there are a lot of tools to create small embedded scripting > languages and DSLs, which are used in-program in special editors. And there > is a lot of people using them. > Exactly these could really profit from such a codeblock instead of using > conventional ASCII subset characters. > Also, there is a lot of potential with really good text editors and IDEs > where semantics may matter a lot. > > Excuse my english, I hope this was understandable. > > Best regards, > > A. Z. > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jf at colson.eu Sun Feb 8 16:45:27 2015 From: jf at colson.eu (=?UTF-8?B?SmVhbi1GcmFuw6dvaXMgQ29sc29u?=) Date: Sun, 08 Feb 2015 23:45:27 +0100 Subject: Unicode block for programming related symbols and codepoints? 
In-Reply-To: <54D7DE38.7090300@web.de>
References: <54D7C3EA.6080000@web.de> <54D7D37B.8090900@colson.eu> <54D7DE38.7090300@web.de>
Message-ID: <54D7E707.10104@colson.eu>

On 08/02/15 23:07, Alfred Zett wrote:
> Hi Jean-Francois Colson,
>
> I hope this doesn't mess up the mailing list.
>
>>>
>>> - Indentation codepoint, with no fixed defined graphical
>>> representation. For indentation based programming languages.
>>
>> That wouldn’t be compliant with existing languages and future
>> languages might use any existing character.
>
> This was for new languages. Creators of future languages mostly orient
> on whatever is available and make sense, so I may make this proposal
> as well, so they don't have to choose the half-assed workarounds they
> use now.

I need a few tens of characters for a conlang I’m developing. ?

The problem is that Unicode only encodes characters which are actually used today or which have been used in the past. It doesn’t encode characters which could perhaps be used in a hypothetical new programming language in the future.

>
> Also, as long as there is stuff like
> https://github.com/sferik/active_emoji it still makes more sense.
>
>>> Because:
>>> -- specific clients may want to show it different (for example as
>>> arrows, lines etc., using another color):
>>
>> Can’t good editors display tabs in a different color when required ?
> Not as reliable and customizable as a special codepoint. For example
>
>>
>>> --- browsers could let the web page creator let decide the visual
>>> representation (character and size) via CSS
>
> can't be done and on-the-fly copy and paste conversion with JavaScript
> is horrid and broken for security reasons.
> But it's an issue even in good editors as well. You need a lexing
> plugin that may work or not. And the size and other factors are still
> fixed. After all, tabs have whitespace semantics that may appear
> everywhere in the text.
> >>> --- the same with editors, independent from the actual font >>> --- in case of visual impairment, the user could even change the >>> accoustical representation if the editor allows it >>> -- unlike a space symbol, it wouldn't need more than one character >>> per indentation >>> -- unlike tabs or space, it wouldn't be whitespace >>> -- unlike normal arrow characters, one could customize the length in >>> an editor and wouldn't have to insert extra spaces for a better >>> visual imagery >>> >>> - A codepoint for string literal quotes, that would spare one the >>> escaping. >> >> I rarely escape quotes. >> In a text, I use ? (U+2019) as an apostrophe and ?????? as quotes, so >> I don?t need to escape them. >> When I use PHP to generate some HTML code, I try to alternate simple >> and double quotes as much as possible. That way I rarely need to >> escape them. > OK, but that's just your scenario. With a language design from the > past. With probably an editor from the past that allows non-unicode > encodings. In a better world, manual code point inserting was a last > resort. > > Imagine someone wants to make his text look like written with a > typewriter. Or something else. > >> >>> - A statement separator symbol. >> >> To replace the semicolon in C and the languages based on its syntax? > Again, for future uses. To be honest, this might sound questionable, > but this could blur the line between visual line breaks and visual > characters like semicolons. > Line-break ended comments are separator ended comments. > Of course, that's the least required part of those three proposed > characters, but I thought for the sake and completeness that shouldn't > miss. > > Come to think of it, two sets of opening and closing block symbols > couldn't harm either. And a continue-after-linebreak symbol as well. > >> >>> - Other ideas? >> >> Aren?t you trying to reinvent APL? >> > No. 
APL places a lot of alien-looking, annoying characters to anyone > except mathematicians into your code that are hard to input. In > particular from the context. > > My proposal on the other hand - if implemented right - introduces some > really intuitive looking and easy to input characters, because a bold > arrow at the left doesn't need further explanation and your IDE of the > future can easily place them when pressing tab in the right position. > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode From olopierpa at gmail.com Sun Feb 8 16:54:11 2015 From: olopierpa at gmail.com (Pierpaolo Bernardi) Date: Sun, 8 Feb 2015 23:54:11 +0100 Subject: Unicode block for programming related symbols and codepoints? In-Reply-To: <54D7E2E2.6080705@web.de> References: <54D7C3EA.6080000@web.de> <54D7E2E2.6080705@web.de> Message-ID: On Sun, Feb 8, 2015 at 11:27 PM, Alfred Zett wrote: > That was exactly my thought, so I figured it couldn't harm to have these >> a Tab is exactly what you described. > > No. It's only half of what I described. > It's still a typographical character that implies whitespace and may appear > everywhere in the text. How would your proposed character be displayed as plain text? >>> - A codepoint for string literal quotes, that would spare one the >>> escaping. >> >> How would this work exactly? > > Imagine you type " in your IDE, but because your IDE does know that this new > programming language requires this special character as literal token, it > replaces it with a special looking quotation mark. Unicode is a standard for plain text. If you require a special IDE for your programming language then why use plain text at all? From ritt.ks at gmail.com Sun Feb 8 17:27:59 2015 From: ritt.ks at gmail.com (Konstantin Ritt) Date: Mon, 9 Feb 2015 03:27:59 +0400 Subject: Unicode block for programming related symbols and codepoints? 
In-Reply-To: References: <54D7C3EA.6080000@web.de> <54D7E2E2.6080705@web.de> Message-ID: > My proposal on the other hand - if implemented right - introduces some really intuitive looking and easy to input characters, Easier than latin1, a layout one could find on [almost] every keyboard? Good luck. Konstantin 2015-02-09 2:54 GMT+04:00 Pierpaolo Bernardi : > On Sun, Feb 8, 2015 at 11:27 PM, Alfred Zett wrote: > > > That was exactly my thought, so I figured it couldn't harm to have these > > >> a Tab is exactly what you described. > > > > No. It's only half of what I described. > > It's still a typographical character that implies whitespace and may > appear > > everywhere in the text. > > How would your proposed character be displayed as plain text? > > >>> - A codepoint for string literal quotes, that would spare one the > >>> escaping. > >> > >> How would this work exactly? > > > > Imagine you type " in your IDE, but because your IDE does know that this > new > > programming language requires this special character as literal token, it > > replaces it with a special looking quotation mark. > > Unicode is a standard for plain text. If you require a special IDE > for your programming language then why use plain text at all? > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jf at colson.eu Sun Feb 8 17:36:19 2015 From: jf at colson.eu (=?UTF-8?B?SmVhbi1GcmFuw6dvaXMgQ29sc29u?=) Date: Mon, 09 Feb 2015 00:36:19 +0100 Subject: Unicode block for programming related symbols and codepoints? 
In-Reply-To: <54D7DE38.7090300@web.de>
References: <54D7C3EA.6080000@web.de> <54D7D37B.8090900@colson.eu> <54D7DE38.7090300@web.de>
Message-ID: <54D7F2F3.5080102@colson.eu>

On 08/02/15 23:07, Alfred Zett wrote:
> Hi Jean-Francois Colson,
>>>
>>> - A codepoint for string literal quotes, that would spare one the
>>> escaping.
>>
>> I rarely escape quotes.
>> In a text, I use ’ (U+2019) as an apostrophe and ?????? as quotes, so
>> I don’t need to escape them.
>> When I use PHP to generate some HTML code, I try to alternate simple
>> and double quotes as much as possible. That way I rarely need to
>> escape them.
> OK, but that's just your scenario. With a language design from the
> past. With probably an editor from the past that allows non-unicode
> encodings.

????? That’s mainly with gcc on GNU/Linux with a UTF-8 locale or with PHP with a in the XHTML document.

> In a better world, manual code point inserting was a last resort.

What do you call “manual inserting”?

>
> Imagine someone wants to make his text look like written with a
> typewriter.

That’s a very special case and a few \ are not a big problem.

You could use existing characters as “string literal quotes”.

I’ve never used APL so I don’t remember the meanings of its symbols, but couldn’t ⍘ U+2358 APL FUNCTIONAL SYMBOL QUOTE UNDERBAR or ⍞ U+235E APL FUNCTIONAL SYMBOL QUOTE QUAD work as “string literal quotes” in a new programming language?

>
>> Aren’t you trying to reinvent APL?
>>
> No. APL places a lot of alien-looking, annoying characters to anyone
> except mathematicians into your code that are hard to input.

Hard to input? Not harder than the new symbols you’d like to propose.
That’s only a matter of keyboard layout and input method.

> In particular from the context.
>
> My proposal on the other hand - if implemented right - introduces some
> really intuitive looking and easy to input characters,

In what way would they be easier to input?
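
[Editorial illustration] The suggestion above, to reuse an existing character such as U+235E as a string-literal delimiter, can be tried out in a few lines. This is only a sketch under stated assumptions: the toy language, the function name `string_literals` and the choice of U+235E are hypothetical and not taken from any real language or from the thread.

```python
# Hypothetical delimiter: U+235E APL FUNCTIONAL SYMBOL QUOTE QUAD.
QUOTE = "\u235E"


def string_literals(source: str) -> list:
    """Return the contents of every QUOTE...QUOTE literal in `source`.

    ASCII quotes and apostrophes inside a literal need no escaping,
    because they are not the delimiter.
    """
    parts = source.split(QUOTE)
    # An odd number of parts means every opening QUOTE has a closing one.
    if len(parts) % 2 == 0:
        raise ValueError("unterminated string literal")
    # Odd indices hold the text between an opening and a closing QUOTE.
    return parts[1::2]


if __name__ == "__main__":
    code = "print " + QUOTE + "He said \"it's fine\"" + QUOTE
    print(string_literals(code))
```

Note that this only moves the problem: a literal still cannot contain the delimiter itself without some escaping rule, and how to type the delimiter remains a matter of keyboard layout and input method, as pointed out above.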
> because a bold arrow at the left doesn't need further explanation and
> your IDE of the future can easily place them when pressing tab in the
> right position.

If the IDE inputs your new character when you press tab, then your new character is a tab?

From jf at colson.eu Sun Feb 8 18:04:10 2015
From: jf at colson.eu (Jean-François Colson)
Date: Mon, 09 Feb 2015 01:04:10 +0100
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To:
References: <54D7C3EA.6080000@web.de> <54D7E2E2.6080705@web.de>
Message-ID: <54D7F97A.1060909@colson.eu>

On 09/02/15 00:27, Konstantin Ritt wrote:
> > My proposal on the other hand - if implemented right - introduces
> > some really intuitive looking and easy to input characters,
>
> Easier than latin1, a layout one could find on [almost] every
> keyboard? Good luck.

Latin-1 is not a keyboard layout, it’s a character set: ISO/IEC 8859-1.

Latin-1 is not available on almost every keyboard:
It is not available on most US keyboards, except for the minority who use a US international driver;
It is not available on most Russian keyboards, which only provide Cyrillic letters and ASCII (unaccented) Latin letters;
It is not fully available on many Western European keyboards (with a French azerty keyboard on M$ Windows, using the default driver, you have no way to type a capital É or a capital Ç except by typing Alt + 0 2 0 1 or Alt + 0 1 9 9);
It is not available on Central and Eastern European keyboards (to the east of Germany, Latin-2);
It is not available on Maltese or Turkish keyboards (Latin-3);
It is not available on keyboards of the Baltic countries (Latin-4);
Etc.

>
> Konstantin
>
> 2015-02-09 2:54 GMT+04:00 Pierpaolo Bernardi:
>
> > On Sun, Feb 8, 2015 at 11:27 PM, Alfred Zett wrote:
> >
> > > That was exactly my thought, so I figured it couldn't harm to have these
> >
> > >> a Tab is exactly what you described.
> > >
> > > No. It's only half of what I described.
> > It's still a typographical character that implies whitespace and > may appear > > everywhere in the text. > > How would your proposed character be displayed as plain text? > > >>> - A codepoint for string literal quotes, that would spare one the > >>> escaping. > >> > >> How would this work exactly? > > > > Imagine you type " in your IDE, but because your IDE does know > that this new > > programming language requires this special character as literal > token, it > > replaces it with a special looking quotation mark. > > Unicode is a standard for plain text. If you require a special IDE > for your programming language then why use plain text at all? > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > > > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode -------------- next part -------------- An HTML attachment was scrubbed... URL: From alfred_z at web.de Mon Feb 9 06:55:02 2015 From: alfred_z at web.de (Alfred Zett) Date: Mon, 09 Feb 2015 13:55:02 +0100 Subject: Unicode block for programming related symbols and codepoints? In-Reply-To: References: <54D7C3EA.6080000@web.de> Message-ID: <54D8AE26.3030409@web.de> OK, I will now try to answer all of you in one mail, otherwise it gets hard to overlook... Shervin Afshar: > All of the requirements mentioned here can be (and are) implemented in > higher levels of software (like IDEs). IMO, there isn't any need for > adding new characters to Unicode to address these issues. But then it would be incompatible from IDE to IDE, like Python is incompatible using 2 spaces, 4 spaces and tabs. It's the data that is important, not the software. > > Additionally, people tend to forget that simply because Unicode is > doing emoji out of compatibility (or other) requirements, it does not > mean that "now anything goes". 
> I refer folks to TR51[1] (specifically
> sections 1.3, 8, and Annex C).
>
> [1]: http://www.unicode.org/reports/tr51
>

You know, the fact that this consortium ever took emoji into consideration immediately justifies including everything everyone ever wanted. There is no such thing as important data including emoji. :)

Jean-Francois Colson:
> I need a few tens of characters for a conlang I’m developping. ?

Except two or three control characters don't make a con language.
Also, if you don't like con languages in Unicode, what's this: http://unicode.org/charts/PDF/U1F700.pdf

> The problem is that Unicode only encodes characters which are
> effectively used today or which have been used in the past. It doesn’t
> encode characters which could perhaps be used in a hypothetical new
> programing language in the future.

So you want the font encoding scheme to be a limiting factor for new things?

Pierpaolo Bernardi:
> How would your proposed character be displayed as plain text?

There is no such thing as plain text. Even line breaks and tabs are a matter of interpretation. It's just that they usually have typographic semantics, even in programming editors, with all the side effects.

In very simple (and with that I mean shitty or not even remotely programming oriented) editors, it may show like a control character, like ?.

Browsers and any editor passing the "based on scintilla" complexity mark of course should display something that makes more sense, like an arrow or ? plus surrounding space.

> Unicode is a standard for plain text. If you require a special IDE
> for your programming language then why use plain text at all?

Because binary custom encoded databases or blob files are the death of interoperability.

Konstantin Ritt:
> Easier than latin1, a layout one could find on [almost] every
> keyboard? Good luck.

Also:

Jean-Francois Colson:
> Hard to input? Not harder than the new symbols you’d like to propose.
> That’s only a matter of keyboard layout and input method.
Indent by pressing tab and insert the literal thing by pressing ". Nothing changes, the IDE/editor does the work on the fly. Just that you have clean semantics, interoperability and customizability. Beat that, APL, where you would need more than 10 key bindings or an annoying software keyboard.

> I’ve never used APL so I don’t remember the meanings of its symbols,
> but couldn’t ⍘ U+2358 APL FUNCTIONAL SYMBOL QUOTE UNDERBAR or ⍞ U+235E
> APL FUNCTIONAL SYMBOL QUOTE QUAD work as “string litteral quotes” in a
> new programming language?

That's a good idea. That still leaves the indentation character, which is harder than that, because one would want a control character with certain semantics.
E.g.: for programming editors it would make sense to only allow it after line breaks and convert other occurrences into tabs.

> If the IDE inputs your new character when you press tab, then your new
> character is a tab?

Not if it detects the beginning of a line.

Best regards
A. Z.

From frederic.grosshans at gmail.com Mon Feb 9 08:08:39 2015
From: frederic.grosshans at gmail.com (Frédéric Grosshans)
Date: Mon, 09 Feb 2015 15:08:39 +0100
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To: <54D8AE26.3030409@web.de>
References: <54D7C3EA.6080000@web.de> <54D8AE26.3030409@web.de>
Message-ID: <54D8BF67.3050100@gmail.com>

On 09/02/2015 13:55, Alfred Zett wrote:
>
>> Additionally, people tend to forget that simply because Unicode is
>> doing emoji out of compatibility (or other) requirements, it does not
>> mean that "now anything goes". I refer folks to TR51[1] (specifically
>> sections 1.3, 8, and Annex C).
>>
>> [1]: http://www.unicode.org/reports/tr51
>>
> You know, the fact that this consortium ever took emoji into
> consideration immediately justifies to include everything everyone
> ever wanted. There is no such thing as important data including emoji.
> :)

The inclusion of emoji was the subject of considerable debate here, with people strongly against and strongly for. The trick is that they were already used as digital characters by Japanese telcos and their millions of customers. They were de facto encoded as characters in Japanese text messages. At the time of encoding, the spread of smartphones made them appear in other places (emails, web forums, etc.)

>
>
> Jean-Francois Colson:
>> I need a few tens of characters for a conlang I’m developping. ?
> Except two or three control characters don't make a con language.
> Also, if you don't like con languages in Unicode, what's this:
> http://unicode.org/charts/PDF/U1F700.pdf

I doubt that “not liking con languages” is a faithful description of Jean-François ;-)

On a more serious note, this block is actually a set of notations, “scientific” at the time, used by Isaac Newton. They were encoded in Unicode following an academic project to digitize his manuscripts. So here you have characters used three centuries ago by no less than Isaac Newton, most of them having a much longer history, and useful for historians of science. See http://www.unicode.org/L2/L2009/09037r2-alchemy.pdf for details.

This does not compare with a few characters invented for a conlang invented by an amateur and used by no one but himself. I think that is the point Jean-François wanted to make. A closer counter-example to Jean-François's “wish” would be Shavian (10450..1047F), but this alphabet has seen some use, and I guess that its encoding would have been much harder without its association with someone as famous as George Bernard Shaw or without the existing publication of a full text in Shavian.

>
>> The problem is that Unicode only encodes characters which are
>> effectively used today or which have been used in the past. It
>> doesn’t encode characters which could perhaps be used in a
>> hypothetical new programing language in the future.
> So you want the font encoding scheme to be a limitating factor for new
> things?

That is more or less the rule, except that it is not a font encoding but a standard encoding. Once something is encoded, it can never be unencoded. And the Unicode standard is built to stay relevant as long as possible (decades or centuries). So you are asking for your character to be encoded in billions of devices for decades. It is more than a mere font encoding.

There are a few exceptions, but only when widespread use is really expected, as for monetary symbols (this was the case for the Euro).

What you are asking for is a character for an untested idea. You are convinced it is useful, but cannot prove that anyone beyond yourself will use it, hence Jean-François’s parallel with conlangs.

In order to have a chance of success, design a language using existing characters (e.g. some APL + ? for TAB) and/or private use codepoints. Once your language starts gathering steam, come back and argue that using an arrow or a tab is awkward, and that U+XXXX SHINY TAB FOR PROGRAMMERS would be an improvement for a significant community. I know it is a lot of work, but that is probably what it takes.

>
> Pierpaolo Bernardi:
>> How would your proposed character be displayed as plain text?
> There is no such thing as plain text.

When you say that, you don’t accept the premise of Unicode encoding. Unicode’s goal is to encode all plain text characters, but only plain text characters.

> Even line breaks and tabs are a matter of interpretation. It's just
> that they usually have typographic semantics, even in programming
> editors, with all the side effects.
>
> In very simple (and with that I mean shitty or not even remotely
> programming oriented) editors, it may show like a control character,
> like ?.
>
> Browsers and any editor passing the "based on scintilla" complexity
> mark of course should display something that makes more sense, like an
> arrow or ? plus surrounding space.
I think everyone here knows what you are saying, and that the notion of plain text is a bit fuzzy. But if you cannot argue that your character has a meaning in plain text, for some value of “plain text”, then you cannot hope for an encoding in Unicode.

From alfred_z at web.de Mon Feb 9 08:57:15 2015
From: alfred_z at web.de (Alfred Zett)
Date: Mon, 09 Feb 2015 15:57:15 +0100
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To: <54D8BF67.3050100@gmail.com>
References: <54D7C3EA.6080000@web.de> <54D8AE26.3030409@web.de> <54D8BF67.3050100@gmail.com>
Message-ID: <54D8CACB.1060308@web.de>

Frédéric Grosshans:
> On 09/02/2015 13:55, Alfred Zett wrote:
>>
>>> Additionally, people tend to forget that simply because Unicode is
>>> doing emoji out of compatibility (or other) requirements, it does
>>> not mean that "now anything goes". I refer folks to TR51[1]
>>> (specifically sections 1.3, 8, and Annex C).
>>>
>>> [1]: http://www.unicode.org/reports/tr51
>>>
>> You know, the fact that this consortium ever took emoji into
>> consideration immediately justifies to include everything everyone
>> ever wanted. There is no such thing as important data including
>> emoji. :)
> The including of emoji was a considerable debate here, with people
> strongly against and strongly for. The trick is that they were already
> used as digital characters by Japanese Telcos and their millions of
> customers. They were de facto encoded as characters in Japanese text
> messages. At the time of encoding, the spread of smartphones made them
> appear in other places (emails, web forums, etc.)
>

The trick is that one doesn't bargain with telcos and similar criminals. Gotta drop them hard and the pest will go away by itself after five years or so.

>> Jean-Francois Colson:
>>> I need a few tens of characters for a conlang I’m developping. ?
>> Except two or three control characters don't make a con language.
>> Also, if you don't like con languages in Unicode, what's this:
>> http://unicode.org/charts/PDF/U1F700.pdf
> I doubt that "not liking con languages" is a faithful description of Jean-François ;-)
>
> On a more serious note, this block is actually a set of "scientific" (at the time) notations used by Isaac Newton. They were encoded in Unicode following an academic project to digitize his manuscripts. So here, you have characters used 3 centuries ago by no less than Isaac Newton, most of them having a much longer history, and useful for science historians. See http://www.unicode.org/L2/L2009/09037r2-alchemy.pdf for details.
>
That's actually interesting. Good to know, thanks.

> I think everyone here knows what you are saying, and that the notion of plain text is a bit fuzzy. But if you cannot argue that your character has a meaning in plain text, for some value of "plain text", then you cannot hope for an encoding in Unicode.
>
OK, in this case I agree it makes little sense to hope for such characters.

Best regards,
A. Z.

From john at mitre.org Mon Feb 9 09:37:38 2015
From: john at mitre.org (John D Burger)
Date: Mon, 9 Feb 2015 10:37:38 -0500
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To: <54D7DA7E.6010009@colson.eu>
References: <54D7C3EA.6080000@web.de> <54D7DA7E.6010009@colson.eu>
Message-ID: <6EE199C2-134D-4A63-91D4-DBF75B2C85CA@mitre.org>

>> - Indentation codepoint, with no fixed defined graphical representation. For indentation based programming languages.
>
> That wouldn't be compliant with existing languages and future languages might use any existing character.
>
>> Because:
>> -- specific clients may want to show it differently (for example as arrows, lines etc., using another color):
>
> Can't good editors display tabs in a different color when required?

Lots of them already do, e.g. Emacs in various modes.
- John Burger
  MITRE

>
>> --- browsers could let the web page creator decide the visual representation (character and size) via CSS
>> --- the same with editors, independent from the actual font
>> --- in case of visual impairment, the user could even change the acoustic representation if the editor allows it
>> -- unlike a space symbol, it wouldn't need more than one character per indentation
>> -- unlike tabs or space, it wouldn't be whitespace
>> -- unlike normal arrow characters, one could customize the length in an editor and wouldn't have to insert extra spaces for a better visual imagery
>>
>> - A codepoint for string literal quotes, that would spare one the escaping.
>
> I rarely escape quotes.
> In a text, I use ’ (U+2019) as an apostrophe and ?????? as quotes, so I don't need to escape them.
> When I use PHP to generate some HTML code, I try to alternate single and double quotes as much as possible. That way I rarely need to escape them.
>
>> - A statement separator symbol.
>
> To replace the semicolon in C and the languages based on its syntax?
>
>> - Other ideas?
>
> Aren't you trying to reinvent APL?
>
>> You may now think, this is highly specific and you are right.
>> However, so are EMOJI signs, in particular those like PINE DECORATION.
>>
>> These days, there are a lot of tools to create small embedded scripting languages and DSLs, which are used in-program in special editors. And there are a lot of people using them.
>> Exactly these could really profit from such a codeblock instead of using conventional ASCII subset characters.
>> Also, there is a lot of potential with really good text editors and IDEs where semantics may matter a lot.
>>
>> Excuse my English, I hope this was understandable.
>>
>> Best regards,
>>
>> A. Z.
From wjgo_10009 at btinternet.com Mon Feb 9 04:48:18 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Mon, 9 Feb 2015 10:48:18 +0000 (GMT)
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To: <54D7E707.10104@colson.eu>
References: <54D7C3EA.6080000@web.de> <54D7D37B.8090900@colson.eu> <54D7DE38.7090300@web.de> <54D7E707.10104@colson.eu>
Message-ID: <30873394.16253.1423478898115.JavaMail.defaultUser@defaultHost>

> The problem is that Unicode only encodes characters which are effectively used today or which have been used in the past.

Well, that was the case, but the situation appears to be changing. There is my feedback note that is the last item in the following linked document.

http://www.unicode.org/L2/L2015/15019-pubrev.html

> It doesn't encode characters which could perhaps be used in a hypothetical new programming language in the future.

Well, that was the case and might still be the case. We will only find out for sure, and then only for a particular case, when a situation arises where the Unicode Technical Committee rules on a petition submitted to the committee requesting the encoding of some such characters.

The fact that the rules over what can be encoded are changing rapidly opens up great possibilities for future developments from ideas put forward from the community.

If the changes in policy continue then this will be very beneficial to progress, as a regular Unicode encoding makes an encoding of free equal use for all with no proprietary aspect to the encoding.
William Overington

9 February 2015

From A.Schappo at lboro.ac.uk Mon Feb 9 10:41:14 2015
From: A.Schappo at lboro.ac.uk (Andre Schappo)
Date: Mon, 9 Feb 2015 16:41:14 +0000
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To: <54D7C3EA.6080000@web.de>
References: <54D7C3EA.6080000@web.de>
Message-ID: <9B99B8A6-BB16-4DC0-A193-5D5869274040@lboro.ac.uk>

I think this is a very good idea. There are so many multiple uses of ASCII characters in programming languages that really do need sorting out. The fundamental separation of character semantics and glyph visual representation works really well for this proposal.

Let me take as an example the use of = in programming. The = is used for test of equality and assignment in various programming languages. The equality and assignment operations should have different characters. e.g.

U+XXX1 TEST FOR EQUALITY
U+XXX2 ASSIGNMENT OPERATOR

Initially the glyphs used for these characters could be = but then this mechanism can be used to transition to a new and less ambiguous visual representation. The new visual representation could be something like

U+XXX1 TEST FOR EQUALITY =
U+XXX2 ASSIGNMENT OPERATOR ?

Such a visual and character distinction between the 2 functions must surely make it easier for those learning to program and for interpreter and compiler writers. I think it would also make for easier to read/understand program code.

André

On 8 Feb 2015, at 20:15, Alfred Zett wrote:

Hello everyone,

is there such a Unicode block for programming related codepoints? Conventional search engines as well as Wolfram Alpha can't answer that, with the former one leading to all the programming problems that occur...

If such a block doesn't exist, I'd like to make a proposal - if possible - to add one with at least the following codepoints/characters:

- Indentation codepoint, with no fixed defined graphical representation. For indentation based programming languages.
Because:
-- specific clients may want to show it differently (for example as arrows, lines etc., using another color):
--- browsers could let the web page creator decide the visual representation (character and size) via CSS
--- the same with editors, independent from the actual font
--- in case of visual impairment, the user could even change the acoustic representation if the editor allows it
-- unlike a space symbol, it wouldn't need more than one character per indentation
-- unlike tabs or space, it wouldn't be whitespace
-- unlike normal arrow characters, one could customize the length in an editor and wouldn't have to insert extra spaces for a better visual imagery

- A codepoint for string literal quotes, that would spare one the escaping.

- A statement separator symbol.

- Other ideas?

You may now think, this is highly specific and you are right.
However, so are EMOJI signs, in particular those like PINE DECORATION.

These days, there are a lot of tools to create small embedded scripting languages and DSLs, which are used in-program in special editors. And there are a lot of people using them.
Exactly these could really profit from such a codeblock instead of using conventional ASCII subset characters.
Also, there is a lot of potential with really good text editors and IDEs where semantics may matter a lot.

Excuse my English, I hope this was understandable.

Best regards,

A. Z.

http://twitter.com/andreschappo
http://schappo.blogspot.co.uk
http://weibo.com/andreschappo
http://blog.sina.com.cn/andreschappo

From martin at v.loewis.de Mon Feb 9 11:11:47 2015
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Mon, 09 Feb 2015 18:11:47 +0100
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To: <54D8CACB.1060308@web.de>
References: <54D7C3EA.6080000@web.de> <54D8AE26.3030409@web.de> <54D8BF67.3050100@gmail.com> <54D8CACB.1060308@web.de>
Message-ID: <54D8EA53.5010907@v.loewis.de>

Am 09.02.15 um 15:57 schrieb Alfred Zett:
> That's actually interesting. Good to know, thanks.
>> I think everyone here knows what you are saying, and that the notion of plain text is a bit fuzzy. But if you cannot argue that your character has a meaning in plain text, for some value of "plain text", then you cannot hope for an encoding in Unicode.
>>
> OK, in this case I agree it makes little sense to hope for such characters.

That Unicode encodes "plain text" is indeed in its fundamentals (see 2.2, Unicode Design Principles). Also, the Criteria for Encoding Symbols speak against your characters, on the grounds of Jean-François's objections:

http://www.unicode.org/pending/symbol-guidelines.html

"The fact that a symbol merely "seems to be useful or potentially useful" is precisely not a reason to code it. Demonstrated usage, or demonstrated demand, on the other hand, does constitute a good reason to encode the symbol."

So if you can't demonstrate usage, you should at least demonstrate demand (rather than just claiming that there might be demand).

The canonical examples of adding symbols with no demonstrated usage are apparently the currency symbols, where it is easy to demonstrate demand (by referring to the legislation that brings the currency to life).

Welcome the NEW DRACHMA SIGN :-)

Regards,
Martin

From alfred_z at web.de Mon Feb 9 11:53:43 2015
From: alfred_z at web.de (Alfred Zett)
Date: Mon, 09 Feb 2015 18:53:43 +0100
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To: <13507203.47311.1423496435218.JavaMail.defaultUser@defaultHost>
References: <54D7C3EA.6080000@web.de> <13507203.47311.1423496435218.JavaMail.defaultUser@defaultHost>
Message-ID: <54D8F427.6070106@web.de>

@ John D Burger:
And all of a sudden a war is waged over what counts as a good editor. :D

@ Andre Schappo:
That's a good idea. We need it in the name of science and education. :D

William_J_G Overington:
> Hi
>
> You might like the following post.
>
> http://www.unicode.org/mail-arch/unicode-ml/y2010-m06/0001.html
>
> William
>
Hi, I'm really not sure what this is about, but it seems like an interface to deliver instructions to the rendering VM?

Martin v. Löwis:
> So if you can't demonstrate usage, you should at least demonstrate demand (rather than just claiming that there might be demand).

The problem is, you can't do that with the topic at hand. Because most programmers don't even see the possibilities. It's like asking a blind person what colors look like. Although that may sound kind of arrogant.

Among language designers and people interested in stuff like this, there is only a small fraction that doesn't hold the ill-minded opinion that syntax doesn't matter at all.
Among those who care for syntax there is only a small fraction that really knows enough about Unicode. And who can blame them, I still see broken characters on a weekly basis.
Among those there is only a small fraction that cares enough.
Among those there is only a small fraction that has the nerves/balls to put up with a consortium.

This small subset is a handful of people, like André, me and maybe 3 other persons.

I don't really feel comfortable sounding that elitist, but in this case I dare say that the consortium shouldn't care about established popularity, the same way they should have handled emoji characters.

Best regards
A. Z.
From andrea.giammarchi at gmail.com Mon Feb 9 11:54:18 2015
From: andrea.giammarchi at gmail.com (Andrea Giammarchi)
Date: Mon, 9 Feb 2015 18:54:18 +0100
Subject: About cultural/languages communities flags
Message-ID:

Hello everyone,

I've had an interesting request [1] that makes sense to me, but I'd like to understand Unicode's position on it.

The TL;DR version of the request is the following:

There are communities, let's take Scottish people as an example, that even have a domain but not an emoji flag. Some flag-related projects adopted more than what we have now in emoji, including 239 flags: http://www.famfamfam.com/archive/flag-icons-released/

The proposal is quite simple, and I am quoting from the request:

> if a cultural/language TLD is typed with Unicode RIS, then show the flag for these culture/language:

???????? --> it shows Scottish flag
?????????? --> it shows a Welsh flag
?????? --> it shows a Breton flag
?????? --> it shows Catalan flag
?????? --> it shows a Basque flag
?????? --> it shows a Galician flag

Thanks in advance for any sort of outcome.

Best Regards

[1] https://github.com/twitter/twemoji/issues/40

From kenwhistler at att.net Mon Feb 9 12:17:09 2015
From: kenwhistler at att.net (Ken Whistler)
Date: Mon, 09 Feb 2015 10:17:09 -0800
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To: <9B99B8A6-BB16-4DC0-A193-5D5869274040@lboro.ac.uk>
References: <54D7C3EA.6080000@web.de> <9B99B8A6-BB16-4DC0-A193-5D5869274040@lboro.ac.uk>
Message-ID: <54D8F9A5.9070302@att.net>

I think this discussion is confusing the need for separate syntactic functions in formal language definitions with the need for *encoding* of characters.

The distinction between assignment and test for equality has been around for decades in formal languages, and of course it is almost always carefully distinguished in the formal syntax:

C, C++ and kindred
    Use "=" for assignment.
    Use "==" for equivalence operator.

Pascal and kindred
    Use ":=" for assignment.
    Use "=" for equivalence operator.

Lisp
    Assignment: let (a 6)
    Equivalence evaluation: (= a 6)

And so on. The fact that these formal languages do not use a *single* distinct character for each of these syntactic functions is not a formal defect -- there are many, many concepts in formal languages which are defined using sequences of characters, rather than a single character.

As has already been alluded to in this thread, trying to stack all functionality into single character definitions heads back in the direction of relatively illegible APL program text. It might have its place, but isn't much of a choice for widely used general programming languages.

There are two basic issues with using sequences of (typically ASCII) characters for fundamental operators:

1. It marginally complicates parsing.

2. If chosen badly, they can confuse programmers using the syntax.

#1 is basically trivial, as long as the formal syntax passes the bar of not introducing syntactic ambiguity.

#2 is the *real* problem, imo. The use in C of "=" and "==" was badly designed from the start, and is the source of bezillions of inadvertent programming errors in practice.

But if a left arrow, for example, might be a better choice for an assignment operator in a programming language, and a two-character ASCII operator like ":=" or "<-" doesn't seem appropriate or causes other confusion, there still isn't a character *encoding* issue here. Just use "←", which already exists (U+2190), and is a fine left arrow!

What is *not* appropriate for Unicode consideration here is trying to encode programming *functions* per se. That turns the problem on its head really. There are lots and lots of symbols already defined in the standard: it is the job of formal language designers to simply pick from them and *define* their formal functions in their language design.
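[Editorial illustration] The claim in point #1 above, that multi-character operators only marginally complicate parsing, can be sketched with a small longest-match tokenizer. This is a hypothetical toy, not taken from any real compiler: the operator table, the token kinds, and the function name `tokenize` are all assumptions made for illustration. It also treats the existing character U+2190 as just another operator, which needs no new encoding.

```python
import re

# Hypothetical operator table for a toy language. Longer operators come
# first so that "==" is never read as two separate "=" tokens.
OPERATORS = ["==", ":=", "<-", "\u2190", "="]

TOKEN_RE = re.compile(
    r"\s*(?:(?P<num>\d+)|(?P<name>[A-Za-z_]\w*)|(?P<op>"
    + "|".join(re.escape(op) for op in OPERATORS)
    + r"))"
)

def tokenize(source):
    """Split source into (kind, text) pairs, preferring the longest operator."""
    tokens = []
    pos = 0
    source = source.rstrip()
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        kind = match.lastgroup          # name of the alternative that matched
        tokens.append((kind, match.group(kind)))
        pos = match.end()
    return tokens

print(tokenize("a == 6"))
print(tokenize("a \u2190 6"))
```

The only design decision that matters here is the ordering of the operator table; everything else is ordinary lexing, whichever characters a language designer picks.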
Just because the UTC occasionally invents new control functions and encodes them in characters -- as for the bidirectional algorithm -- does not mean that every new function conceived for a programming language is automatically a character encoding problem.

Coming to the UTC looking to encode a "new functional character" on spec should be a matter of *last* resort -- not a first resort. It requires a carefully built case demonstrating a real use and showing that alternative approaches using existing characters do not (and cannot) work.

--Ken

P.S. Arrow symbols like U+2190 have been in the Unicode Standard since Unicode 1.0 in 1991. They are far, far more widely supported nowadays than any new, language-specific functional symbol addition would be. Even if the UTC agreed to such character additions at the next meeting in May, its earliest opportunity for publication would be Unicode 10 in June, 2017. That amounts to a 26 year impedance mismatch for implementations. Why would a designer of a new formal language syntax want to buy into that kind of grief for character availability, when there are hundreds of symbols in the standard to choose from that have been encoded for decades now?

On 2/9/2015 8:41 AM, Andre Schappo wrote:
>
> Let me take as an example the use of = in programming. The = is used for test of equality and assignment in various programming languages. The equality and assignment operations should have different characters. e.g.
>
> U+XXX1 TEST FOR EQUALITY
> U+XXX2 ASSIGNMENT OPERATOR
>
> Initially the glyphs used for these characters could be = but then this mechanism can be used to transition to a new and less ambiguous visual representation. The new visual representation could be something like
>
> U+XXX1 TEST FOR EQUALITY =
> U+XXX2 ASSIGNMENT OPERATOR ?
>
> Such a visual and character distinction between the 2 functions must surely make it easier for those learning to program and for interpreter and compiler writers.
> I think it would also make for easier to read/understand program code.
>
> André

From markus.icu at gmail.com Mon Feb 9 13:16:37 2015
From: markus.icu at gmail.com (Markus Scherer)
Date: Mon, 9 Feb 2015 11:16:37 -0800
Subject: About cultural/languages communities flags
In-Reply-To:
References:
Message-ID:

On Mon, Feb 9, 2015 at 9:54 AM, Andrea Giammarchi <andrea.giammarchi at gmail.com> wrote:

> > if a cultural/language TLD is typed with Unicode RIS, then show the flag for these culture/language:

This does not work. The "Unicode RIS" are defined to be used in pairs, with semantics according to corresponding ISO 3166 alpha2 codes. In your examples, each successive pair will encode a flag.

If you want to represent every flag of every locality, you first have to figure out how to catalog and label them. You are mentioning provinces, one level down from nation states; I guess there are thousands of them. In much of Europe, every little village has its own flag and coat of arms. Where do you want the text encoding and fonts to stop?

markus

From shervinafshar at gmail.com Mon Feb 9 13:23:15 2015
From: shervinafshar at gmail.com (Shervin Afshar)
Date: Mon, 9 Feb 2015 11:23:15 -0800
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To: <54D8AE26.3030409@web.de>
References: <54D7C3EA.6080000@web.de> <54D8AE26.3030409@web.de>
Message-ID:

> But then it would be incompatible from IDE to IDE, like Python is incompatible using 2 spaces, 4 spaces and tabs.
> It's the data that is important, not the software.

Specifically talking about Python, we should not solve in Unicode what PEP 8[1] is intended for. Pythonistas and their IDEs are encouraged to use linters to address syntactical discrepancies. This, more or less, applies to other programming languages as well.
[1]: https://www.python.org/dev/peps/pep-0008/#tabs-or-spaces

> You know, the fact that this consortium ever took emoji into consideration immediately justifies to include everything everyone ever wanted. There is no such thing as important data including emoji. :)

If you read the background information (in TR51 or elsewhere) on Unicode emoji, you will see how the common and widespread use of PUA by Japanese providers introduced interoperability issues with the rest of the world. And no... Addressing that major compatibility/interoperability issue (and any future issue arising from addressing it) does not justify inclusion of "everything everyone ever wanted".

Shervin

On Mon, Feb 9, 2015 at 4:55 AM, Alfred Zett wrote:

> OK, I will now try to answer all of you in one mail, otherwise it gets hard to keep track...
>
> Shervin Afshar:
>
>> All of the requirements mentioned here can be (and are) implemented in higher levels of software (like IDEs). IMO, there isn't any need for adding new characters to Unicode to address these issues.
>
> But then it would be incompatible from IDE to IDE, like Python is incompatible using 2 spaces, 4 spaces and tabs.
> It's the data that is important, not the software.
>
>> Additionally, people tend to forget that simply because Unicode is doing emoji out of compatibility (or other) requirements, it does not mean that "now anything goes". I refer folks to TR51[1] (specifically sections 1.3, 8, and Annex C).
>>
>> [1]: http://www.unicode.org/reports/tr51
>
> You know, the fact that this consortium ever took emoji into consideration immediately justifies to include everything everyone ever wanted. There is no such thing as important data including emoji. :)
>
> Jean-Francois Colson:
>
>> I need a few tens of characters for a conlang I'm developing.
>
> Except two or three control characters don't make a con language.
> Also, if you don't like con languages in Unicode, what's this:
> http://unicode.org/charts/PDF/U1F700.pdf
>
>> The problem is that Unicode only encodes characters which are effectively used today or which have been used in the past. It doesn't encode characters which could perhaps be used in a hypothetical new programming language in the future.
>
> So you want the font encoding scheme to be a limiting factor for new things?
>
> Pierpaolo Bernardi:
>
>> How would your proposed character be displayed as plain text?
>
> There is no such thing as plain text.
> Even line breaks and tabs are a matter of interpretation. It's just that they usually have typographic semantics, even in programming editors, with all the side effects.
>
> In very simple (and with that I mean shitty or not even remotely programming oriented) editors, it may show like a control character, like ?.
>
> Browsers and any editor passing the "based on scintilla" complexity mark of course should display something that makes more sense, like an arrow or ? plus surrounding space.
>
>> Unicode is a standard for plain text. If you require a special IDE for your programming language then why use plain text at all?
>
> Because binary custom encoded databases or blob files are the death of interoperability.
>
> Konstantin Ritt:
>
>> Easier than latin1, a layout one could find on [almost] every keyboard? Good luck.
>
> Also:
>
> Jean-Francois Colson:
>
>> Hard to input? Not harder than the new symbols you'd like to propose. That's only a matter of keyboard layout and input method.
>
> Indent by pressing tab and insert the literal thing by pressing ". Nothing changes, the IDE/editor does the work on the fly.
> Just that you have clean semantics, interoperability and customizability.
>
> Beat that, APL. Where you would need >10 key bindings or an annoying software keyboard.
>
>> I've never used APL so I don't remember the meanings of its symbols, but couldn't ⍘
U+2358 APL FUNCTIONAL SYMBOL QUOTE UNDERBAR or ⍞ U+235E APL FUNCTIONAL SYMBOL QUOTE QUAD work as "string literal quotes" in a new programming language?
>
> That's a good idea.
>
> That still leaves the indentation character, which is harder than that, because one would want a control character with certain semantics.
> E.G.: For programming editors it would make sense to only allow it after line breaks and convert other occurrences into tabs.
>
>> If the IDE inputs your new character when you press tab, then your new character is a tab?
>
> Not if it detects the beginning of a line.
>
> Best regards
>
> A. Z.

From doug at ewellic.org Mon Feb 9 13:25:30 2015
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 09 Feb 2015 12:25:30 -0700
Subject: Unicode block for programming related symbols and codepoints?
Message-ID: <20150209122530.665a7a7059d7ee80bb4d670165c8327d.98030114b2.wbe@email03.secureserver.net>

Frédéric Grosshans wrote:

> The inclusion of emoji was a considerable debate here, with people strongly against and strongly for. The trick is that they were already used as digital characters by Japanese Telcos and their millions of customers. They were de facto encoded as characters in Japanese text messages. At the time of encoding, the spread of smartphones made them appear in other places (emails, web forums, etc.)

Sorry, I can't let the "compatibility" argument go unchallenged again.

It can be argued, and was, repeatedly and persuasively, that the initial collection of emoji in Unicode 6.1 [1] were added for compatibility with Japanese telco extensions to JIS.
But the additional emoji added to Unicode 6.2 and 7.0, and planned for 8.0, do not have even this provenance; they were added on foot of novel proposals sent directly to Unicode, or (more recently) by "popular request." There is no longer any requirement that the robot faces and burritos appear first in any sort of industry character set extension, with which Unicode is then obliged to maintain compatibility.

[1] No, I am not counting the ARIB symbols or any other long-encoded symbols that have been retroactively defined as emoji, to help legitimize the latter.

Alfred Zett wrote:

> The trick is that one doesn't bargain with Telcos and similar criminals. Gotta drop them hard and the pest will go away from itself after five years or so.

This does not help to make a case for or against encoding of anything.

--
Doug Ewell | Thornton, CO, USA | http://ewellic.org

From doug at ewellic.org Mon Feb 9 13:28:44 2015
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 09 Feb 2015 12:28:44 -0700
Subject: Unicode block for programming related symbols and codepoints?
Message-ID: <20150209122844.665a7a7059d7ee80bb4d670165c8327d.241b973136.wbe@email03.secureserver.net>

I can't count:

> It can be argued, and was, repeatedly and persuasively, that the initial collection of emoji in Unicode 6.1

6.0

> But the additional emoji added to Unicode 6.2 and 7.0

6.1 and 7.0

--
Doug Ewell | Thornton, CO, USA | http://ewellic.org

From shervinafshar at gmail.com Mon Feb 9 13:44:54 2015
From: shervinafshar at gmail.com (Shervin Afshar)
Date: Mon, 9 Feb 2015 11:44:54 -0800
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To: <20150209122530.665a7a7059d7ee80bb4d670165c8327d.98030114b2.wbe@email03.secureserver.net>
References: <20150209122530.665a7a7059d7ee80bb4d670165c8327d.98030114b2.wbe@email03.secureserver.net>
Message-ID:

> There is no longer any requirement that the robot faces and burritos appear first in any sort of industry character set extension, with which Unicode is then obliged to maintain compatibility.

Only if you don't consider existing usage and popular requests as requirement and precedent; for example Gmail had Robot Face for a long time.

Shervin

On Mon, Feb 9, 2015 at 11:25 AM, Doug Ewell wrote:

> Frédéric Grosshans wrote:
>
> > The inclusion of emoji was a considerable debate here, with people strongly against and strongly for. The trick is that they were already used as digital characters by Japanese Telcos and their millions of customers. They were de facto encoded as characters in Japanese text messages. At the time of encoding, the spread of smartphones made them appear in other places (emails, web forums, etc.)
>
> Sorry, I can't let the "compatibility" argument go unchallenged again.
>
> It can be argued, and was, repeatedly and persuasively, that the initial collection of emoji in Unicode 6.1 [1] were added for compatibility with Japanese telco extensions to JIS.
>
> But the additional emoji added to Unicode 6.2 and 7.0, and planned for 8.0, do not have even this provenance; they were added on foot of novel proposals sent directly to Unicode, or (more recently) by "popular request." There is no longer any requirement that the robot faces and burritos appear first in any sort of industry character set extension, with which Unicode is then obliged to maintain compatibility.
>
> [1] No, I am not counting the ARIB symbols or any other long-encoded symbols that have been retroactively defined as emoji, to help legitimize the latter.
>
> Alfred Zett wrote:
>
> > The trick is that one doesn't bargain with Telcos and similar criminals. Gotta drop them hard and the pest will go away from itself after five years or so.
>
> This does not help to make a case for or against encoding of anything.
>
> --
> Doug Ewell | Thornton, CO, USA | http://ewellic.org

From haberg-1 at telia.com Mon Feb 9 13:48:21 2015
From: haberg-1 at telia.com (Hans Aberg)
Date: Mon, 9 Feb 2015 20:48:21 +0100
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To: <54D8F9A5.9070302@att.net>
References: <54D7C3EA.6080000@web.de> <9B99B8A6-BB16-4DC0-A193-5D5869274040@lboro.ac.uk> <54D8F9A5.9070302@att.net>
Message-ID: <6A89E653-0EC3-4E2A-849B-53F1A54CB5B6@telia.com>

> On 9 Feb 2015, at 19:17, Ken Whistler wrote:
...
> The use in C of "=" and "==" was badly designed from the start, and is the source of bezillions of inadvertent programming errors in practice.

It is the ample oversupply of implicit conversions in combination with the lack of a proper boolean type that is causing those programming errors.

> But if a left arrow, for example, might be a better choice for an assignment operator in a programming language, and a two-character ASCII operator like ":=" or "<-" doesn't seem appropriate or causes other confusion, there still isn't a character *encoding* issue here. Just use "←", which already exists (U+2190), and is a fine left arrow!

There are also ≔ COLON EQUALS U+2254 and others. No problems using such characters in Flex: The problem is the lack of input methods.
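[Editorial illustration] The two characters named in this exchange can be checked against the Unicode Character Database. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `describe` is made up for illustration and is not part of any library.

```python
import unicodedata

def describe(ch):
    """Return (code point, character name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# U+2190 and U+2254 are the characters mentioned above.
for ch in "\u2190\u2254":
    print(describe(ch))
```

Both characters are already encoded as ordinary math symbols, so a lexer can accept them today; as noted above, the practical obstacle is typing them, not encoding them.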
From doug at ewellic.org Mon Feb 9 14:16:58 2015
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 09 Feb 2015 13:16:58 -0700
Subject: Emoji (was: Re: Unicode block for programming related symbols and codepoints?)
Message-ID: <20150209131658.665a7a7059d7ee80bb4d670165c8327d.de86ceee88.wbe@email03.secureserver.net>

Shervin Afshar wrote:

>> There is no longer any requirement that the robot faces and
>> burritos appear first in any sort of industry character set
>> extension, with which Unicode is then obliged to maintain
>> compatibility.
>
> Only if you don't consider existing usage and popular requests as
> requirement and precedence; for example Gmail had Robot Face for a
> long time.

I said there was no longer a requirement *that the items appear first in an industry character set extension*, right?

In what character encoding standard, or extension, does ROBOT FACE appear? "Gmail has it" is not a character encoding standard. Neither is "People want to see it."

"Most popularly requested," as a criterion for adding a character, is absolutely new to Unicode. Earlier I wrote privately to a Unicode officer about whether PERSON TAKING SELFIE and GIRL TWERKING and PERSON DUMPING ICE BUCKET OVER HEAD would be ephemeral enough, and got no reply. (What, you've forgotten the ice-bucket craze already? That's exactly why "most popular at the moment" wasn't supposed to be a criterion.)

--
Doug Ewell | Thornton, CO, USA | http://ewellic.org

From frederic.grosshans at gmail.com Mon Feb 9 14:34:33 2015
From: frederic.grosshans at gmail.com (Frédéric Grosshans)
Date: Mon, 9 Feb 2015 21:34:33 +0100
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To: <20150209122530.665a7a7059d7ee80bb4d670165c8327d.98030114b2.wbe@email03.secureserver.net>
References: <20150209122530.665a7a7059d7ee80bb4d670165c8327d.98030114b2.wbe@email03.secureserver.net>
Message-ID:

On 9 Feb 2015 20:27, "Doug Ewell" wrote:
>
> Sorry, I can't let the "compatibility" argument go unchallenged again.
>

I stand corrected (and I should have known better!)

From everson at evertype.com Mon Feb 9 14:36:33 2015
From: everson at evertype.com (Michael Everson)
Date: Mon, 9 Feb 2015 20:36:33 +0000
Subject: Emoji (was: Re: Unicode block for programming related symbols and codepoints?)
In-Reply-To: <20150209131658.665a7a7059d7ee80bb4d670165c8327d.de86ceee88.wbe@email03.secureserver.net>
References: <20150209131658.665a7a7059d7ee80bb4d670165c8327d.de86ceee88.wbe@email03.secureserver.net>
Message-ID: <62EBF72F-6832-4174-946C-234508DE434D@evertype.com>

I like symbols a lot. But I know that I and a number of people have been thinking that too much emphasis is being put on emoji.

Michael Everson * http://www.evertype.com/

From andrea.giammarchi at gmail.com Mon Feb 9 15:02:54 2015
From: andrea.giammarchi at gmail.com (Andrea Giammarchi)
Date: Mon, 9 Feb 2015 22:02:54 +0100
Subject: About cultural/languages communities flags
In-Reply-To:
References:

Message-ID:

Thanks, that was somehow indeed my very first concern. Everyone could claim an emoji, at that point.

Enough info for me so far, so thanks again.

Best Regards

On Mon, Feb 9, 2015 at 8:16 PM, Markus Scherer wrote:

> On Mon, Feb 9, 2015 at 9:54 AM, Andrea Giammarchi <andrea.giammarchi at gmail.com> wrote:
>
>> if a cultural/language TLD is typed with Unicode RIS, then show the
>> flag for these culture/language:
>
> This does not work. The "Unicode RIS" are defined to be used in pairs,
> with semantics according to corresponding ISO 3166 alpha2 codes. In your
> examples, each successive pair will encode a flag.
>
> If you want to represent every flag of every locality, you first have to
> figure out how to catalog and label them. You are mentioning provinces, one
> level down from nation states; I guess there are thousands of them. In much
> of Europe, every little village has its own flag and coat of arms.
> Where do you want the text encoding and fonts to stop?
>
> markus

From alfred_z at web.de Mon Feb 9 15:04:32 2015
From: alfred_z at web.de (Alfred Zett)
Date: Mon, 09 Feb 2015 22:04:32 +0100
Subject: Emoji (was: Re: Unicode block for programming related symbols and codepoints?)
In-Reply-To: <20150209131658.665a7a7059d7ee80bb4d670165c8327d.de86ceee88.wbe@email03.secureserver.net>
References: <20150209131658.665a7a7059d7ee80bb4d670165c8327d.de86ceee88.wbe@email03.secureserver.net>
Message-ID: <54D920E0.8020008@web.de>

Doug Ewell:
> "Most popularly requested," as a criterion for adding a character, is
> absolutely new to Unicode. Earlier I wrote privately to a Unicode
> officer about whether PERSON TAKING SELFIE and GIRL TWERKING and
> PERSON DUMPING ICE BUCKET OVER HEAD would be ephemeral enough, and got
> no reply. (What, you've forgotten the ice-bucket craze already? That's
> exactly why "most popular at the moment" wasn't supposed to be a
> criterion.)
There is much truth in this. I'll now leave the discussion, because it doesn't lead anywhere.

Best regards,
A. Z.

From shervinafshar at gmail.com Mon Feb 9 15:12:52 2015
From: shervinafshar at gmail.com (Shervin Afshar)
Date: Mon, 9 Feb 2015 13:12:52 -0800
Subject: Emoji (was: Re: Unicode block for programming related symbols and codepoints?)
In-Reply-To: <20150209131658.665a7a7059d7ee80bb4d670165c8327d.de86ceee88.wbe@email03.secureserver.net>
References: <20150209131658.665a7a7059d7ee80bb4d670165c8327d.de86ceee88.wbe@email03.secureserver.net>
Message-ID:

> I said there was no longer a requirement *that the items appear first in
> an industry character set extension*, right?

The issue is with your very rigid interpretation of the criteria for encoding new symbols. Is "appearing in an industry character set extension" an official phrasing that you keep referring to?

> In what character encoding standard, or extension, does ROBOT FACE
> appear? "Gmail has it" is not a character encoding standard. Neither is
> "People want to see it."

Robot Face is available on Gmail (GChat), Facebook, and Twitch among others (calculating the size of user community is left as an assignment for the reader). That's enough usage for consideration by the UTC even if the symbol is not present in a character encoding standard. Also, since Unicode is an industry standard maintained by industry members (among others), then if there is enough request to these corporations from communities of users, then there might be some reason for considering those symbols. I think that's the case for the newer symbols.

> "Most popularly requested," as a criterion for adding a character, is
> absolutely new to Unicode. Earlier I wrote privately to a Unicode
> officer about whether PERSON TAKING SELFIE and GIRL TWERKING and PERSON
> DUMPING ICE BUCKET OVER HEAD would be ephemeral enough, and got no
> reply. (What, you've forgotten the ice-bucket craze already? That's
> exactly why "most popular at the moment" wasn't supposed to be a
> criterion.)

IMO, Unicode officers seem to have low patience for such sentiments. You might want to reconsider your tone. There is a time and place for sarcasm.

Shervin

From markus.icu at gmail.com Mon Feb 9 15:21:06 2015
From: markus.icu at gmail.com (Markus Scherer)
Date: Mon, 9 Feb 2015 13:21:06 -0800
Subject: About cultural/languages communities flags
In-Reply-To:
References:

Message-ID:

On Mon, Feb 9, 2015 at 1:11 PM, Joan Montané wrote:

> AFAIK, this is done on the font side. Emoji flags are just ligatures, so a
> font can provide a ligature for 4 RIS characters.

Technically true, but a font that violates the encoding standard would cause large problems. Imagine a font that ligates letters 't' and 'h' and displays an Egyptian hieroglyph for the combination.

> What's the way for encoding them in Unicode standard?

In principle, the way for encoding anything in the Unicode Standard is to write a well-formed proposal, and convince the Unicode Technical Committee and ISO JTC1/SC2 that the proposal has merit.

However, I would much prefer if everyone spent their considerable energy on upgrading protocols (e.g., IETF RFCs for email subject lines) and lobbying relevant vendors (e.g., chat services & social network messages) to support images embedded in the text stream, ideally with scaling and other behavior that would make them behave somewhat text-like.

Best regards,
markus

From jf at colson.eu Mon Feb 9 15:07:29 2015
From: jf at colson.eu (Jean-François Colson)
Date: Mon, 09 Feb 2015 22:07:29 +0100
Subject: Unicode block for programming related symbols and codepoints?
Message-ID: <4rhiswo1bpjq22xajg9jw62q.1423515992998@email.android.com>

-------- Original message --------
From: Hans Aberg
Date: 09/02/2015 20:48 (GMT+01:00)
To: Ken Whistler
Cc: Unicode Mailing List
Subject: Re: Unicode block for programming related symbols and codepoints?

> On 9 Feb 2015, at 19:17, Ken Whistler wrote:
...
> But if a left arrow, for example, might be a better choice for an assignment
> operator in a programming language, and a two-character ASCII operator
> like ":=" or "<-" doesn't seem appropriate or causes other confusion, there
> still isn't a character *encoding* issue here. Just use "←", which already exists (U+2190),
> and is a fine left arrow!
There are also
  ≔ COLON EQUALS U+2254
and others.

No problems using such characters in Flex:

The problem is the lack of input methods.

No problem for me: I can input a ← by typing either Alt Gr + 4 (on the numeric keypad) or compose + < + -
I currently have no way to type "colon equals", but to type it as compose + : + = I would simply add one single line to my ~/.XCompose file:
<Multi_key> <colon> <equal> : "≔" U2254
and restart my text editor. It isn't more difficult than that.

(I'm on my phone right now.)

From doug at ewellic.org Mon Feb 9 16:17:36 2015
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 09 Feb 2015 15:17:36 -0700
Subject: Emoji (was: Re: Unicode block for programming related symbols and codepoints?)
Message-ID: <20150209151736.665a7a7059d7ee80bb4d670165c8327d.e90f409a66.wbe@email03.secureserver.net>

Shervin Afshar wrote:

> The issue is with your very rigid interpretation of the criteria for
> encoding new symbols. Is "appearing in an industry character set
> extension" an official phrasing that you keep referring to?

It was either from the WG2 Principles and Procedures document, or some other bit of Unicode/10646 folklore that I've read over the past 22 years of keeping up with Unicode/10646. I should look up the exact wording.

Of course, Unicode can encode anything they please. That's not in question. But in order to claim "compatibility" as the basis for encoding something, these specific, "rigid" definitions and criteria have historically been required. "Compatibility" with any random JPEG or meme that makes the rounds on the Internet was not enough.

> Robot Face is available on Gmail (GChat), Facebook, and Twitch among
> others (calculating the size of user community is left as an
> assignment for the reader). That's enough usage for consideration by
> the UTC even if the symbol is not present in a character encoding
> standard. Also, since Unicode is an industry standard maintained by
> industry members (among others), then if there is enough request to
> these corporations from communities of users, then there might be some
> reason for considering those symbols. I think that's the case for the
> newer symbols.

Great. Go ahead and encode them, UTC. But don't say it's because your hands are tied and you have no choice.

> IMO, Unicode officers seems to have low patience for such sentiments.
> You might want to reconsider your tone. There is a time and place for
> sarcasm.

I'll take my chances. I've been called out before for discouraging list members from requesting things that were out of scope according to the old rules. All I'm saying now is, if the old rules no longer apply, say so.

--
Doug Ewell | Thornton, CO, USA | http://ewellic.org

From joan at montane.cat Mon Feb 9 15:11:01 2015
From: joan at montane.cat (Joan Montané)
Date: Mon, 9 Feb 2015 22:11:01 +0100
Subject: About cultural/languages communities flags
In-Reply-To:
References:

Message-ID:

Hi all,

I am the one who made the request on the twemoji GitHub.

2015-02-09 20:16 GMT+01:00 Markus Scherer :

> On Mon, Feb 9, 2015 at 9:54 AM, Andrea Giammarchi <andrea.giammarchi at gmail.com> wrote:
>
>> if a cultural/language TLD is typed with Unicode RIS, then show the
>> flag for these culture/language:
>
> This does not work. The "Unicode RIS" are defined to be used in pairs,
> with semantics according to corresponding ISO 3166 alpha2 codes. In your
> examples, each successive pair will encode a flag.

AFAIK, this is done on the font side. Emoji flags are just ligatures, so a font can provide a ligature for 4 RIS characters. This is not an issue here.

I agree some strange behaviour can appear if a 3 RIS string, take CAT, is shown in a system with only 2 RIS support (a Canadian flag will appear followed by a T).

> If you want to represent every flag of every locality, you first have to
> figure out how to catalog and label them. You are mentioning provinces, one
> level down from nation states; I guess there are thousands of them. In much
> of Europe, every little village has its own flag and coat of arms.
> Where do you want the text encoding and fonts to stop?

I don't request flag support for every flag in the world. I requested flags for culture/language communities *with* an approved TLD (Top Level Domain).

I know flags are an issue, and I know flags represent territories, not languages, but I think some support should be provided for these active communities. As I pointed out, some country flag collections expand with a few non-independent countries. See [1], [2] and [3] (search for the Scottish or Welsh flag). You can check this [4] petition requesting a Catalan flag on WhatsApp.

So, there is a demand and they are used in the real world. What's the way for encoding them in the Unicode Standard?

Thanks,

Joan Montané
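
The pairing behaviour discussed in this message can be sketched as follows. This is a minimal illustration under stated assumptions, not how any particular renderer is implemented: the helper names (`to_ris`, `from_ris`, `split_pairs`) are hypothetical, and the sketch only groups regional indicator symbols (U+1F1E6..U+1F1FF) left to right into pairs without checking whether a pair is an assigned ISO 3166-1 alpha-2 code.

```python
RIS_BASE = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def to_ris(letters):
    """Map ASCII letters A-Z to the corresponding regional indicator symbols."""
    return "".join(chr(RIS_BASE + ord(c) - ord("A")) for c in letters.upper())


def from_ris(symbol):
    """Map one regional indicator symbol back to its ASCII letter."""
    return chr(ord(symbol) - RIS_BASE + ord("A"))


def split_pairs(ris_string):
    """Group symbols left to right into two-letter codes.

    Returns (pairs, leftover), where leftover is the unpaired trailing
    letter, or "" if the string has an even number of symbols.
    """
    letters = [from_ris(s) for s in ris_string]
    pairs = ["".join(letters[i:i + 2]) for i in range(0, len(letters) - 1, 2)]
    leftover = letters[-1] if len(letters) % 2 else ""
    return pairs, leftover


if __name__ == "__main__":
    for word in ("CAT", "SCOT", "CYMRU"):
        print(word, split_pairs(to_ris(word)))
```

For the three-symbol string built from "CAT", the sketch yields the pair "CA" plus an unpaired "T", which matches the Canadian-flag-followed-by-T behaviour described in the message.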

[1] http://www.famfamfam.com/lab/icons/flags/
[2] https://www.gosquared.com/resources/flag-icons/
[3] http://www.sherv.net/flag-emoticons.html
[4] https://www.change.org/p/whatsapp-inc-incloure-la-senyera-de-catalunya-a-whatsapp

From joan at montane.cat Mon Feb 9 15:18:23 2015
From: joan at montane.cat (Joan Montané)
Date: Mon, 9 Feb 2015 22:18:23 +0100
Subject: About cultural/languages communities flags
In-Reply-To:
References:

Message-ID:

Sorry, my reply was sent CC: to the Unicode ML.

My apologies,
Joan Montané

From haberg-1 at telia.com Mon Feb 9 16:33:49 2015
From: haberg-1 at telia.com (Hans Aberg)
Date: Mon, 9 Feb 2015 23:33:49 +0100
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To: <4rhiswo1bpjq22xajg9jw62q.1423515992998@email.android.com>
References: <4rhiswo1bpjq22xajg9jw62q.1423515992998@email.android.com>
Message-ID: <6E316B18-AB55-43B8-9441-E6E06173A31A@telia.com>

> On 9 Feb 2015, at 22:07, Jean-François Colson wrote:
>> > But if a left arrow, for example, might be a better choice for an assignment
>> > operator in a programming language, and a two-character ASCII operator
>> > like ":=" or "<-" doesn't seem appropriate or causes other confusion, there
>> > still isn't a character *encoding* issue here. Just use "←", which already exists (U+2190),
>> > and is a fine left arrow!
>>
>> There are also
>>   ≔ COLON EQUALS U+2254
>> and others.
>>
>> No problems using such characters in Flex:
>>
>> The problem is the lack of input methods.
>
> No problem for me: I can input a ← by typing either Alt Gr + 4 (on the numeric keypad) or compose + < + -
> I have no way to type "colon equals" but to type it as compose + : + = I should simply add one single line to my ~/.XCompose file:
> <Multi_key> <colon> <equal> : "≔" U2254
> and restart my text editor. That isn't more difficult than that.

The problem is that there are a lot of characters, and it is rather time-consuming to design one's own input methods.
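
One way to reduce that effort is to generate the compose entries from a table instead of writing each line by hand. The following is a minimal sketch, assuming the X11 compose-file syntax used for ~/.XCompose; the table contents and the helper names (`SEQUENCES`, `xcompose_line`, `xcompose_file`) are illustrative assumptions, and the key names are X keysym names.

```python
# Hypothetical table: compose key sequence (X keysym names) -> target character.
SEQUENCES = {
    ("colon", "equal"): "\u2254",  # COLON EQUALS
    ("less", "minus"): "\u2190",   # LEFTWARDS ARROW
}


def xcompose_line(keysyms, char):
    """Format one compose entry: key sequence, quoted result, code point."""
    keys = " ".join(f"<{name}>" for name in ("Multi_key", *keysyms))
    return f'{keys} : "{char}" U{ord(char):04X}'


def xcompose_file(sequences):
    """Build the text of a compose file from a {keysyms: char} table."""
    # include "%L" keeps the locale's default compose sequences available.
    lines = ['include "%L"']
    lines += [xcompose_line(keys, char) for keys, char in sequences.items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(xcompose_file(SEQUENCES), end="")
```

The output can be reviewed and then saved as ~/.XCompose; applications typically need to be restarted before new sequences take effect.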
From doug at ewellic.org Mon Feb 9 16:38:42 2015
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 09 Feb 2015 15:38:42 -0700
Subject: About cultural/languages communities flags
Message-ID: <20150209153842.665a7a7059d7ee80bb4d670165c8327d.766f5788f4.wbe@email03.secureserver.net>

Joan Montané wrote:

> I don't request flag support for every flag in the world. I requested
> flags for culture/language communities *with* an approved TLD (Top
> Level Domain).

Incidentally, about a year and a half ago I discussed this with another list member, on- and off-list. We agreed that some sort of text-based encoding of flags would be an interesting project, but disagreed as to whether this was a Unicode problem.

The present discussion seems to approach the issue from the other side: treat it as *only* a Unicode problem, and assume that the encoding problem has been solved by TLD registration.

See also http://www.unicode.org/faq/emoji_dingbats.html#12 . This is the Unicode Consortium talking, not me.

--
Doug Ewell | Thornton, CO, USA | http://ewellic.org

From shervinafshar at gmail.com Mon Feb 9 17:04:30 2015
From: shervinafshar at gmail.com (Shervin Afshar)
Date: Mon, 9 Feb 2015 15:04:30 -0800
Subject: Emoji (was: Re: Unicode block for programming related symbols and codepoints?)
In-Reply-To: <20150209151736.665a7a7059d7ee80bb4d670165c8327d.e90f409a66.wbe@email03.secureserver.net>
References: <20150209151736.665a7a7059d7ee80bb4d670165c8327d.e90f409a66.wbe@email03.secureserver.net>
Message-ID:

> It was either from the WG2 Principles and Procedures document, or some
> other bit of Unicode/10646 folklore that I've read over the past 22
> years of keeping up with Unicode/10646. I should look up the exact
> wording.

Yes, please. I would like to have that policy noted for my future use.

> Of course, Unicode can encode anything they please. That's not in
> question. But in order to claim "compatibility" as the basis for
> encoding something, these specific, "rigid" definitions and criteria
> have historically been required. "Compatibility" with any random JPEG or
> meme that makes the rounds on the Internet was not enough.

It's not about encoding what "they" please. Compatibility was the issue with the first set of emoji symbols. The rest of the symbols are being added for various other reasons, e.g. diversity, parity, requests, etc. Also, random JPEGs and memes don't apply here, and you're mistaken to assume that GChat and Facebook fit in this category.

> Great. Go ahead and encode them, UTC. But don't say it's because your
> hands are tied and you have no choice.

Quoting an official UTC communication?

> I'll take my chances. I've been called out before for discouraging list
> members from requesting things that were out of scope according to the
> old rules. All I'm saying now is, if the old rules no longer apply, say
> so.

AFAIK, rules haven't changed. Unicode didn't have a policy regarding emoji and symbols with similar usage. Now it does. For a while now, some folks have tended to use emoji as a means to an end other than what is in the scope of conversation regarding emoji. And that is not acceptable.

Shervin

From kenwhistler at att.net Mon Feb 9 17:11:39 2015
From: kenwhistler at att.net (Ken Whistler)
Date: Mon, 09 Feb 2015 15:11:39 -0800
Subject: About cultural/languages communities flags
In-Reply-To: <20150209153842.665a7a7059d7ee80bb4d670165c8327d.766f5788f4.wbe@email03.secureserver.net>
References: <20150209153842.665a7a7059d7ee80bb4d670165c8327d.766f5788f4.wbe@email03.secureserver.net>
Message-ID: <54D93EAB.7010409@att.net>

To follow up on Doug Ewell's response, the mechanism currently standardized in the Unicode Standard for "regional indicator codes" has an interpretation tied to the two-letter codes of ISO 3166-1, and *not* to TLD's. The two are not directly connected.

If anyone really wants to pursue getting a Scots flag into general implementation via Unicode regional indicator codes, the correct way to make that happen is for somebody to get off their duff and convince the BSI (British Standards Institution) to put in for an exceptional reservation of a two-letter code for Scotland in ISO 3166-1 by petitioning the ISO 3166/MA.

See:

http://en.wikipedia.org/wiki/ISO_3166-1_alpha-2

for the full context, and for the current 26x26 letter matrix which is the basis for the flag glyph implementations of regional indicator code pairs on smartphones.

SC, SO, ST are already taken, but might I suggest putting in for registering "AB" for Alba? That one is currently unassigned.

Yeah, yeah, what is the likelihood of BSI pushing for a Scots two-letter code?! But seriously, if folks are planning ahead for Scots independence or even some kind of greater autonomy, this is an issue that needs to be worked, anyway.

In the meantime, let me reiterate that there is *no* formal relationship between TLD's and the regional indicator codes in Unicode (or the implementations built upon them). Well, yes, a bunch of registered TLD's do match the country codes, but there is no two-letter constraint on TLD's. This should already be apparent, as Scotland has registered ".scot". At this point there isn't even a limitation of TLD's to ASCII letters, so there is no way to map them to the limited set of regional indicator codes in the Unicode Standard.

Not having a two-letter country code for Scotland that matches the four-letter TLD for Scotland might indeed be a problem for someone, but I don't see *this* as a problem that the Unicode Standard needs to solve.

--Ken

From doug at ewellic.org Mon Feb 9 17:53:44 2015
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 09 Feb 2015 16:53:44 -0700
Subject: About cultural/languages communities flags
Message-ID: <20150209165344.665a7a7059d7ee80bb4d670165c8327d.8fe2797a38.wbe@email03.secureserver.net>

And just another follow-up, to try to explain *why* the mechanism for Regional Indicator Codes might be so closely tied to ISO 3166-1 alpha-2 code elements:

ISO 3166-1 codes are derived from code elements published by the United Nations Statistics Division. This is the group that ultimately decides "what is and isn't a country" for the purposes of these codes.

While there is inevitably some political influence in the UN, many organizations and projects that use ISO 3166-1 codes do so to avoid getting embroiled in their own debate over "what is a country."
The IETF language-tagging project (BCP 47, RFC 5646; see "IETF language tag" in Wikipedia for more information) is one example.

Conversely, it is sometimes the case that groups which seek to extend the set of ISO 3166-1 codes unilaterally, or to establish a competing or supplemental coding system, might do so in order to gain acceptance or establish credibility for a nation or territory that is not recognized as such by UNSD.

It is entirely reasonable (IMHO) to suggest that if Unicode were to attempt, by whatever means, to enable encoding of flags for entities beyond those encoded in ISO 3166-1, the door would be opened wide for unrecognized nations and separatist groups to claim that the Unicode Consortium "supports" their cause by supporting display of their flag. It's very possible that Unicode has thought of this and does not want to put itself in that position.

--
Doug Ewell | Thornton, CO, USA | http://ewellic.org

From chris.fynn at gmail.com Mon Feb 9 22:37:01 2015
From: chris.fynn at gmail.com (Christopher Fynn)
Date: Tue, 10 Feb 2015 10:37:01 +0600
Subject: About cultural/languages communities flags
In-Reply-To:
References:
Message-ID:

Using flags to indicate particular languages on websites has plenty of problems - languages need a better indicator.

Scripts could be indicated by a representative glyph.

From mark at macchiato.com Tue Feb 10 00:10:56 2015
From: mark at macchiato.com (Mark Davis ☕️)
Date: Tue, 10 Feb 2015 07:10:56 +0100
Subject: About cultural/languages communities flags
In-Reply-To: <54D93EAB.7010409@att.net>
References: <20150209153842.665a7a7059d7ee80bb4d670165c8327d.766f5788f4.wbe@email03.secureserver.net> <54D93EAB.7010409@att.net>
Message-ID:

On Tue, Feb 10, 2015 at 12:11 AM, Ken Whistler wrote:

> for the full context, and for the current 26x26 letter matrix which is
> the basis for the flag glyph implementations of regional indicator
> code pairs on smartphones.
>
> SC, SO, ST are already taken, but might I suggest putting in for
> registering "AB" for Alba? That one is currently unassigned.
>
> Yeah, yeah, what is the likelihood of BSI pushing for a Scots two-letter
> code?! But seriously, if folks are planning ahead for Scots independence
> or even some kind of greater autonomy, this is an issue that needs to
> be worked, anyway.
>
> In the meantime, let me reiterate that there is *no* formal relationship
> between TLD's and the regional indicator codes in Unicode (or the
> implementations built upon them). Well, yes, a bunch of registered TLD's
> do match the country codes, but there is no two-letter constraint on
> TLD's. This should already be apparent, as Scotland has registered
> ".scot" At this point there isn't even a limitation of TLD's to ASCII
> letters, so there is no way to map them to the limited set of regional
> indicator codes in the Unicode Standard.
>
> Not having a two letter country code for Scotland that matches the
> four letter TLD for Scotland might indeed be a problem for someone,
> but I don't see *this* as a problem that the Unicode Standard needs
> to solve.

I want to add to that that there are already a fair number of ISO 2-letter codes for regions that are administered as part of another country, like Hong Kong. There are also codes for crown possessions like Guernsey. So having a code for Scotland (and Wales, and N. Ireland) does not really break precedent. But as Ken says, the best mechanism is for the UK to push for a code in ISO and the UN.

Mark

*Il meglio è l'inimico del bene*
URL:

From joan at montane.cat Tue Feb 10 01:32:17 2015
From: joan at montane.cat (Joan Montané)
Date: Tue, 10 Feb 2015 08:32:17 +0100
Subject: About cultural/languages communities flags
In-Reply-To:
References: <20150209153842.665a7a7059d7ee80bb4d670165c8327d.766f5788f4.wbe@email03.secureserver.net> <54D93EAB.7010409@att.net>
Message-ID:

Thanks for your replies.

As far as I can see, my informal request to expand the current RIS design hasn't had a good response. I understand that. Flags are a cause of disputes, and it isn't a matter for Unicode to encode them.

IMHO staying tied to 2-alpha codes is a poor choice for users. Maybe industry manufacturers could find a better approach.

Best regards,
Joan Montané
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From doug at ewellic.org Tue Feb 10 10:16:14 2015
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 10 Feb 2015 09:16:14 -0700
Subject: About cultural/languages communities flags
Message-ID: <20150210091614.665a7a7059d7ee80bb4d670165c8327d.a342055f13.wbe@email03.secureserver.net>

Joan Montané wrote:

> As far as I can see, my informal request to expand the current RIS design
> hasn't had a good response. I understand that. Flags are a cause of
> disputes, and it isn't a matter for Unicode to encode them.

There are technical limitations as well. Because the mechanism is already defined on pairs of symbols, it's not trivial to expand it to three or more symbols.

Earlier, you had written:

> I agree some strange behaviour can appear if a 3 RIS string, take CAT,
> is shown in a system with only 2 RIS support (a Canadian will appear
> followed by a T).

but in fact, every one of the combinations in the original post will generate incorrect output (if any):

> [S][C][O][T] --> it shows Scottish flag

Seychelles, "undefined"

> [C][Y][M][R][U] --> it shows a Welsh flag

Cyprus, Mauritania, unpaired symbol

> [B][Z][H] --> it shows a Breton flag

Belize, unpaired symbol

> [C][A][T] --> it shows Catalan flag

Canada, unpaired symbol

> [E][U][S] --> it shows a Basque flag

"Undefined" (or European Union if the implementation happens to include an extension to ISO 3166 exceptionally reserved code elements), unpaired symbol

> [G][A][L] --> it shows a Galician flag

Gabon, unpaired symbol

In order to make a system like this work with an arbitrary number of symbols, a terminating symbol would have to be defined. Finding the longest match between a string of symbols and a TLD wouldn't work; someone might really want to encode "Brazil, United States, Sweden, Lesotho" consecutively, and would not want this converted to "Brussels."

And as Ken pointed out, TLDs are TLDs; they are not a general-purpose geographic coding system. They don't include every sub-national region or separatist group, only the ones that Donuts and similar companies chose to register. There's no TLD for Abkhazia, for example, or for ISIS.

> IMHO staying tied to 2-alpha codes is a poor choice for users. Maybe
> industry manufacturers could find a better approach.

Let's hope that industry manufacturers adhere to the standard instead of going off on their own. I thought that was the idea when all these cell-phone symbols were added to Unicode in the first place.

--
Doug Ewell | Thornton, CO, USA | http://ewellic.org

From mark at macchiato.com Tue Feb 10 10:45:24 2015
From: mark at macchiato.com (Mark Davis ☕️)
Date: Tue, 10 Feb 2015 17:45:24 +0100
Subject: Emoji (was: Re: Unicode block for programming related symbols and codepoints?)
In-Reply-To: <62EBF72F-6832-4174-946C-234508DE434D@evertype.com> References: <20150209131658.665a7a7059d7ee80bb4d670165c8327d.de86ceee88.wbe@email03.secureserver.net> <62EBF72F-6832-4174-946C-234508DE434D@evertype.com> Message-ID: We are being pretty conservative about what we add. There are approximately 1,200 emoji characters now (see tr51), and we're anticipating adding perhaps 50 per release. And we are encouraging a "sticker" approach for the longer term. On the other hand, I wouldn't be surprised if the 41 emoji characters that we are planning on for Unicode 8.0 end up having a higher frequency of use than the other 7K characters in the release. Mark *? Il meglio ? l?inimico del bene ?* On Mon, Feb 9, 2015 at 9:36 PM, Michael Everson wrote: > I like symbols a lot. But I know that I and a number of people have been > thinking that too much emphasis is being put on emoji. > > Michael Everson * http://www.evertype.com/ > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Tue Feb 10 10:48:34 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Tue, 10 Feb 2015 17:48:34 +0100 Subject: Emoji (was: Re: Unicode block for programming related symbols and codepoints?) In-Reply-To: <20150209131658.665a7a7059d7ee80bb4d670165c8327d.de86ceee88.wbe@email03.secureserver.net> References: <20150209131658.665a7a7059d7ee80bb4d670165c8327d.de86ceee88.wbe@email03.secureserver.net> Message-ID: > In what character encoding standard, or extension, does ROBOT FACE appear? Unicode has never been limited to what is in other character encoding standard or extensions, "official" or de facto. Mark *? Il meglio ? 
l?inimico del bene ?* On Mon, Feb 9, 2015 at 9:16 PM, Doug Ewell wrote: > Shervin Afshar wrote: > > >> There is no longer any requirement that the robot faces and > >> burritos appear first in any sort of industry character set > >> extension, with which Unicode is then obliged to maintain > >> compatibility. > > > > Only if you don't consider existing usage and popular requests as > > requirement and precedence; for example Gmail had Robot Face for a > > long time. > > I said there was no longer a requirement *that the items appear first in > an industry character set extension*, right? > > In what character encoding standard, or extension, does ROBOT FACE > appear? "Gmail has it" is not a character encoding standard. Neither is > "People want to see it." > > "Most popularly requested," as a criterion for adding a character, is > absolutely new to Unicode. Earlier I wrote privately to a Unicode > officer about whether PERSON TAKING SELFIE and GIRL TWERKING and PERSON > DUMPING ICE BUCKET OVER HEAD would be ephemeral enough, and got no > reply. (What, you've forgotten the ice-bucket craze already? That's > exactly why "most popular at the moment" wasn't supposed to be a > criterion.) > > -- > Doug Ewell | Thornton, CO, USA | http://ewellic.org > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Tue Feb 10 11:00:17 2015 From: doug at ewellic.org (Doug Ewell) Date: Tue, 10 Feb 2015 10:00:17 -0700 Subject: Emoji (was: Re: Unicode block for programming related symbols and =?UTF-8?Q?codepoints=3F=29?= Message-ID: <20150210100017.665a7a7059d7ee80bb4d670165c8327d.729df8c844.wbe@email03.secureserver.net> Shervin Afshar wrote: >>> The issue is with your very rigid interpretation of the criteria for >>> encoding new symbols. 
Is "appearing in an industry character set >>> extension" an official phrasing that you keep referring to? >> >> It was either from the WG2 Principles and Procedures document, or >> some other bit of Unicode/10646 folklore that I've read over the past >> 22 years of keeping up with Unicode/10646. I should look up the exact >> wording. > > Yes, please. I would like to have that policy noted for my future use. I hadn't said, of course, that no new symbols could ever be encoded unless they appeared in an industry character set or extension. I was responding to a point that Fr?d?ric Grosshans made [1] about these symbols being added for compatibility with Japanese telco usage. That argument could be used for the original emoji set, but not for new emoji; those are supposed to follow the regular criteria. [1] http://unicode.org/pipermail/unicode/2015-February/001246.html Here is a passage from TUS 7.0, Section 2.3 that may shed light: "Conceptually, compatibility characters are characters that would not have been encoded in the Unicode Standard except for compatibility and round-trip convertibility with other standards. Such standards include international, national, and vendor character encoding standards. For the most part, these are widely used standards that pre-dated Unicode, but because continued interoperability with new standards and data sources is one of the primary design goals of the Unicode Standard, additional compatibility characters are added as the situation warrants. "Compatibility characters can be contrasted with ordinary (or non-compatibility) characters in the standard?ones that are generally consistent with the Unicode text model and which would have been accepted for encoding to represent various scripts and sets of symbols, regardless of whether those characters also existed in other character encoding standards." > It's not about encoding what "they" please. Compatibility was the > issue with the first set of emoji symbols. 
The rest of symbols are > being added for various other reasons; e.g. diversity, parity, > requests, etc. Right. So the "compatibility with Japanese telcos" argument cannot be used here. > Also, random JPEG and meme don't apply here and you're mistaken to > assume that GChat and Facebook fit in this category. If you look at the set of new emoji proposed in L2/15-054 [2], you'll see that quite a few of them are justified by their current popularity on the Web. ("Selfie are very popular" was kind of striking. I guess at least one of my predictions was right.) [2] http://www.unicode.org/L2/L2015/15054r-emoji-tranche5.pdf >> Great. Go ahead and encode them, UTC. But don't say it's because your >> hands are tied and you have no choice. > > Quoting an official UTC communication? Quoting an off-list remark. > For a longer while now, some folks tend to use emoji as means to an > end other than what is in the scope of conversation regarding emoji. > And that is not acceptable. Sorry, I don't understand this. -- Doug Ewell | Thornton, CO, USA | http://ewellic.org From doug at ewellic.org Tue Feb 10 11:03:06 2015 From: doug at ewellic.org (Doug Ewell) Date: Tue, 10 Feb 2015 10:03:06 -0700 Subject: Emoji (was: Re: Unicode block for programming related symbols and =?UTF-8?Q?codepoints=3F=29?= Message-ID: <20150210100306.665a7a7059d7ee80bb4d670165c8327d.cd03b4ab63.wbe@email03.secureserver.net> Mark Davis ?? wrote: >> In what character encoding standard, or extension, does ROBOT FACE >> appear? > > Unicode has never been limited to what is in other character encoding > standard or extensions, "official" or de facto. Of course not. But that's been a stated condition for labeling something as "compatibility." 
-- Doug Ewell | Thornton, CO, USA | http://ewellic.org From shervinafshar at gmail.com Tue Feb 10 12:07:17 2015 From: shervinafshar at gmail.com (Shervin Afshar) Date: Tue, 10 Feb 2015 10:07:17 -0800 Subject: Emoji (was: Re: Unicode block for programming related symbols and codepoints?) In-Reply-To: <20150210100306.665a7a7059d7ee80bb4d670165c8327d.cd03b4ab63.wbe@email03.secureserver.net> References: <20150210100306.665a7a7059d7ee80bb4d670165c8327d.cd03b4ab63.wbe@email03.secureserver.net> Message-ID: > Of course not. But that's been a stated condition for labeling something > as "compatibility." It *is* compatibility; go back and read my email where I mentioned exactly where it was used. ? Shervin On Tue, Feb 10, 2015 at 9:03 AM, Doug Ewell wrote: > Mark Davis ?? wrote: > > >> In what character encoding standard, or extension, does ROBOT FACE > >> appear? > > > > Unicode has never been limited to what is in other character encoding > > standard or extensions, "official" or de facto. > > Of course not. But that's been a stated condition for labeling something > as "compatibility." > > -- > Doug Ewell | Thornton, CO, USA | http://ewellic.org > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Tue Feb 10 12:27:32 2015 From: doug at ewellic.org (Doug Ewell) Date: Tue, 10 Feb 2015 11:27:32 -0700 Subject: Emoji (was: Re: Unicode block for programming related symbols and =?UTF-8?Q?codepoints=3F=29?= Message-ID: <20150210112732.665a7a7059d7ee80bb4d670165c8327d.7b9a049cd6.wbe@email03.secureserver.net> Shervin Afshar wrote: >> Of course not. But that's been a stated condition for labeling >> something as "compatibility." > > It *is* compatibility; go back and read my email where I mentioned > exactly where it was used. You mean the one where you said that Gmail has had ROBOT FACE for a long time? 
You mean to say that any time Gmail or someone adds a private-use character or embeddable graphic for TOILET PAPER or TIRE IRON or BEER KEG, that Unicode is essentially obliged to add an emoji to maintain compatibility with it? Well, perhaps that's how it is now. But that isn't the way Unicode used to be. Fuddily-duddily, -- Doug Ewell | Thornton, CO, USA | http://ewellic.org From shervinafshar at gmail.com Tue Feb 10 12:29:43 2015 From: shervinafshar at gmail.com (Shervin Afshar) Date: Tue, 10 Feb 2015 10:29:43 -0800 Subject: Emoji (was: Re: Unicode block for programming related symbols and codepoints?) In-Reply-To: References: <20150210100306.665a7a7059d7ee80bb4d670165c8327d.cd03b4ab63.wbe@email03.secureserver.net> Message-ID: > > I was responding to a point that Fr?d?ric Grosshans made [1] about > these symbols being added for compatibility with Japanese telco usage. > That argument could be used for the original emoji set, but not for new > emoji; those are supposed to follow the regular criteria. The compatibility argument can also be applied to major vendors who are using emoji other than Japanese vendors; you can find a list of 20-30 of them here[3]. Add to that list, Facebook and Google. If it is commonly in use, it has a precedence to be proposed for addition to Unicode. To have an informing, objective conversation, people should first look at the actual criteria[4] (as well as the criteria for encoding symbols[5]) and see if what they are claiming is actually according to the criteria or not. [3]: http://www.emoji-cheat-sheet.com/ [4]: http://www.unicode.org/reports/tr51/#Selection_Factors [5]: http://unicode.org/pending/symbol-guidelines.html > If you look at the set of new emoji proposed in L2/15-054 [2], you'll > see that quite a few of them are justified by their current popularity > on the Web. ("Selfie are very popular" was kind of striking. I guess at > least one of my predictions was right.) 
> [2] http://www.unicode.org/L2/L2015/15054r-emoji-tranche5.pdf > First of all, these are just proposed and not accepted. Secondly, requests by online communities (either directly to UTC or through corp members) creates a precedence for UTC to consider the symbol for encoding. > > For a longer while now, some folks tend to use emoji as means to an > > end other than what is in the scope of conversation regarding emoji. > > And that is not acceptable. > Sorry, I don't understand this. No worries. I don't blame you. It's just the good ol' circular logic. ? Shervin On Tue, Feb 10, 2015 at 10:07 AM, Shervin Afshar wrote: > > Of course not. But that's been a stated condition for labeling something > > as "compatibility." > > It *is* compatibility; go back and read my email where I mentioned exactly > where it was used. > > > ? Shervin > > On Tue, Feb 10, 2015 at 9:03 AM, Doug Ewell wrote: > >> Mark Davis [image: ?]? wrote: >> >> >> In what character encoding standard, or extension, does ROBOT FACE >> >> appear? >> > >> > Unicode has never been limited to what is in other character encoding >> > standard or extensions, "official" or de facto. >> >> Of course not. But that's been a stated condition for labeling something >> as "compatibility." >> >> -- >> Doug Ewell | Thornton, CO, USA | http://ewellic.org >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: emoji_u2615.png Type: image/png Size: 1890 bytes Desc: not available URL: From chris.fynn at gmail.com Tue Feb 10 12:41:23 2015 From: chris.fynn at gmail.com (Christopher Fynn) Date: Wed, 11 Feb 2015 00:41:23 +0600 Subject: About cultural/languages communities flags In-Reply-To: References: <20150209153842.665a7a7059d7ee80bb4d670165c8327d.766f5788f4.wbe@email03.secureserver.net> <54D93EAB.7010409@att.net> Message-ID: One area where this would be useful is for indicating national teams in football (soccer), rugby and other sports where England, Scotland, Wales and N. Ireland play separately internationally. On 10 February 2015 at 12:10, Mark Davis ?? wrote: > > On Tue, Feb 10, 2015 at 12:11 AM, Ken Whistler wrote: >> >> for the full context, and for the current 26x26 letter matrix which is >> the basis for the flag glyph implementations of regional indicator >> code pairs on smartphones. >> >> SC, SO, ST are already taken, but might I suggest putting in for >> registering >> "AB" for Alba? That one is currently unassigned. >> >> Yeah, yeah, what is the likelihood of BSI pushing for a Scots two-letter >> code?! But seriously, if folks are planning ahead for Scots independence >> or even some kind of greater autonomy, this is an issue that needs to >> be worked, anyway. >> >> In the meantime, let me reiterate that there is *no* formal relationship >> between TLD's and the regional indicator codes in Unicode (or the >> implementations >> built upon them). Well, yes, a bunch of registered TLD's do match the >> country >> codes, but there is no two-letter constraint on TLD's. This should already >> be apparent, as Scotland has registered ".scot" At this point there isn't >> even >> a limitation of TLD's to ASCII letters, so there is no way to map them >> to the limited set of regional indicator codes in the Unicode Standard. 
>> >> Not having a two letter country code for Scotland that matches the >> four letter TLD for Scotland might indeed be a problem for someone, >> but I don't see *this* as a problem that the Unicode Standard needs >> to solve. > > > I want to add to that that there are already a fair number of ISO 2-letter > codes for regions that are administered as part of another country, like > Hong Kong. There are also codes for crown possessions like Guernsey. So > having a code for Scotland (and Wales, and N. Ireland) do not really break > precedent. But as Ken says, the best mechanism is for the UK to push for a > code in ISO and the UN. > > Mark > > ? Il meglio ? l?inimico del bene ? > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > From shervinafshar at gmail.com Tue Feb 10 12:48:20 2015 From: shervinafshar at gmail.com (Shervin Afshar) Date: Tue, 10 Feb 2015 10:48:20 -0800 Subject: Emoji (was: Re: Unicode block for programming related symbols and codepoints?) In-Reply-To: <20150210112732.665a7a7059d7ee80bb4d670165c8327d.7b9a049cd6.wbe@email03.secureserver.net> References: <20150210112732.665a7a7059d7ee80bb4d670165c8327d.7b9a049cd6.wbe@email03.secureserver.net> Message-ID: This thread turns more and more absurd by the email! I apologize to people on the list who have to tolerate this; it might be noisy and annoying, but it is important. Doug Ewell asked: You mean the one where you said that Gmail has had ROBOT FACE for a long > time? Let me use copy-paste for your convenience: Robot Face is available on Gmail (GChat), Facebook, and Twitch among others > (calculating the size of user community is left as an assignment for the > reader). That's enough usage for consideration by the UTC even if the > symbol is not present in a character encoding standard. 
and then, Doug Ewell wondered: You mean to say that any time Gmail or someone adds a private-use > character or embeddable graphic for TOILET PAPER or TIRE IRON or BEER > KEG, that Unicode is essentially obliged to add an emoji to maintain > compatibility with it? > Yes, but the industry is already moving away from character-based solutions and towards sticker-based solutions as we speak. Right now, Facebook is moving in this direction, as well as Line, Trello, and many others. But things which were added beforehand have precedence to be proposed to Unicode. > Well, perhaps that's how it is now. But that isn't the way Unicode used > to be. Well...Since you seem to be so keen on Internet memes, here's one[6] for you. [6]: http://www.quickmeme.com/img/2a/2ab86791fe23ec5c73dc6d46c2cc5bef14e5ca47ba9208571b79c078fb2af561.jpg ? Shervin On Tue, Feb 10, 2015 at 10:27 AM, Doug Ewell wrote: > Shervin Afshar wrote: > > >> Of course not. But that's been a stated condition for labeling > >> something as "compatibility." > > > > It *is* compatibility; go back and read my email where I mentioned > > exactly where it was used. > > You mean the one where you said that Gmail has had ROBOT FACE for a long > time? > > You mean to say that any time Gmail or someone adds a private-use > character or embeddable graphic for TOILET PAPER or TIRE IRON or BEER > KEG, that Unicode is essentially obliged to add an emoji to maintain > compatibility with it? > > Well, perhaps that's how it is now. But that isn't the way Unicode used > to be. > > Fuddily-duddily, > > -- > Doug Ewell | Thornton, CO, USA | http://ewellic.org > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From joan at montane.cat Tue Feb 10 14:28:19 2015
From: joan at montane.cat (Joan Montané)
Date: Tue, 10 Feb 2015 21:28:19 +0100
Subject: About cultural/languages communities flags
In-Reply-To: <20150210091614.665a7a7059d7ee80bb4d670165c8327d.a342055f13.wbe@email03.secureserver.net>
References: <20150210091614.665a7a7059d7ee80bb4d670165c8327d.a342055f13.wbe@email03.secureserver.net>
Message-ID:

2015-02-10 17:16 GMT+01:00 Doug Ewell :

> In order to make a system like this work with an arbitrary number of
> symbols, a terminating symbol would have to be defined. Finding the
> longest match between a string of symbols and a TLD wouldn't work;
> someone might really want to encode "Brazil, United States, Sweden,
> Lesotho" consecutively, and would not want this converted to "Brussels."
>
> And as Ken pointed out, TLDs are TLDs; they are not a general-purpose
> geographic coding system. They don't include every sub-national region
> or separatist group, only the ones that Donuts and similar companies
> chose to register. There's no TLD for Abkhazia, for example, or for
> ISIS.

Well, my proposal to use GeoTLDs is an answer to the question "where do you draw the line?" I agree that a terminating symbol would help in expanding the RIS system.

> > IMHO staying tied to 2-alpha codes is a poor choice for users. Maybe
> > industry manufacturers could find a better approach.
>
> Let's hope that industry manufacturers adhere to the standard instead of
> going off on their own. I thought that was the idea when all these
> cell-phone symbols were added to Unicode in the first place.

I fully agree. Manufacturers must follow standards. I support the standard, but IMHO the RIS design is very strict.

Unicode doesn't define flags. Unicode doesn't define country flags.
Unicode defines a mechanism for representing the flags of ISO countries (and dependent territories). But manufacturers don't follow the ISO country codes 100%; for instance, dependent-territory codes are usually mapped to the country's flag [1]. This is a choice made by industry manufacturers, but it's not in ISO. Another choice made by industry is to use a private code, like XK for Kosovo. That's good!

The issue with the Scottish, Welsh, Catalan and similar flags is a chicken-and-egg situation. If a manufacturer wants to add such flags, the standard doesn't allow it!!! (PUA can be used, of course.) And Unicode doesn't expand RIS because manufacturers don't use these flags.

IMHO the RIS mechanism should be expanded to be more flexible, beyond 2-character RIS sequences. Unicode doesn't define flags, it defines a mechanism. Manufacturers will choose which flags to support, just as they are doing now!

So, the real question here is: where do you draw the line? Currently it's drawn at ISO 3166-1 plus some customizations made by industry, but it's always tied to 2-character RIS sequences. IMHO this is too limited to cover real-world use and requests. I suggested using the current ISO country codes plus cultural/language TLDs. Maybe there is a better approach.

Best regards,
Joan Montané

[1] https://github.com/googlei18n/region-flags/blob/master/ALIASES
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From derhoermi at gmx.net Wed Feb 11 14:39:22 2015
From: derhoermi at gmx.net (Bjoern Hoehrmann)
Date: Wed, 11 Feb 2015 21:39:22 +0100
Subject: Unicode IDNA Compatibility Processing Proposed Update
In-Reply-To: <54DBB951.7040108@unicode.org>
References: <54DBB951.7040108@unicode.org>
Message-ID:

* announcements at unicode.org wrote: > Oh my...

--
Björn Höhrmann · mailto:bjoern at hoehrmann.de · http://bjoern.hoehrmann.de
D-10243 Berlin · PGP Pub. KeyID: 0xA4357E78 · http://www.bjoernsworld.de
Available for hire in Berlin (early 2015) ·
http://www.websitedev.de/ From asmusf at ix.netcom.com Thu Feb 12 15:47:27 2015 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Thu, 12 Feb 2015 13:47:27 -0800 Subject: sex and emoji Message-ID: <54DD1F6F.8090002@ix.netcom.com> An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Thu Feb 12 22:15:56 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 13 Feb 2015 05:15:56 +0100 Subject: About cultural/languages communities flags In-Reply-To: References:

Message-ID:

RIS could represent languages as well, using the BCP 47 principle, except that RIS sequences start with an ISO 3166 code. As there's no territory, you'd normally use a 3166 code for an undetermined region, but there's no 3166 code that starts with a hyphen. So to use a BCP 47 language tag you could use the hyphen, re-encoded as an RIS, as the first character.

The problem is that language codes in BCP 47 have variable sizes. Even if you limit yourself to the ISO 639 compatible repertoire (3-letter codes) you'd need to use 4 RIS codes, and the language flags would be represented as RIS(HYPHEN)+RIS(ISO 639-3 code).

4 codes would work with font rendering engines that can build 3 successive ligatures from left to right.

If there's no match for a known flag (or if there's an exact multiple of 4 RIS codes), the default glyphs would just show a blank flag frame showing the RIS codes converted back to ASCII letters, rendered in a small-capitals style: the first glyph shows the flag's hoist and the first RIS code (i.e. the hyphen), the 2nd and 3rd glyphs show the top/bottom part of the blank frame and the ASCII character, and the 4th glyph is similar but adds the flying end of the flag, possibly decorated with a non-rectangular frame. If fewer than 4 RIS codes remain, the flag frame would add the flying end of the flag, with no letter (or just the SPACE). The whole would be in a large dotted frame to exhibit the special format.

These default glyphs are easy to produce in the font. Then to support more languages (7000 languages: 7000 flags? certainly not so many exist...), you just have to map new ligatures to replace the default ligatures with more accurate "flags".

But my opinion is that "flags" (even if shown generically) are not the right concept for languages. I would highly prefer a "speech bubble frame" like in comics (even if some applications could render a colorful regional flag in it), or the letter code within the "sound waves" of an audio speaker device.
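
As an illustration of the pairing behaviour discussed in this thread (regional indicator symbols consumed two at a time, left to right, with any odd symbol left over), here is a minimal Python sketch. It is not code from any poster or from any library: the names `to_ris`, `from_ris` and `split_pairs` are hypothetical, and the only fact it relies on is that the regional indicator symbols occupy U+1F1E6 (letter A) through U+1F1FF (letter Z). Whether a given pair is actually drawn as a flag depends on the font or platform and is not modelled here.

```python
RIS_BASE = 0x1F1E6  # U+1F1E6 REGIONAL INDICATOR SYMBOL LETTER A


def to_ris(letters):
    """Map ASCII letters A-Z to the corresponding regional indicator symbols."""
    return "".join(chr(RIS_BASE + ord(c) - ord("A")) for c in letters.upper())


def from_ris(text):
    """Map regional indicator symbols back to ASCII letters A-Z."""
    return "".join(chr(ord(c) - RIS_BASE + ord("A")) for c in text)


def split_pairs(ris_text):
    """Pair symbols greedily from left to right; an odd trailing symbol stays unpaired."""
    letters = from_ris(ris_text)
    return [letters[i:i + 2] for i in range(0, len(letters), 2)]


# Hypothetical inputs taken from the examples earlier in the thread.
for word in ("SCOT", "CYMRU", "CAT", "BRUSSELS"):
    print(word, "->", split_pairs(to_ris(word)))
```

Under this sketch SCOT splits into SC and OT, CYMRU into CY, MR and a lone U, and BRUSSELS into BR, US, SE, LS, which is why a longer code cannot simply be layered on top of the existing pairs without some terminating or introducing symbol.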
2015-02-09 22:11 GMT+01:00 Joan Montan? : > > Hi all, > > I am the one who made the request to tweemoji Github. > > > 2015-02-09 20:16 GMT+01:00 Markus Scherer : > >> On Mon, Feb 9, 2015 at 9:54 AM, Andrea Giammarchi < >> andrea.giammarchi at gmail.com> wrote: >> >>> > if a cultural/language TLD is typed with Unicode RIS, then show the >>> flag for these culture/language: >>> >> >> This does not work. The "Unicode RIS" are defined to be used in pairs, >> with semantics according to corresponding ISO 3166 alpha2 codes. In your >> examples, each successive pair will encode a flag. >> >> > AFAIK, this is done in font side. Emoji flags are just ligatures, so a > font can provide a ligature for 4 RIS characters. This is not an issue here. > > I agree some strange behaviour can appear if a 3 RIS string, take CAT, is > shown in a system with only 2 RIS support (a Canadian will appear followed > by a T). > > > If you want to represent every flag of every locality, you first have to >> figure out how to catalog and label them. You are mentioning provinces, one >> level down from nation states; I guess there are thousands of them. In much >> of Europe, every little village >> has its own flag and coat of >> arms. Where do you want the text encoding and fonts to stop? >> >> > I don't request flag support for every flag in the world. I requested > flags for culture/language communities *with* an approved TLD (Top Level > Domain). > > I know flags are an issue, and I know flags represents territories, not > languages, but I think some support should be done for these active > communities. As I pointed, some country flag collections expand with a fews > non-independent country. See [1], [2] and [3] (search for Scottish or > Welsh flag). You can check this [4] petition requesting Catalan flag on > WhatsApp. > > So, there is a demand and they are used in real world. What's the way for > encoding them in Unicode standard? > > Thanks, > > Joan Montan? 
> > [1] http://www.famfamfam.com/lab/icons/flags/ > [2] https://www.gosquared.com/resources/flag-icons/ > [3] http://www.sherv.net/flag-emoticons.html > [4] > https://www.change.org/p/whatsapp-inc-incloure-la-senyera-de-catalunya-a-whatsapp > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From srl at icu-project.org Thu Feb 12 23:12:46 2015 From: srl at icu-project.org (Steven R. Loomis) Date: Thu, 12 Feb 2015 21:12:46 -0800 Subject: About cultural/languages communities flags In-Reply-To: References:

Message-ID:

> On Feb 9, 2015, at 1:21 PM, Markus Scherer wrote:
>
> However, I would much prefer if everyone spent their considerable energy on upgrading protocols (e.g., IETF RFCs for email subject lines) and lobbying relevant vendors (e.g., chat services & social network messages) to support images embedded in the text stream, ideally with scaling and other behavior that would make them behave somewhat text-like.

This is the "long term solution" listed for emoji in http://www.unicode.org/reports/tr51/#Longer_Term

S

From verdy_p at wanadoo.fr Thu Feb 12 23:22:42 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 13 Feb 2015 06:22:42 +0100
Subject: About cultural/languages communities flags
In-Reply-To: References:

Message-ID:

Another solution is not to extend the scope of use of RIS characters (leave them as they are, for ISO 3166-1 based codes only), but to define a separate set of "Language Indicator Symbols" (LIS) working the same way, but based on ISO 639-2 or -3 (3-letter codes, also accepting the language family codes, which are encoded on 3 letters as well, and all the -3 macrolanguages such as "zho" for Chinese or "que" for Quechua).

This is exactly the same principle as RIS, and as easy to produce with a generic font with very few actual glyphs (only the OpenType ligature table may look long, but it can be generated automatically by a basic script and integrated into the font build project). There is no need for complex ligature support: everything can work with a single lookup table of pairs (of glyph ids), simply because there's no need to reorder glyphs.

The default glyph ids for individual LIS characters would be mapped to the default building blocks showing the "speech bubble frame". A basic renderer that does not process the font's GSUB (substitution) tables for ligatures would therefore still produce the basic glyphs and a consistent result, even if no decorated bubble would show the colorful content matching the "flag" a user would expect from a font whose design is based on country/region flags.

There is no requirement by Unicode about how the decorated glyphs will look, or about their use or color. Just as with fonts offering various styles for emoji, the font to use could be a user preference for the reader. There is no requirement to use an OpenType renderer either: applications can use icons in any convenient graphic format (GIF, PNG, SVG...) as long as their dimensions fit within the standard line height (not more than about 1.25 em in height, including top and bottom bearings). There is no requirement about their width either.
Basic font styles (bold, italic) could be rendered as well by the default glyphs, either on their inner letters or on the type of bubble frame, including for colorful bubbles whose generic "rounded rectangle" frame can also be "italicized" and emboldened even when it has colorful complex content. Nowhere will that mean that Unicode defines what is a valid language or not. All well-formed triplets are valid, and users are free to use 3-code sequences of LIS to do what they want as long as this respects the known ISO639 standard (or its history, including retired codes). So it will be well-formed to use LIS codes to "say" yes or YES, with LIS[Y]+LIS[E]+LIS[S] (but if there's an ISO 639 language matching the code "yes", it is also valid to replace it with a bubble showing a culturally associated "flag-like" decoration inside). French users could also use LIS[O]+LIS[U]+LIS[I] to "say" "oui" or "OUI", even if there's another ISO639 language matching the code "oui" (there's inherently no violation of the per-character identity of LIS characters, as Unicode does not encode ligatures or require them to be used for rendering). 2015-02-13 5:15 GMT+01:00 Philippe Verdy : > RIS could represent languages as well, using the BCP47 principle, except that > they start with an ISO > 3166 code (as there's no territory, you'd normally use a 3166 code for > undetermined region, but there's no 3166 code that starts with a hyphen). > So to use a BCP47 language tag you could use the hyphen reencoded to RIS > as the first character. > The problem is that language codes in BCP47 have variable sizes. Even if > you limit just to the ISO639 compatible repertoire (3 letter codes) you'd > need to use 4 RIS codes > And the language flags would be represented as RIS(HYPHEN)+RIS(ISO639-3 > code).
> > 4 codes would work with font rendering engines that can build 3 successive > ligatures from left to right > > If there's no match for a known flag (or if there's an exact multiple of 4 > RIS codes), the default glyphs would just show a blank flag frame showing > the RIS Code converted back to ASCII letters (rendered with a small > capitals style: where the first glyph shows the flag's hoist and the first > RIS code, i.e. the hyphen, the 2nd and 3rd glyphs show the top/bottom > part of the blank frame and the ASCII character, the 4th glyph is similar but > adds the flying end of the flag, possibly decorated with non rectangular > frame). If there remain fewer than 4 RIS codes, the flag frame would add > the flying end of the flag, with no letter (or just the SPACE). The whole > would be in a large dotted frame to exhibit the special format. > > These default glyphs are easy to produce in the font. Then to support more > languages (7000 languages : 7000 flags ? certainly not so many exist...), > you just have to map new ligatures to replace the default ligatures by more > accurate "flags". > > But my opinion is that "flags" (even if showing them generically) are not > the good concept for languages (I would highly prefer a "speech bubble > frame" like in comics, even if some applications could render in them a > colorful regional flag, or the letter code within the "sound waves" of an > audio speaker device). > > > 2015-02-09 22:11 GMT+01:00 Joan Montané : > >> >> Hi all, >> >> I am the one who made the request to twemoji Github. >> >> >> 2015-02-09 20:16 GMT+01:00 Markus Scherer : >> >>> On Mon, Feb 9, 2015 at 9:54 AM, Andrea Giammarchi < >>> andrea.giammarchi at gmail.com> wrote: >>> >>>> > if a cultural/language TLD is typed with Unicode RIS, then show the >>>> flag for these culture/language: >>>> >>> >>> This does not work. The "Unicode RIS" are defined to be used in pairs, >>> with semantics according to corresponding ISO 3166 alpha2 codes.
In your >>> examples, each successive pair will encode a flag. >>> >>> >> AFAIK, this is done on the font side. Emoji flags are just ligatures, so a >> font can provide a ligature for 4 RIS characters. This is not an issue here. >> >> I agree some strange behaviour can appear if a 3 RIS string, take CAT, is >> shown in a system with only 2 RIS support (a Canadian flag will appear followed >> by a T). >> >> >> If you want to represent every flag of every locality, you first have to >>> figure out how to catalog and label them. You are mentioning provinces, one >>> level down from nation states; I guess there are thousands of them. In much >>> of Europe, every little village >>> has its own flag and coat of >>> arms. Where do you want the text encoding and fonts to stop? >>> >>> >> I don't request flag support for every flag in the world. I requested >> flags for culture/language communities *with* an approved TLD (Top Level >> Domain). >> >> I know flags are an issue, and I know flags represent territories, not >> languages, but I think some support should be provided for these active >> communities. As I pointed out, some country flag collections expand with a few >> non-independent countries. See [1], [2] and [3] (search for Scottish or >> Welsh flag). You can check this [4] petition requesting Catalan flag on >> WhatsApp. >> >> So, there is a demand and they are used in the real world. What's the way for >> encoding them in the Unicode standard? >> >> Thanks, >> >> Joan Montané >> >> [1] http://www.famfamfam.com/lab/icons/flags/ >> [2] https://www.gosquared.com/resources/flag-icons/ >> [3] http://www.sherv.net/flag-emoticons.html >> [4] >> https://www.change.org/p/whatsapp-inc-incloure-la-senyera-de-catalunya-a-whatsapp >> >> > -------------- next part -------------- An HTML attachment was scrubbed...
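A side note on the pairing behaviour discussed in the quoted exchange above (Markus Scherer's point that RIS are interpreted in successive pairs, and Joan Montané's "CAT" example): the following is a minimal Python sketch, not taken from any message in this thread. The helper names to_ris, from_ris and pair_up are hypothetical; the only fixed fact assumed is that the Regional Indicator Symbols for A to Z occupy U+1F1E6 to U+1F1FF. It only illustrates the left-to-right grouping; what a given font or renderer actually displays for an unpaired symbol is platform-dependent.

```python
# Minimal sketch; the helper names below are hypothetical, not from any library.
RIS_A = 0x1F1E6  # U+1F1E6 REGIONAL INDICATOR SYMBOL LETTER A


def to_ris(letters):
    """Map ASCII letters A-Z to the corresponding Regional Indicator Symbols."""
    result = []
    for ch in letters.upper():
        if not ("A" <= ch <= "Z"):
            raise ValueError("not an ASCII letter: %r" % ch)
        result.append(chr(RIS_A + ord(ch) - ord("A")))
    return "".join(result)


def from_ris(symbols):
    """Inverse of to_ris, used here only to print readable letters."""
    return "".join(chr(ord(c) - RIS_A + ord("A")) for c in symbols)


def pair_up(ris_run):
    """Split a run of RIS characters into successive pairs, left to right.

    A trailing unpaired symbol is returned as a group of its own.
    """
    return [ris_run[i:i + 2] for i in range(0, len(ris_run), 2)]


groups = [from_ris(g) for g in pair_up(to_ris("CAT"))]
print(groups)  # ['CA', 'T']
```

Under pair-wise grouping, the three-symbol run for "CAT" splits into the pair CA (the ISO 3166 code for Canada) plus a lone T, which matches the behaviour Joan describes for systems that only know 2-RIS sequences.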
URL: From cjsvance at gmail.com Fri Feb 13 00:04:42 2015 From: cjsvance at gmail.com (Christopher Vance) Date: Fri, 13 Feb 2015 17:04:42 +1100 Subject: About cultural/languages communities flags In-Reply-To: References:

Message-ID: With ISO3166, there's almost always an objective answer to "what is the flag?". UA may be breaking up, but many of those opposed to the Kyiv government would prefer not to be in UA anyway. Sometimes there's a dispute as to which group is running a country, like in SY at the moment, but I'm guessing few would yet claim it's time to change the flag there. EH may be a problem. For languages, there's often no objective answer, unless you ask "which country has the most speakers?", and then you'd have to ask about first language vs second/third/etc. What flag for English? India, UK, US, or something else? What about sub-national language? I have been told there are more Tokelauans (and therefore to a first approximation speakers of Tokelauan) in Wellington NZ, than there are in Tokelau itself. Which flag for them? -- Christopher Vance
From pandey at umich.edu Fri Feb 13 01:03:01 2015 From: pandey at umich.edu (Anshuman Pandey) Date: Fri, 13 Feb 2015 02:03:01 -0500 Subject: sex and emoji In-Reply-To: <54DD1F6F.8090002@ix.netcom.com> References: <54DD1F6F.8090002@ix.netcom.com> Message-ID: Never would have imagined 'sex' and 'Unicode' in the memetic scene, but a big ol' ?? to the UTC! Kudos, rather ??. > On Feb 12, 2015, at 4:47 PM, Asmus Freytag wrote: > > To quote: "While this probably isn't news to fans of the eggplant emoji, ...."
> > More here: > > http://time.com/3694763/match-com-dating-survey-emoji-sex/ > > A./
From verdy_p at wanadoo.fr Fri Feb 13 05:04:54 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 13 Feb 2015 12:04:54 +0100 Subject: About cultural/languages communities flags In-Reply-To: References:

Message-ID: 2015-02-13 7:04 GMT+01:00 Christopher Vance : > With ISO3166, there's almost always an objective answer to "what is the > flag?". UA may be breaking up, but many of those opposed to the Kyiv > government would prefer not to be in UA anyway. Sometimes there's a dispute > as to which group is running a country, like in SY at the moment, but I'm > guessing few would yet claim it's time to change the flag there. EH may be > a problem. > > For languages, there's often no objective answer, unless you ask "which > country has the most speakers?", and then you'd have to ask about first > language vs second/third/etc. What flag for English? India, UK, US, or > something else? What about sub-national language? I have been told there > are more Tokelauans (and therefore to a first approximation speakers of > Tokelauan) in Wellington NZ, than there are in Tokelau itself. Which flag > for them? > This is completely a non-issue for the Unicode standard itself. There's ample enough space to use various designs that match character properties as well as user expectations *without* breaking the character identity itself. So even if the US flag is often used for English, British sites will use the British flag. In the Republic of Ireland they won't use the Irish flag for the English language (it is preferred for the Irish language itself) and are unlikely to use the British flag. In South Africa or India too, they won't use their national flag for English (multiple official languages there, and English is not even the preferred language). In those last cases they will prefer a neutral flag with just the letters "en" over the alternative with the US flag, or they will use a "patchwork" flag mixing the US flag and the British flag...
It's up to applications to use the set of glyphs that is appropriate for their own users, or to offer them a choice of fonts or icon sets, either in the UI of their input method or in keyboards (even physical keyboards, if they can display icons on small displays on top of keycaps, or on a row of virtual keys on a touch display panel above the keyboard, with the appropriate drivers installed to support the secondary display adapter and touch device), or up to vendors to sell stickers or custom keycaps. Applications can also offer the same choice as a preference in their text renderer (or web browser). Word processors can also offer it in their font selector, for those who want to produce preset documents with a design determined by the author, the web designers, or some predetermined graphic charter for collective works. This choice can include prefilled sets matching several common cultures, or various styles (such as flat rectangular flags, free-flying flags, or basic text in a blank flag frame). If users don't want to see the official national flags but prefer to see other icons matching their culture (including objects such as an Eiffel Tower for France, the logos of their regional councils, the logos of regional capitals, or a small locator map of the region), they can do so. All this remains valid for "flags" used to represent ISO regions, but will be valid as well for
From shervinafshar at gmail.com Fri Feb 13 09:20:49 2015 From: shervinafshar at gmail.com (Shervin Afshar) Date: Fri, 13 Feb 2015 07:20:49 -0800 Subject: sex and emoji In-Reply-To: <54DD1F6F.8090002@ix.netcom.com> References: <54DD1F6F.8090002@ix.netcom.com> Message-ID: Related opinion piece: "Are you a smug emoji snob? Chances are you're not getting laid" http://gu.com/p/45n8e/stw On Feb 12, 2015 1:52 PM, "Asmus Freytag" wrote: > To quote: "While this probably isn't news to fans of the eggplant emoji, ...." > > More here: > > http://time.com/3694763/match-com-dating-survey-emoji-sex/ > > A./
From shervinafshar at gmail.com Fri Feb 13 09:37:13 2015 From: shervinafshar at gmail.com (Shervin Afshar) Date: Fri, 13 Feb 2015 07:37:13 -0800 Subject: About cultural/languages communities flags In-Reply-To: References:

Message-ID: On Feb 13, 2015 3:12 AM, "Philippe Verdy" wrote: > This is completely a non-issue for the Unicode standard itself. There's ample enough space to use various designs that match character properties as well as user expectations *without* breaking the character identity itself. So even if the US flag is often used for English, British sites will use the British flag. In the Republic of Ireland they won't use the Irish flag for the English language (it is preferred for the Irish language itself) and are unlikely to use the British flag. In South Africa or India too, they won't use their national flag for English (multiple official languages there, and English is not even the preferred language). Are these statements about the use of flags for language selectors on websites based on some UX study, survey, or commonly accepted guideline, or are they just speculation?
From kenwhistler at att.net Fri Feb 13 10:12:51 2015 From: kenwhistler at att.net (Ken Whistler) Date: Fri, 13 Feb 2015 08:12:51 -0800 Subject: Language tags redux (was: Re: About cultural/languages communities flags) In-Reply-To: References:

Message-ID: <54DE2283.8090407@att.net> Philippe may have overlooked the fact that this has been tried (years ago) in the Unicode Standard. See: language tags. http://www.unicode.org/versions/Unicode7.0.0/ch23.pdf#G26419 The syntax for those even goes beyond just ISO 639-2/3 to incorporate the full range of BCP 47 tags, in principle. But the catch is that the language tag characters ended up *deprecated*, precisely because attempting to do this kind of thing in plain text is the wrong thing to do -- it interferes with the level-appropriate language tagging mechanisms available in markup. I see no point in speculating about reinventing this particular broken wheel one more time for the Unicode Standard. --Ken On 2/12/2015 9:22 PM, Philippe Verdy wrote: > Another solution is also to not extend the scope of use of RIS > characters (leave them as they are for ISO3166-1 based codes only), > but define a separate set with "Language Indicator Symbols" (LIS) > working the same way, but based on ISO 639-2 or -3 (3-letter codes, > accepting also the language family codes also encoded on 3 letters, as > well as all -3 macrolanguages such as "zho" for Chinese or "que" for > Quechua). > > > Nowhere will that mean that Unicode defines what is a valid language > or not. All well-formed triplets are valid, and users are free to use > 3-code sequences of LIS to do what they want as long as this respects > the known ISO639 standard (or its history, including retired codes). ... > >
From verdy_p at wanadoo.fr Fri Feb 13 11:09:52 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 13 Feb 2015 18:09:52 +0100 Subject: Language tags redux (was: Re: About cultural/languages communities flags) In-Reply-To: <54DE2283.8090407@att.net> References:

<54DE2283.8090407@att.net> Message-ID: I do not propose it as "language markup" but only as "visible" icons (independent of the language markup used in the text), similar to RIS icons in the Emoji set. This is *not* the same usage. In other words, these icons may be rendered with *translated* labels inside, or localized to the appropriate culture (just like flag icons), to represent the same "referenced language" (not necessarily the same as the "used language" of the document, which is indicated with the language markup).
From verdy_p at wanadoo.fr Fri Feb 13 11:13:10 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 13 Feb 2015 18:13:10 +0100 Subject: About cultural/languages communities flags In-Reply-To: References:

Message-ID: This is just experience from visiting sites that commonly use these flags to represent languages *visually* (inappropriately). And even if it is not the best way to represent languages, this is what happens (Unicode cannot interfere with freedom of speech or with the choice of authors who prefer visual icons to plain words).
From shervinafshar at gmail.com Fri Feb 13 11:41:15 2015 From: shervinafshar at gmail.com (Shervin Afshar) Date: Fri, 13 Feb 2015 09:41:15 -0800 Subject: Use of Flags as Language Identifier on the Web (was: About cultural/languages communities flags) Message-ID: I'm neither proposing nor implying what should or should not be done, or whether Unicode can or cannot interfere with anything anywhere. I'm just curious about the use of flags in language selectors or as visual language identifiers on websites, which you wrote about. I know of some organizations that strictly avoid using flags altogether to represent languages. Did you encounter that during your research? Also, do you have your research on this matter documented somewhere else so I can refer my colleagues in i18n to it? ? Shervin
From verdy_p at wanadoo.fr Fri Feb 13 12:20:23 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 13 Feb 2015 19:20:23 +0100 Subject: Use of Flags as Language Identifier on the Web (was: About cultural/languages communities flags) In-Reply-To: References: Message-ID: There are many examples, notably on the home pages of a lot of commercial sites (in their top bar), in the startup selectors of many mobile apps and popular games, and on various other sites, including translation tools and catalogues of dictionaries. Many printed dictionaries also show these flags on their cover, including well-known ones from famous brands such as Harraps or Larousse. The same goes for the official sites of various tourism information offices and museums, and for their printed leaflets. They do not support all languages with accurate translations, but they give a visual choice or indicator of the language this way. Many physical products use these flags on their printed labels or boxes and on enclosed leaflets, to list the components used or to describe their use, as this saves space on the limited area of the label or box. Most people cannot identify standard language codes correctly but recognize the flag commonly used to designate their language. These icons also replace bullet separators thanks to their visual impact: they are true symbols acting like punctuation, but more visible, so they save line breaks as well. Even if country flags are not culturally neutral for those languages they are very often sufficient for the few listed languages. And just as frequently we see packaging showing country codes instead of language codes. When producers realize that country flags are too culturally or politically loaded and do not want to show them, they just use region codes, more or less decorated (not always standard ISO codes, but like those on car plates). These uses are in fact very old, before standardisation of language codes, and they have not disappeared and likely will not in any expected short time frame. Now that the internet is available around the world, massively advertised and used daily for multiple activities, people know their country code but still not their language code...
From shervinafshar at gmail.com Fri Feb 13 13:37:32 2015 From: shervinafshar at gmail.com (Shervin Afshar) Date: Fri, 13 Feb 2015 11:37:32 -0800 Subject: Use of Flags as Language Identifier on the Web (was: About cultural/languages communities flags) In-Reply-To: References:

Message-ID:

Some of what you mentioned is relevant to the general topic in a very broad sense, but not relevant to the focus of the conversation we're having here; e.g. saving space in package design, replacing bullet separators, etc. Although not relevant to the conversation, still, as an i18n practitioner, I'd like to see them in a document with some figures and some references. See this[1] as an exquisite example.

> These uses are in fact very old, before standardisation of language codes
> and they have not disappeared and will likely not in any expected short
> time frame.

Is there an example of a multilingual document pre-dating ISO/TC 37 and ISO/R 639 which uses flags to distinguish text in different languages?

> Most people cannot identify standard language codes correctly but recognize
> the flag commonly used to designate their language.
> [...]
> Even if country flags are not culturally neutral for those languages they
> are very often sufficient for the few listed languages.

I agree with what you're saying about language codes sometimes being obscure to the common user. I also agree with what you said yesterday in the other thread about flags not being good for visually representing languages:

On Thu, Feb 12, 2015 at 8:15 PM, Philippe Verdy wrote:
> But my opinion is that "flags" (even if showing them generically) are not
> the good concept for languages

All said and done, it seems to me there are always better ways to represent languages in software UIs. A very large-scale and illustrative example is the Wikimedia Foundation's Universal Language Selector[2]. It is used on most WMF projects to switch between hundreds of languages, and it uses neither flags nor language codes in its UI. See the design notes[3].

[1]: http://www.w3.org/TR/jlreq/
[2]: https://www.mediawiki.org/wiki/Universal_Language_Selector
[3]: https://www.mediawiki.org/w/index.php?title=Universal_Language_Selector/Interaction_Design_Framework#Iconography_to_represent_languages

?
Shervin On Fri, Feb 13, 2015 at 10:20 AM, Philippe Verdy wrote: > There are many examples and notably on home pages of a lot of commercial > sites un their top bar and in startup selectors of many mobile apps or in > popular games or on various including translation tools or catalogues of > dictionnaires ans manu printed dictionbaries show these flags on their > cover, including wellknown ones from famous brands such as Harraps or > Larousse. > Or on official sites of various tourism information offices and museums on > their printed leaflets or on museums. They do not support all languages > with accurate translations but are giving a visual choice or indicator of > the language this way. > Many physical products use these flags on their printed labels or boxes > and embedded leaflets for listing used components or describe their use. As > this saves space on the limited size of the label or box. > Most people cannot identify standard language codes correctly but > recognize the flag commonly used to designate their language. > These icons also replace bullet separators for their visual impact, they > are true symbols acting like ponctuation, but more visible si they allow > saving newlines as well. > Even if country flags are not culturally neutral for those languages they > are very often sufficient for the few listed languages. > And with the same frequency we see packagings showing country codes > instead of language codes. > When they realize that country flags are too much culturally/politically > oriented they do not want tout show them will juste use region codes, more > less decorated (not always standard ISO codes but like on car plates). > These uses are on fact very old, before standardisation of language codes > and they have notre disappeared and will likely not in any expected short > time frame. 
Now with the internet available around the world, massively > advertized and used daily in multiple times or activities, people know > their country code but still not their langage code... > Le 13 f?vr. 2015 18:42, "Shervin Afshar" a > ?crit : > > I'm neither proposing nor implying what should or should not be done or >> whether Unicode can or can not interfere with anything anywhere. I'm just >> curious about use of flags in language selectors or as visual language >> identifier on websites which you wrote about. >> >> I know of some organizations that strictly avoid using flags altogether >> to represent languages. Did you encounter that during your research? >> >> Also, do you have your research on this matter documented somewhere else >> so I can refer my colleagues in i18n to it? >> >> >> ? Shervin >> >> On Fri, Feb 13, 2015 at 9:13 AM, Philippe Verdy >> wrote: >> >>> This is just experience of visiting sites commonly using these flags to >>> represent (inappropriately) languages *visually*. And even if it is not the >>> best way to represent languages, this is what happens (Unicode cannot >>> interfer with the freedom of speech and the choice of authors if they >>> prefer visual icons to plain words). >>> >>> >>> 2015-02-13 16:37 GMT+01:00 Shervin Afshar : >>> >>>> >>>> On Feb 13, 2015 3:12 AM, "Philippe Verdy" wrote: >>>> >>>> > This is completely a non-issue with the Unicode standard itself. >>>> There's an ample enough space to use various designs that match character >>>> properties as well as user expectations *without* breaking the character >>>> identity itself. So even if the US flag is often used for English, in >>>> Britanic sites they will use the British flag. In the Republic of Ireland >>>> they'll won't use the Irish flag for the English language (prefered for the >>>> Irish language itself) and will unlikely use the British flag. 
In South >>>> Africa or India to, they won't use their national flag for English >>>> (multiple official languages there, and English is not even the preferred >>>> language). >>>> >>>> Are these statements about use of flags for language selectors on >>>> websites, based on some UX study, survey, or commonly accepted guideline, >>>> or are they just speculations? >>>> >>> >>> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Fri Feb 13 16:33:17 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 13 Feb 2015 23:33:17 +0100 Subject: Use of Flags as Language Identifier on the Web (was: About cultural/languages communities flags) In-Reply-To: References:

Message-ID:

2015-02-13 20:37 GMT+01:00 Shervin Afshar :

> Some of what you mentioned are relevant to the general topic in a very
> broad sense, but not relevant to the focus of the conversation we're having
> here; e.g. saving space in package design, replacing bullet separators,
> etc. Although not relevant to the conversation, still as an i18n
> practitioner, I'd like to see them in a document with some figures and some
> references. See this[1] as an exquisite example.
>
>> These uses are in fact very old, before standardisation of language codes
>> and they have not disappeared and will likely not in any expected short
>> time frame.
>
> Is there an example of a multilingual document pre-dating ISO/TC 37 and
> ISO/R 639 which uses flags to distinguish text in different languages?

My sentence was more generic than that. It was about the old practice of using things that identify countries or regions when the real intent was to represent languages, independently of the regions where a language is supposed to be "mostly" spoken (an assumption that is false for languages spoken far more in other places than in their native region).

So various things associated with places (rather than languages) have been used and continue to be used:

* more or less abbreviated country/region names (often altered locally, or using imaginative/poetic descriptions at best, or frequently insulting slang words for these regions)
* the standard name of these regions, even if the language is no longer spoken there; this has the side effect that those who speak the language today are considered "strangers" within their current country
* the new name of the region once it has been occupied by another ruler (the old name, used when the region was still self-governing, being prohibited)
* iconic representations of various objects typical of the region (e.g. an icon of the Eiffel Tower to designate Paris or France, the Colosseum to represent Rome or Italy, or likewise the Tower of Pisa, or a pyramid to represent Egypt) as a way to designate the language that is mostly spoken there or originates from there; well-known monuments of the region are the most used
* But you'll also see (notably in sports) a frog or a peacock to represent France, and other natural elements symbolizing historical events in the nations of the UK. Frequently these elements are also part of today's flags (e.g. the maple leaf for Canada, the ermine for Brittany)
* Flags **of course** for these regions (but there are disagreements about the choice of flag, as well as about the geographical border of the region where the language is spoken or originates)
* Coats of arms
* National colors in some arrangement (far from the actual form of the flag, even if the flag includes these colors)
* Iconic representations of the region's borders (often only the borders remaining in today's countries)
* Religious and esoteric symbols
* Other non-iconic symbols of these regions (flags are not the only official symbols of today's countries): it could be a few notes of an anthem, or a famous song or piece of music by a musician from that region (which European country do you think three apples may mean in Romance countries? You have to think about it phonetically; and then with which European language will you associate these three apples?)
* Portrait photos or sculptures of famous persons from that region, notably the most famous artists (e.g. look into the per-language categories of the "Languages" category on several editions of Wiktionary); frequently these are poets, writers, playwrights
* Common phrases attributing an object to the country or region (a convention used in East Asian regions, replacing country names without relying on any phonological similarity). Those phrases are also depicted iconically on their flags (e.g. Japan).
...

In all those cases, there's a common confusion between designating regions and designating languages. Politically, it seems that most countries want to tie their concept of nation and its associated territory to a language, and want that language to be named the same way they name the region. So most frequently, the "gentilés" (demonyms) derived from the region name to designate the people of that region are used as adjectives qualifying everything used by people of, or from, that region, and this includes their language.

Human history has, over many centuries, a huge record of dramatic events caused by this confusion of cultures, languages and peoples with regions, by their current winning rulers as well as by occupiers and occupied countries. This is still the case today, and new events come almost every day to remind us of it. This contaminates the basic concept of "nation" and even the way we write and pronounce languages.

From shervinafshar at gmail.com Fri Feb 13 16:46:05 2015 From: shervinafshar at gmail.com (Shervin Afshar) Date: Fri, 13 Feb 2015 14:46:05 -0800 Subject: Use of Flags as Language Identifier on the Web (was: About cultural/languages communities flags) In-Reply-To: References:

Message-ID: I see. It all make sense to me now. For some reason, I was of the impression that we are talking about flags and language codes here. ? Shervin On Fri, Feb 13, 2015 at 2:33 PM, Philippe Verdy wrote: > > > 2015-02-13 20:37 GMT+01:00 Shervin Afshar : > >> Some of what you mentioned are relevant to the general topic in a very >> broad sense, but not relevant to the focus of the conversation we're having >> here; e.g. saving space in package design, replacing bullet separators, >> etc. Although not relevant to the conversation, still as an i18n >> practitioner, I'd like to see them in a document with some figures and some >> references. See this[1] as an exquisite example. >> >> >>> These uses are on fact very old, before standardisation of language >>> codes and they have notre disappeared and will likely not in any expected >>> short time frame. >> >> >> Is there an example of a multilingual document pre-dating ISO/TC 37 and >> ISO/R 639 which uses flags to distinguish text in different languages? >> >>> > My sentence was more generic than that. It was about the old practice of > using things identifies countries/regions where the real meaning was to > represent languages (independantly of regions where it is supposed to be > "mostly" spoken (false for languages that are much more spoken in other > places than their native region. > So various things associated to places (rather than languages) have been > used and continue to be used: > * more or less abbreviated coutnry/region names (often altered locally or > using imaginative/poetic descriptions at best, or frequently as well using > insulting slang words for these regions names) > * the standard name of these regions (even if the language is no longer > spoken there: it has the side effect that those that speak the language > today are considered as "strangers" within their current country. 
> * the new name of the region once it has become an region occupied by > another ruler (the old name used when that region was still self-governing > is prohibited. > * iconic representations of various objects typical of this region (e.g. > using an icon of the Eiffel Tower to designate Paris, or France, or an > iconic representation of the Colyseum to erpresent Rome Italy, or the Tower > of Pise as well, or a Pyramid to represent Egypt) as a way to designate the > language that is mostly spoken there or originates from there; wellknown > monuments in this region are the most used > * But you'll see also (notably in sports) a frog or a peacok to represent > France, an other natural elements symbolizing historical events in nations > of UK. Frequently these elements may be also part of today's flags (e.g. > the mapple leaf for Canada, the hermine for Britanny) > * Flags **of course** for these regions (but there are disagreements about > the choice of Flag, as well as to the graographical border of the region > where that language is spoken or originates) > * Coats of arms > * National colors in some arrangements (far from the effective form of the > flag even if it includes these colors). > * Iconic representation of the region borders (often only the borders > remaining in today's countries) > * Religious and esotheric symbols > * Other non inconic symbols of these regions (flags are not the only > official symbols of today's countries) : it could be some notes of an > anthem, or a a famous song or music from a musician of that region (which > European country do you think the three apples may mean in Romance > countries ? you have to think about it phonetically, and then to which > European language will you associate these three apples ?) > * Photos of portraits, or scultpures of famous persons from that region, > notably the most famous artists (e.g. 
look into per-language categories of > the "Languages" category on several editions of Wiktionnary),frequentlty > these are poets, writers, dramaturges. > * Common sentences attributing object to the country or region (a standard > used in East Asian regions, and replacing country names without using any > phonologic similarity). Those sentences are also depicted iconically on > their flags (e.g. Japan). > ... > > In all those cases, there's a common confusion between designating regions > and languages (and politically it seems that most countries want to define > their concept of nation and associated territory to a language and want > that language to be named according to the way theur also name the region. > So most frequenty, the "gentil?s" derived friom the region name to > designate people of that region are used as adjectives qualifying every > subject used by people of this region or from hat region (and these include > theur language) > > Human history, since many centuries, has a huge record of dramatic events > caused by this confusion of cultures/languages/peoples with regions by > their current winning rulers as well as by their occupants and occupied > countruesx. This is stil lthe case today and new events are coming almost > every day to recall it. This contaminates the basic concept of "nation" and > even th way we write and pronounce languages. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Sat Feb 14 07:53:16 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 14 Feb 2015 14:53:16 +0100 Subject: Unicode block for programming related symbols and codepoints? In-Reply-To: <54D7DCDD.9060003@colson.eu> References: <54D7C3EA.6080000@web.de> <54D7DCDD.9060003@colson.eu> Message-ID: But the TAB is still the whitespace character you describe that is accepted in the programming language using it. 
Defining a new code point would require the lexical analyzers of these languages to be modified (that is, you would be modifying the languages themselves). Clearly, given that the lexical items these programming languages accept for the functions you describe form a very closed subset, you cannot substitute something else for them.

All you describe is a matter of UI design for code editors, which will still scan the edited sources looking for TABs, and not for your custom character, in order to display them in a custom way, according to the preferences of the programmer.

We are in fact not talking about character identities. The only significant identity here is that of the original characters in the source text, and a code editor will not alter it even if it displays them differently: it only "displays" them, it does not replace them, unless the programmer actually makes a change to the source code (such as reindenting or compressing whitespace, or using a source code beautifier/reformatter). Such a reformatter is safe to use ONLY if the editor recognizes not only the source characters but also the syntax of the source language. So not only must it be able to read and scan the source, it must also know which programming language you are using. Generally it relies on the file extension of the source file; if you have not yet given your source a filename by saving it (or by adding a node to your source tree in your IDE), you can still select the programming language in the editor's menu.

The same editor can then present the source program in any convenient presentation that matches the expectations and needs of the programmers using it: it will typically provide syntax coloring, and it will group/ungroup blocks of source lines by detecting the syntax used to delimit blocks (punctuation, begin/end keywords, indentation, statement separators or operators, operator precedence...).

The presentation will never depend on your new "character". A new symbolic character is not the only or the best way to present the structure of a program, because programmers' needs sit at a higher level than isolated characters: they are based on the higher-level parsing syntax of blocks, statements and operations. The program can then be presented in a tree view listing nodes with sorted lists of properties, where a property value can itself be another tree.

The tree is also not the only option: you could just as well have rectangular blocks that you can expand or collapse, appearing as multiline blocks of rich text containing other blocks.

Additionally, there could be several superposed structures that are not hierarchically embedded (e.g. one for a line-based preprocessor, another for the code as it would be understood by the next layer, after preprocessing).

And even in programming languages, there exist structures that do not obey a strictly hierarchical structure, e.g. SGML and HTML, where an element can freely close the scope of /many/ previously opened /blocks/, and not just the one at the top of the stack. When you close an element that is not at the top of the stack, the existing top of stack /may/ remain at the top of the stack, or could be closed implicitly, according to complex matching rules (which depend on the properties of all elements in the stack between the element you are explicitly closing and the element at the top of the stack).

2015-02-08 23:02 GMT+01:00 Jean-François Colson :

> Le 08/02/15 22:32, Pierpaolo Bernardi a écrit :
> > On Sun, Feb 8, 2015 at 9:15 PM, Alfred Zett wrote:
> > […]
> >
> > -- unlike tabs or space, it wouldn't be whitespace
> > […]
> >
> > a Tab is exactly what you described.
>
> Not exactly: a tab IS whitespace.
> It may sometimes be displayed in a different color or with a special
> symbol on request if the editor allows it, but in most cases it is
> whitespace.
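The point made above, that an editor can show a TAB with a special symbol without changing the character stored in the source, can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `render_for_display`, the arrow symbol and the default tab width are hypothetical choices, not taken from any particular editor.

```python
VISIBLE_TAB = "\u2192"  # RIGHTWARDS ARROW, an arbitrary choice for this sketch


def render_for_display(source: str, tab_width: int = 4) -> str:
    """Return a display-only copy of `source` with tabs made visible.

    Each TAB becomes an arrow padded with spaces up to the next tab stop.
    The caller keeps `source` untouched; only the returned view differs.
    """
    rendered_lines = []
    for line in source.split("\n"):
        column = 0
        pieces = []
        for ch in line:
            if ch == "\t":
                # Width needed to reach the next tab stop from this column.
                width = tab_width - (column % tab_width)
                pieces.append(VISIBLE_TAB + " " * (width - 1))
                column += width
            else:
                pieces.append(ch)
                column += 1
        rendered_lines.append("".join(pieces))
    return "\n".join(rendered_lines)


source = "if x:\n\treturn\ty"       # what is stored on disk, TABs included
view = render_for_display(source)   # what the editor would paint
```

The stored text still contains U+0009, so the language's lexer sees ordinary whitespace; only the painted view carries the arrow.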
> > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Sat Feb 14 08:23:10 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 14 Feb 2015 15:23:10 +0100 Subject: Unicode block for programming related symbols and codepoints? In-Reply-To: References: <54D7C3EA.6080000@web.de> <54D7E2E2.6080705@web.de> Message-ID: 2015-02-08 23:54 GMT+01:00 Pierpaolo Bernardi : > On Sun, Feb 8, 2015 at 11:27 PM, Alfred Zett wrote: > > > That was exactly my thought, so I figured it couldn't harm to have these > > >> a Tab is exactly what you described. > > > > No. It's only half of what I described. > > It's still a typographical character that implies whitespace and may > appear > > everywhere in the text. > > How would your proposed character be displayed as plain text? > You new language will have to invent another language syntax for exporting and serializing its native source into a plain text file. It will certainly use an escaping syntax (such as the commn use of backslahes), but that syntax will be a traditional syntax for traditional programs. And standard ASCII or UTF-8 encodings using standard characters will be largely enough. 
Your programming tool will need a separate serializer and a separate parser for that alternate syntax, or it could reuse existing parsers (such as XML and JSON serializers and parsers, or existing generic libraries handling rich text documents containing embedded collections) and an API more or less like the DOM APIs, offered with an adapting layer of "bindings" for many other languages, with a binary interface, an SQL-like interface, or other convenient interfaces such as common collections and associative arrays, or containers like ZIP/JAR files.

The programmer will in fact not have to edit these complex source files directly, but may look inside them with tricky tools that could corrupt their internal structure of references. Programmers will just use the specific IDE made for your language and select a file or resource (e.g. a network service) using that custom syntax; it will be loaded (or queried) so that they can edit some viewable and editable parts of the program, while much of the internal data used in the native format (notably the purely internal references and pointers) will be hidden from them and will change without notice, preserving the intended structure of your language.

In many modern environments, a single programmer cannot in fact reprogram the whole project but can only edit some parts of it. There are privileged operations (reserved to some groups of users), and some parts change in parallel and can be edited by teams of programmers, designers and correctors; this requires another system to coordinate the work and resolve edit conflicts, or to create alternate branches that someone else will merge into the common trunk. Programmers create their own branches, not seen by others, until they submit a proposed branch for review by more privileged users.
This does not mean that, if a branch is rejected for merging into the trunk, the branch will necessarily be deleted: the programmer or designer can still use their own branch without affecting other users who use the common trunk, who design or use their own branches, or who want to keep an older version of the trunk and ignore new versions.

We are clearly out of scope for Unicode here, because we are no longer speaking about text but about programming tools and services, and about models of operation for cooperating teams. Those teams include various types of people: not just designers and programmers but also end users and customers, who create their own customizations, add their own features and data, and interoperate using various "programming languages" and tools with various UIs that are friendlier than traditional linear, text-based programming languages.

From 0.le.phare.ouest at gmail.com Sat Feb 14 12:12:35 2015 From: 0.le.phare.ouest at gmail.com (Antoine Méric) Date: Sat, 14 Feb 2015 19:12:35 +0100 Subject: sex and emoji In-Reply-To: References: <54DD1F6F.8090002@ix.netcom.com> Message-ID: <54DF9013.2020406@gmail.com>

I was wondering, has the question « Given the massive usage over time of the glyph, and the number of academic papers about it, should we consider adding a PHALLIC REPRESENTATION to the Unicode standard? » ever been asked?

Seriously,
Antoine MÉRIC

Le 13/02/2015 08:03, Anshuman Pandey a écrit :
> Never would have imagined 'sex' and 'Unicode' in the memetic scene,
> but a big ol' ?? to the UTC! Kudos, rather ??.
>
> On Feb 12, 2015, at 4:47 PM, Asmus Freytag wrote:
>
>> To quote: "While this probably isn't news to fans of the eggplant
>> emoji,
>> ...."
>>
>> More here:
>>
>> http://time.com/3694763/match-com-dating-survey-emoji-sex/
>>
>> A./

From timpart at perdix.demon.co.uk Mon Feb 16 01:25:12 2015 From: timpart at perdix.demon.co.uk (Tim Partridge) Date: Mon, 16 Feb 2015 07:25:12 +0000 Subject: sex and emoji In-Reply-To: <54DF9013.2020406@gmail.com> References: <54DD1F6F.8090002@ix.netcom.com> , <54DF9013.2020406@gmail.com> Message-ID: <8C324C32065663409974565298FC6EC52D1BACB8@exmbx04.thus.corp>

Antoine Méric said:

> I was wondering, has the question « Given the massive usage over time
> of the glyph, and the number of academic papers about it, should we consider
> adding a PHALLIC REPRESENTATION to the unicode standard ? » ever been asked ?

The Ancient Egyptians had some glyphs to act as determinatives for words that relate to that topic, and they are encoded in the standard. See U+130B8 to U+130BA.

Tim

From ian.clifton at chem.ox.ac.uk Mon Feb 16 05:48:07 2015 From: ian.clifton at chem.ox.ac.uk (Ian Clifton) Date: Mon, 16 Feb 2015 11:48:07 +0000 Subject: sex and emoji References: <54DD1F6F.8090002@ix.netcom.com> <54DF9013.2020406@gmail.com> <8C324C32065663409974565298FC6EC52D1BACB8@exmbx04.thus.corp> Message-ID: <4qd25a2i2w.fsf@chem-arachne.chem.ox.ac.uk>

Tim Partridge writes:

> Antoine Méric said
>> I was wondering, has the question « Given the massive usage over time
>> of the glyph, and the number of academic papers about it, should we consider
>> adding a PHALLIC REPRESENTATION to the unicode standard ? » ever been asked ?
>
> The Ancient Egyptians had some glyphs to act as determinatives for
> words that relate to that topic, and they are encoded in the standard.
> See U+130B8 to U+130BA.

Good grief, I don't like the look of U+130B9. Maybe I don't want to know what's going on ??.

--
Ian

From timpart at perdix.demon.co.uk Mon Feb 16 13:42:17 2015 From: timpart at perdix.demon.co.uk (Tim Partridge) Date: Mon, 16 Feb 2015 19:42:17 +0000 Subject: sex and emoji In-Reply-To: <4qd25a2i2w.fsf@chem-arachne.chem.ox.ac.uk> References: <54DD1F6F.8090002@ix.netcom.com> <54DF9013.2020406@gmail.com> <8C324C32065663409974565298FC6EC52D1BACB8@exmbx04.thus.corp>, <4qd25a2i2w.fsf@chem-arachne.chem.ox.ac.uk> Message-ID: <8C324C32065663409974565298FC6EC52D1BAD13@exmbx04.thus.corp>

Ian Clifton said:

> Good grief, I don't like the look of U+130B9. Maybe I don't want to know
> what's going on ??.

I'm not an egyptologist, but I think it's just a scribal ligature between U+132F4 and U+130B8. The former is just a folded piece of cloth representing the sound /s/. Usually a long thin sign is combined with a following sign to save space. These two wouldn't fit together well, so I guess the scribes decided to just put one on top of the other. There are other similar examples in the code charts.

Tim

From eliz at gnu.org Thu Feb 19 04:55:20 2015 From: eliz at gnu.org (Eli Zaretskii) Date: Thu, 19 Feb 2015 12:55:20 +0200 Subject: Compatibility decomposition for Hebrew and Greek final letters Message-ID: <83d25641d3.fsf@gnu.org>

Does anyone know why does the UCD define compatibility decompositions for Arabic initial, medial, and final forms, but doesn't do the same for Hebrew final letters, like U+05DD HEBREW LETTER FINAL MEM? Or for that matter, for U+03C2 GREEK SMALL LETTER FINAL SIGMA?

The relevant application where this would matter is text search, where these letters might be folded to the same code point for the purposes of comparison.
TIA From everson at evertype.com Thu Feb 19 05:21:19 2015 From: everson at evertype.com (Michael Everson) Date: Thu, 19 Feb 2015 11:21:19 +0000 Subject: Compatibility decomposition for Hebrew and Greek final letters In-Reply-To: <83d25641d3.fsf@gnu.org> References: <83d25641d3.fsf@gnu.org> Message-ID: <03FFAE2B-CD9A-470C-BCCB-62001B95F4CF@evertype.com> On 19 Feb 2015, at 10:55, Eli Zaretskii wrote: > Does anyone know why does the UCD define compatibility decompositions > for Arabic initial, medial, and final forms, but doesn't do the same > for Hebrew final letters, like U+05DD HEBREW LETTER FINAL MEM? Or for > that matter, for U+03C2 GREEK SMALL LETTER FINAL SIGMA? > > The relevant application where this would matter is text search, where > these letters might be folded to the same code point for the purposes > of comparison. Such comparisons happen at a different level, I think. Michael Everson * http://www.evertype.com/ From eliz at gnu.org Thu Feb 19 05:30:22 2015 From: eliz at gnu.org (Eli Zaretskii) Date: Thu, 19 Feb 2015 13:30:22 +0200 Subject: Compatibility decomposition for Hebrew and Greek final letters In-Reply-To: <03FFAE2B-CD9A-470C-BCCB-62001B95F4CF@evertype.com> References: <83d25641d3.fsf@gnu.org> <03FFAE2B-CD9A-470C-BCCB-62001B95F4CF@evertype.com> Message-ID: <837fve3zqp.fsf@gnu.org> > From: Michael Everson > Date: Thu, 19 Feb 2015 11:21:19 +0000 > > On 19 Feb 2015, at 10:55, Eli Zaretskii wrote: > > > Does anyone know why does the UCD define compatibility decompositions > > for Arabic initial, medial, and final forms, but doesn't do the same > > for Hebrew final letters, like U+05DD HEBREW LETTER FINAL MEM? Or for > > that matter, for U+03C2 GREEK SMALL LETTER FINAL SIGMA? > > > > The relevant application where this would matter is text search, where > > these letters might be folded to the same code point for the purposes > > of comparison. > > Such comparisons happen at a different level, I think. 
Sorry, I'm not sure I follow: different from what?

In any case, regardless of the level, if there's no data to support
such "folding", how can applications implement it (except by inventing
its own data)?

Also, perhaps there are some deep linguistic reasons why such folding
might be inappropriate, and that's why the UCD doesn't define such
decompositions?

Thanks.

From jcb+unicode at inf.ed.ac.uk Thu Feb 19 05:47:24 2015
From: jcb+unicode at inf.ed.ac.uk (Julian Bradfield)
Date: Thu, 19 Feb 2015 11:47:24 GMT
Subject: Compatibility decomposition for Hebrew and Greek final letters
References: <83d25641d3.fsf@gnu.org>
Message-ID:

On 2015-02-19, Eli Zaretskii wrote:
> Does anyone know why does the UCD define compatibility decompositions
> for Arabic initial, medial, and final forms, but doesn't do the same
> for Hebrew final letters, like U+05DD HEBREW LETTER FINAL MEM? Or for
> that matter, for U+03C2 GREEK SMALL LETTER FINAL SIGMA?

As far as I understand it:

In Arabic, the variant of a letter is determined entirely by its
position, so there is no compelling need to represent the forms separately
(as characters rather than glyphs) save for the existence of legacy
standards (and if there is, you can use the ZWJ/ZWNJ hacks). Thus the
forms would not have been encoded but for the legacy standards.
Whereas in Hebrew, non-final forms appear finally in certain contexts
in normal text; and in Greek, while Greek text may have a determinate
choice between σ and ς, there are many contexts where the two symbols
are distinguished (not least maths).

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

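[Editorial illustration] The distinction discussed above can be checked against the UCD data that ships with Python's standard `unicodedata` module. The sketch below is only a rough illustration, not anything proposed in the thread: it prints the decomposition field for one Arabic presentation form and for the two final letters named in the question, then applies a small folding table. The table `FINAL_FORM_FOLD` and the helper `search_fold` are hypothetical names invented here, an example of the application-defined data the question refers to, and the table is not part of the UCD.

```python
import unicodedata

# Decomposition field from the UCD; an empty string means "none defined".
for ch in ("\uFE90",    # ARABIC LETTER BEH FINAL FORM
           "\u05DD",    # HEBREW LETTER FINAL MEM
           "\u03C2"):   # GREEK SMALL LETTER FINAL SIGMA
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"{unicodedata.decomposition(ch)!r}")

# Hypothetical application-defined folding table (NOT UCD data):
# maps each final form to the corresponding non-final letter.
FINAL_FORM_FOLD = str.maketrans({
    "\u05DA": "\u05DB",  # HEBREW FINAL KAF   -> KAF
    "\u05DD": "\u05DE",  # HEBREW FINAL MEM   -> MEM
    "\u05DF": "\u05E0",  # HEBREW FINAL NUN   -> NUN
    "\u05E3": "\u05E4",  # HEBREW FINAL PE    -> PE
    "\u05E5": "\u05E6",  # HEBREW FINAL TSADI -> TSADI
    "\u03C2": "\u03C3",  # GREEK FINAL SIGMA  -> SIGMA
})


def search_fold(text: str) -> str:
    """Fold text for loose matching.

    NFKC applies the compatibility decompositions the UCD does define
    (so Arabic presentation forms collapse to the base letter); the
    hypothetical table above then folds the Hebrew and Greek finals.
    """
    return unicodedata.normalize("NFKC", text).translate(FINAL_FORM_FOLD)


# A word ending in FINAL MEM compares equal to the same letters
# spelled with a non-final MEM once both sides are folded.
print(search_fold("\u05E9\u05DC\u05D5\u05DD")
      == search_fold("\u05E9\u05DC\u05D5\u05DE"))
```

Whether such folding is appropriate is exactly the point raised above: a search feature built this way would presumably need to stay under user control, since the final and non-final forms are distinguished in some contexts.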
From eliz at gnu.org Thu Feb 19 05:59:44 2015
From: eliz at gnu.org (Eli Zaretskii)
Date: Thu, 19 Feb 2015 13:59:44 +0200
Subject: Compatibility decomposition for Hebrew and Greek final letters
In-Reply-To:
References: <83d25641d3.fsf@gnu.org>
Message-ID: <834mqi3ydr.fsf@gnu.org>

> Date: Thu, 19 Feb 2015 11:47:24 GMT
> From: Julian Bradfield
>
> In Arabic, the variant of a letter is determined entirely by its
> position, so there is no compelling need to represent the forms separately
> (as characters rather than glyphs) save for the existence of legacy
> standards (and if there is, you can use the ZWJ/ZWNJ hacks). Thus the
> forms would not have been encoded but for the legacy standards.
> Whereas in Hebrew, non-final forms appear finally in certain contexts
> in normal text; and in Greek, while Greek text may have a determinate
> choice between σ and ς, there are many contexts where the two symbols
> are distinguished (not least maths).

Got it, thanks.

From verdy_p at wanadoo.fr Thu Feb 19 13:31:07 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 19 Feb 2015 20:31:07 +0100
Subject: Compatibility decomposition for Hebrew and Greek final letters
In-Reply-To: <834mqi3ydr.fsf@gnu.org>
References: <83d25641d3.fsf@gnu.org> <834mqi3ydr.fsf@gnu.org>
Message-ID:

The decompositions are not needed for plain text searches, that can use the
collation data (with the collation data, you can unify at the primary level
differences such as capitalisation and ignore diacritics, or transform some
base groups of letters into a single entry, or make some significant primary
difference when there are diacritics (for example in German equating 'ae' and
'ä' at the primary level).

Yes, collation must use the canonical decompositions, but does not need to
follow the compatibility decompositions for all locales (even if this is
done for the root locale and the DUCET... with some exceptions considering
the rules for the most important language using an encoded letter and all
its *canonical* equivalents).

Compatibility decompositions in the UCD have little use, they should be
preserved in encoded texts and transformations of text, they are just
suggestions which *may* be useful:
- for rendering text (the most important use is in character mappings
  within fonts, or in fallback mappings implemented in the rendering engine),
- or for mappings to legacy encodings (e.g. when converting to GSM for SMS
  services, or converting for display in text-only devices and terminals
  using a limited OEM charset)

2015-02-19 12:59 GMT+01:00 Eli Zaretskii :

> > Date: Thu, 19 Feb 2015 11:47:24 GMT
> > From: Julian Bradfield
> >
> > In Arabic, the variant of a letter is determined entirely by its
> > position, so there is no compelling need to represent the forms separately
> > (as characters rather than glyphs) save for the existence of legacy
> > standards (and if there is, you can use the ZWJ/ZWNJ hacks). Thus the
> > forms would not have been encoded but for the legacy standards.
> > Whereas in Hebrew, non-final forms appear finally in certain contexts
> > in normal text; and in Greek, while Greek text may have a determinate
> > choice between σ and ς, there are many contexts where the two symbols
> > are distinguished (not least maths).
>
> Got it, thanks.

From eliz at gnu.org Thu Feb 19 14:17:30 2015
From: eliz at gnu.org (Eli Zaretskii)
Date: Thu, 19 Feb 2015 22:17:30 +0200
Subject: Compatibility decomposition for Hebrew and Greek final letters
In-Reply-To:
References: <83d25641d3.fsf@gnu.org> <834mqi3ydr.fsf@gnu.org>
Message-ID: <83zj89lkpx.fsf@gnu.org>

> From: Philippe Verdy
> Date: Thu, 19 Feb 2015 20:31:07 +0100
> Cc: Julian Bradfield ,
> unicode Unicode Discussion
>
> The decompositions are not needed for plain text searches, that can use the
> collation data (with the collation data, you can unify at the primary level
> differences such as capitalisation and ignore diacritics, or transform some
> base groups of letters into a single entry, or make some significant primary
> difference when there are diacritics (for example in German equating 'ae' and
> 'ä' at the primary level).

Sorry, I disagree. First, collation data is overkill for search,
since the order information is not required, so the weights are simply
wasting storage. Second, people do want to find, e.g., "?" when they
search for "2" etc. I'm not saying that they _always_ want that, but
sometimes they do. There's no reason a sophisticated text editor
shouldn't support such a feature, under user control.

From markus.icu at gmail.com Thu Feb 19 15:08:57 2015
From: markus.icu at gmail.com (Markus Scherer)
Date: Thu, 19 Feb 2015 13:08:57 -0800
Subject: Compatibility decomposition for Hebrew and Greek final letters
In-Reply-To: <83zj89lkpx.fsf@gnu.org>
References: <83d25641d3.fsf@gnu.org> <834mqi3ydr.fsf@gnu.org> <83zj89lkpx.fsf@gnu.org>
Message-ID:

On Thu, Feb 19, 2015 at 12:17 PM, Eli Zaretskii wrote:

> Sorry, I disagree. First, collation data is overkill for search,
> since the order information is not required, so the weights are simply
> wasting storage. Second, people do want to find, e.g., "?" when they
> search for "2" etc.
>

Depends on what you do. "the weights are simply wasting storage" is not
really true, you do have to encode something for which characters are same
or different, and it turns out that that comes close to defining a sort
order. Some people also want to ignore accents, others don't.

As to your original question, Unicode collation would give you
primary-equal "mem" and "sigma" characters.

05DE; [63 1E, 05, 05] # Hebr Lo [1F81.0020.0002] * HEBREW LETTER MEM
FB26; [63 1E, 05, 20] # Hebr Lo [1F81.0020.0005] * HEBREW LETTER WIDE FINAL MEM
05DD; [63 1E, 05, 2E] # Hebr Lo [1F81.0020.0019] * HEBREW LETTER FINAL MEM
FB3E; [63 1E, 05, 05][, E5 B1, 05] # Hebr Lo [1F81.0020.0002][0000.005F.0002] * HEBREW LETTER MEM WITH DAGESH

03C3; [5F 42, 05, 05] # Grek Ll [1C95.0020.0002] * GREEK SMALL LETTER SIGMA
03F2; [5F 42, 05, 10] # Grek Ll [1C95.0020.0004] * GREEK LUNATE SIGMA SYMBOL
1D6D3; [5F 42, 05, 17] # Zyyy Ll [1C95.0020.0005] * MATHEMATICAL BOLD SMALL FINAL SIGMA
...
03C2; [5F 42, 05, 33] # Grek Ll [1C95.0020.0019] * GREEK SMALL LETTER FINAL SIGMA

You can certainly simplify a few things when you don't care about the
order, therefore CLDR defines "search" tailorings.

Some popular browsers use collation-based search for ctrl-F in-page search,
either with strength=primary (ignore accent/case/etc. variants), or with
asymmetric search. ICU implements those algorithms and carries the CLDR
tailorings.

See http://www.unicode.org/reports/tr10/#Searching

Best regards,
markus

From richard.wordingham at ntlworld.com Thu Feb 19 16:02:57 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 19 Feb 2015 22:02:57 +0000
Subject: Compatibility decomposition for Hebrew and Greek final letters
In-Reply-To: <83zj89lkpx.fsf@gnu.org>
References: <83d25641d3.fsf@gnu.org>