From doug at ewellic.org Sun Feb 1 17:18:18 2015
From: doug at ewellic.org (Doug Ewell)
Date: Sun, 1 Feb 2015 16:18:18 -0700
Subject: N'Ko - which character? 02BC vs. 2019
In-Reply-To:
References:
Message-ID: <34B5AEA2EAC449CA9DDFE10107DAD3A6@DougEwell>

Markus Scherer wrote:

> Dear Unicoders, which is the proper second character in "N'Ko"?
> See below for details.
>
> ---------- Forwarded message ----------
> From: Doug Ewell

For the record, I did not ask on ietf-languages for any re-evaluation of
the apostrophe character used in the name N'Ko. My question, and that of
the group, was about the apostrophes used in the names of Khoisan and
Bantu languages.

--
Doug Ewell | Thornton, CO, USA | http://ewellic.org

From chris.fynn at gmail.com Mon Feb 2 00:12:55 2015
From: chris.fynn at gmail.com (Christopher Fynn)
Date: Mon, 2 Feb 2015 12:12:55 +0600
Subject: N'Ko - which character? 02BC vs. 2019
In-Reply-To: <34B5AEA2EAC449CA9DDFE10107DAD3A6@DougEwell>
References: <34B5AEA2EAC449CA9DDFE10107DAD3A6@DougEwell>
Message-ID:

If used as characters that are part of a word, especially when they
occur at the beginning or end of a word, ASCII apostrophes and both
right and left quotation marks easily get changed to something else by
the auto-quote features of word processors.

From Andrew.Glass at microsoft.com Mon Feb 2 12:14:31 2015
From: Andrew.Glass at microsoft.com (Andrew Glass (WINDOWS))
Date: Mon, 2 Feb 2015 18:14:31 +0000
Subject: N'Ko - which character? 02BC vs. 2019
In-Reply-To:
References: <34B5AEA2EAC449CA9DDFE10107DAD3A6@DougEwell>
Message-ID:

For what it's worth, the N'ko Institute of America uses U+2019. But that
is probably a reflection of the font situation and the fact that U+2019
is often more accessible in word processors.
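
As an aside on the 02BC vs. 2019 question, here is a minimal sketch (not
from any poster in this thread) that prints the name and general
category of the characters being compared. It assumes Python's standard
unicodedata module; the helper names describe() and is_letter_like() are
made up for illustration. The point it shows is only that U+02BC is
classed as a modifier letter while U+2019 is classed as punctuation,
which is why the two behave differently in word-boundary and auto-quote
handling.

```python
import unicodedata

# Characters mentioned in the thread as candidates or near neighbours.
CANDIDATES = ["\u0027", "\u02BC", "\u2019", "\u2018", "\u02BB"]


def describe(ch):
    """Return (code point label, character name, general category)."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


def is_letter_like(ch):
    """True if the general category is one of the letter categories (L*)."""
    return unicodedata.category(ch).startswith("L")


if __name__ == "__main__":
    for ch in CANDIDATES:
        label, name, category = describe(ch)
        print(f"{label}  {category}  {name}  letter-like={is_letter_like(ch)}")
```

Which character is right for a given orthography is still a matter of
convention, as the replies in this thread show; the properties only
explain the behavioural difference.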
http://nkoinstitute.com/the-n-character/

From everson at evertype.com Mon Feb 2 12:36:58 2015
From: everson at evertype.com (Michael Everson)
Date: Mon, 2 Feb 2015 18:36:58 +0000
Subject: N'Ko - which character? 02BC vs. 2019
In-Reply-To:
References:
Message-ID:

On 31 Jan 2015, at 22:04, Markus Scherer wrote:

> Dear Unicoders, which is the proper second character in "N'Ko"?
> See below for details.

U+2019. It is not a letter in N’Ko. Moreover, the reference fonts for
N’Ko didn’t even have U+02BC. For N’Ko, this is not arguable.

I would like to point out (perhaps again) that in my Hawaiian, Samoan,
and Tongan editions of Alice’s Adventures in Wonderland, and in the
forthcoming Hawaiian Hobbit, U+02BB has been drawn 133% taller, but of
the same width, as U+2018. I believe this really must be considered good
practice. In these novels, with “quotation marks ‘and nested quotation
marks’,” making this distinction is really rather essential.

Michael Everson * http://www.evertype.com/

From verdy_p at wanadoo.fr Mon Feb 2 12:54:55 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 2 Feb 2015 19:54:55 +0100
Subject: N'Ko - which character? 02BC vs. 2019
In-Reply-To:
References: <34B5AEA2EAC449CA9DDFE10107DAD3A6@DougEwell>
Message-ID:

On this page the N'ko Institute hesitates and uses U+2018 (‘) in
English, i.e. the reverse direction.
It has the advantage that it is used immediately after the letter N/n,
and if it ever appears at the end of a word, it won't be matched as one
of a pair of single quotation marks (U+2018 is punctuation only at the
start of a line, or after whitespace and punctuation; U+2019 is not
always quotation punctuation after a letter: even if it is followed by
whitespace or punctuation, it may also be an orthographic apostrophe).

From verdy_p at wanadoo.fr Mon Feb 2 12:55:17 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 2 Feb 2015 19:55:17 +0100
Subject: N'Ko - which character? 02BC vs. 2019
In-Reply-To:
References: <34B5AEA2EAC449CA9DDFE10107DAD3A6@DougEwell>
Message-ID:

The link did not come through: http://nkoinstitute.com/nko-alphabet/

From kent.karlsson14 at telia.com Mon Feb 2 17:31:11 2015
From: kent.karlsson14 at telia.com (Kent Karlsson)
Date: Tue, 03 Feb 2015 00:31:11 +0100
Subject: N'Ko - which character? 02BC vs. 2019
In-Reply-To:
Message-ID:

On 2015-02-02 19:36, "Michael Everson" wrote:

> Hawaiian Hobbit, U+02BB has been drawn 133% taller, but of the same width, as
> U+2018. I believe this really must be considered good practice. In these

I think you mean 33 % taller, i.e. height 133 % relative to its "normal"
height. 133 % taller would be more than double its normal height, making
it about as tall as an uppercase letter... That would be excessive...

/Kent K

From everson at evertype.com Mon Feb 2 18:00:17 2015
From: everson at evertype.com (Michael Everson)
Date: Tue, 3 Feb 2015 00:00:17 +0000
Subject: N'Ko - which character? 02BC vs. 2019
In-Reply-To:
References:
Message-ID: <17FCA56A-E578-4031-A885-8FB5AD8A853D@evertype.com>

On 2 Feb 2015, at 23:31, Kent Karlsson wrote:

>> Hawaiian Hobbit, U+02BB has been drawn 133% taller, but of the same width, as
>> U+2018. I believe this really must be considered good practice. In these
>
> I think you mean 33 % taller, i.e. height 133 % relative to its "normal"
> height. 133 % taller would be more than double its normal height, making
> it about as tall as an uppercase letter... That would be excessive…

Yes, that’s right. I just type “133” into the font editor.

Michael Everson * http://www.evertype.com/

From ishida at w3.org Wed Feb 4 06:40:01 2015
From: ishida at w3.org (Richard Ishida)
Date: Wed, 04 Feb 2015 12:40:01 +0000
Subject: Bopomofo light tone mark on the Web
Message-ID: <54D21321.8030904@w3.org>

At the W3C we are trying to understand how to handle the bopomofo in
phonetic annotations (for the CSS Ruby spec).

Please see a write up of the background and some relevant questions at
http://rishida.net/scripts/bopomofo/ontheweb

A key question relates to the light tone.

The light tone falls out from most IMEs and is displayed, for example,
by Keynote's phonetic guide function, after the bopomofo letters. In
pretty much all the vertical bopomofo we have seen, and in pretty much
all dictionaries we have seen (horizontally or vertically set), the
light tone is, however, displayed before the bopomofo letters.

Note that modern dictionaries appear to be actually moving the character
code into first position in the syllable to achieve this.

We'd like to know:

1. is anyone aware of any ruling about where the light tone should
appear and/or be stored in the text stream?

2. does it (really) matter if text sometimes contains the light tone
character before the syllable and sometimes trailing, depending on where
people prefer to put it?

(Obviously, there's a theoretical issue for sorting and searching if it
is sometimes in one place and sometimes in another, but it may be that
both places are actually viable positions.)

3. is there any font/rendering software out there that makes the light
tone appear at the start of a syllable, when the character is actually
at the end of the syllable?

cheers,

ri

From verdy_p at wanadoo.fr Wed Feb 4 19:45:48 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 5 Feb 2015 02:45:48 +0100
Subject: Bopomofo light tone mark on the Web
In-Reply-To: <54D21321.8030904@w3.org>
References: <54D21321.8030904@w3.org>
Message-ID:

Does it really matter, given that the sign is written orthogonally to
the direction of writing of the bopomofo line? Does it have to be a
combining character when it could be a standard spacing character on
that line, so that users can place it before or after (for collation it
would be a problem only at the tertiary level, but it can be ignorable
at the first and second levels)?
Wouldn't the common middle dot be usable?
Or could a variant be encoded after the specific Bopomofo light tone
spacing mark, to indicate its preferred placement ("above" or "below",
probably with "above" being the default) in the vertical writing style
(this variant being ignored for the horizontal writing style, for
example in IMEs)?

From verdy_p at wanadoo.fr Wed Feb 4 19:52:48 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 5 Feb 2015 02:52:48 +0100
Subject: Bopomofo light tone mark on the Web
In-Reply-To:
References: <54D21321.8030904@w3.org>
Message-ID:

An alternative could be to encode TWO separate tone marks:

- one for the usual mode where it appears to the right (horizontal
writing) or top (vertical writing) and where it is then a combining
character.

- one for the alternate "phonetic" mode where it will be forced to
appear always before: it will be a spacing mark and will be encoded in
the reading and typing order (but for this specific usage, the common
middle dot would be enough and will work in both writing directions).

From jf at colson.eu Fri Feb 6 07:30:32 2015
From: jf at colson.eu (Jean-François Colson)
Date: Fri, 06 Feb 2015 14:30:32 +0100
Subject: Wrong plane numbers
Message-ID: <54D4C1F8.9050908@colson.eu>

In the file NamesList.txt, I see:

@@ 2FF80 Unassigned 2FFFF
@@ 3FF80 Unassigned 3FFFF
@@ 4FF80 Unassigned 4FFFF
@@ 5FF80 Unassigned 5FFFF
@@ 6FF80 Unassigned 6FFFF
@@ 7FF80 Unassigned 7FFFF
@@ 8FF80 Unassigned 8FFFF
@@ 9FF80 Unassigned 9FFFF
@@ AFF80 Unassigned AFFFF
@@ BFF80 Unassigned BFFFF
@@ CFF80 Unassigned CFFFF
@@ DFF80 Unassigned DFFFF
@@ EFF80 Unassigned EFFFF
@@ FFF80 Supplementary Private Use Area-A FFFFF
@@ 10FF80 Supplementary Private Use Area-B 10FFFF

Shouldn’t

2FF80 3FF80 4FF80 5FF80 6FF80 7FF80 8FF80 9FF80 AFF80 BFF80 CFF80
DFF80 EFF80 FFF80 10FF80

become

20000 30000 40000 50000 60000 70000 80000 90000 A0000 B0000 C0000
D0000 E01F0 F0000 100000

?

From jf at colson.eu Fri Feb 6 07:33:36 2015
From: jf at colson.eu (Jean-François Colson)
Date: Fri, 06 Feb 2015 14:33:36 +0100
Subject: Wrong plane numbers
In-Reply-To: <54D4C1F8.9050908@colson.eu>
References: <54D4C1F8.9050908@colson.eu>
Message-ID: <54D4C2B0.8090907@colson.eu>

On 06/02/15 14:30, Jean-François Colson wrote:
> In the file NamesList.txt, I see:
>
> @@ 2FF80 Unassigned 2FFFF
> @@ 3FF80 Unassigned 3FFFF
> @@ 4FF80 Unassigned 4FFFF
> @@ 5FF80 Unassigned 5FFFF
> @@ 6FF80 Unassigned 6FFFF
> @@ 7FF80 Unassigned 7FFFF
> @@ 8FF80 Unassigned 8FFFF
> @@ 9FF80 Unassigned 9FFFF
> @@ AFF80 Unassigned AFFFF
> @@ BFF80 Unassigned BFFFF
> @@ CFF80 Unassigned CFFFF
> @@ DFF80 Unassigned DFFFF
> @@ EFF80 Unassigned EFFFF
> @@ FFF80 Supplementary Private Use Area-A FFFFF
> @@ 10FF80 Supplementary Private Use Area-B 10FFFF
>
>
> Shouldn’t
>
> 2FF80 3FF80 4FF80 5FF80 6FF80 7FF80 8FF80 9FF80 AFF80 BFF80 CFF80
> DFF80 EFF80 FFF80 10FF80
>
> become
>
> 20000 30000 40000 50000 60000 70000 80000 90000 A0000 B0000 C0000
> D0000 E01F0 F0000 100000
>
Of course I meant 2FA1E, not 20000…
> ?
>

From markus.icu at gmail.com Fri Feb 6 09:06:14 2015
From: markus.icu at gmail.com (Markus Scherer)
Date: Fri, 6 Feb 2015 07:06:14 -0800
Subject: Wrong plane numbers
In-Reply-To: <54D4C2B0.8090907@colson.eu>
References: <54D4C1F8.9050908@colson.eu> <54D4C2B0.8090907@colson.eu>
Message-ID:

These are not block boundaries. These lines are for book chart
production, where we don't need to print every unassigned code point.

markus

From jf at colson.eu Fri Feb 6 09:15:57 2015
From: jf at colson.eu (Jean-François Colson)
Date: Fri, 06 Feb 2015 16:15:57 +0100
Subject: Wrong plane numbers
In-Reply-To:
References: <54D4C1F8.9050908@colson.eu> <54D4C2B0.8090907@colson.eu>
Message-ID: <54D4DAAD.8060407@colson.eu>

On 06/02/15 16:06, Markus Scherer wrote:
>
> These are not block boundaries. These lines are for book chart
> production, where we don't need to print every unassigned code point.
> markus
>
OK. But what about
@@ FFF80 Supplementary Private Use Area-A FFFFF
?

The Supplementary Private Use Area-A doesn’t begin at FFF80: it begins
at F0000.
It doesn’t end at FFFFF: it ends at FFFFD.

In
@@ 1E00 Latin Extended Additional 1EFF
1E00 and 1EFF are the limits of the block “Latin Extended Additional”.

Why isn’t it so with
@@ FFF80 Supplementary Private Use Area-A FFFFF
?

From kenwhistler at att.net Fri Feb 6 09:50:21 2015
From: kenwhistler at att.net (Ken Whistler)
Date: Fri, 06 Feb 2015 07:50:21 -0800
Subject: Wrong plane numbers
In-Reply-To: <54D4DAAD.8060407@colson.eu>
References: <54D4C1F8.9050908@colson.eu> <54D4C2B0.8090907@colson.eu> <54D4DAAD.8060407@colson.eu>
Message-ID: <54D4E2BD.6080305@att.net>

Markus has already explained this. But the following explanation fills
out some details.

These @@ lines are conveniences for chart production.
They are headers read by the unibook chart layout tool, which help guide
where chart layout for a block starts and stops. The @@ lines are *NOT*
block boundary definitions. So please do not try to interpret them as
such.

The normative definitions of block boundaries can be found in:

http://www.unicode.org/Public/UCD/latest/ucd/Blocks.txt

Incidentally, the block for the Supplementary Private Use Area-A *does*
end at FFFFF, not at FFFFD, as demonstrated by the *normative* block
definition from Blocks.txt:

F0000..FFFFF; Supplementary Private Use Area-A

The syntax used in the NamesList.txt file to drive chart production is
fully described in:

http://www.unicode.org/Public/UCD/latest/ucd/NamesList.html

Much of the content of NamesList.txt is indirectly normative, of course,
because it is used to generate the code charts for versions of the
Unicode Standard, but much of the content of the file is just markup
that assists in the layout of the charts and/or various informative,
annotational material. It is *NOT* safe or recommended to attempt to
reverse engineer content out of the bare text file, nor to try to infer
content implications for the standard by extracting it from the bare
text file.

Also, the unibook tool is regularly used by proposal writers to do chart
layout for encoding proposals, where block definitions obviously do not
even exist yet. The @@ header lines are used there, too, to specify
ranges used in the charts for the proposals.

By the way, there is over fifteen years of development history here for
the interaction of syntax in NamesList.txt and the ongoing maintenance
of the unibook chart production tool. The mismatch between @@ blockheader
ranges and normative block definitions has been noted (and explained)
a number of times now.

--Ken

From jf at colson.eu Fri Feb 6 10:21:15 2015
From: jf at colson.eu (Jean-François Colson)
Date: Fri, 06 Feb 2015 17:21:15 +0100
Subject: Not so wrong plane numbers
In-Reply-To: <54D4E2BD.6080305@att.net>
References: <54D4C1F8.9050908@colson.eu> <54D4C2B0.8090907@colson.eu> <54D4DAAD.8060407@colson.eu> <54D4E2BD.6080305@att.net>
Message-ID: <54D4E9FB.4020706@colson.eu>

On 06/02/15 16:50, Ken Whistler wrote:
> By the way, there is over fifteen years of development history here for
> the interaction of syntax in NamesList.txt and the ongoing maintenance
> of the unibook chart production tool. The mismatch between @@ blockheader
> ranges and normative block definitions has been noted (and explained)
> a number of times now.

OK. Sorry for the noise…

From alfred_z at web.de Sun Feb 8 14:15:38 2015
From: alfred_z at web.de (Alfred Zett)
Date: Sun, 08 Feb 2015 21:15:38 +0100
Subject: Unicode block for programming related symbols and codepoints?
Message-ID: <54D7C3EA.6080000@web.de>

Hello everyone,

is there such a unicode block for programming related codepoints?

Conventional search engines as well as wolfram alpha can't answer that,
with the former one leading to all the programming problems that
occur...

If such a block doesn't exist, I'd like to make a proposal - if
possible - to add one with at least the following codepoints/characters:

- Indentation codepoint, with no fixed defined graphical representation.
For indentation based programming languages.
Because:
-- specific clients may want to show it differently (for example as
arrows, lines etc., using another color):
--- browsers could let the web page creator decide the visual
representation (character and size) via CSS
--- the same with editors, independent from the actual font
--- in case of visual impairment, the user could even change the
acoustic representation if the editor allows it
-- unlike a space symbol, it wouldn't need more than one character per
indentation
-- unlike tabs or space, it wouldn't be whitespace
-- unlike normal arrow characters, one could customize the length in an
editor and wouldn't have to insert extra spaces for a better visual
imagery

- A codepoint for string literal quotes, that would spare one the
escaping.

- A statement separator symbol.

- Other ideas?

You may now think, this is highly specific and you are right.
However, so are EMOJI signs, in particular those like PINE DECORATION.

These days, there are a lot of tools to create small embedded scripting
languages and DSLs, which are used in-program in special editors. And
there are a lot of people using them.
Exactly these could really profit from such a codeblock instead of using
conventional ASCII subset characters.
Also, there is a lot of potential with really good text editors and IDEs
where semantics may matter a lot.

Excuse my English, I hope this was understandable.

Best regards,

A. Z.

From olopierpa at gmail.com Sun Feb 8 15:32:03 2015
From: olopierpa at gmail.com (Pierpaolo Bernardi)
Date: Sun, 8 Feb 2015 22:32:03 +0100
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To: <54D7C3EA.6080000@web.de>
References: <54D7C3EA.6080000@web.de>
Message-ID:

On Sun, Feb 8, 2015 at 9:15 PM, Alfred Zett wrote:
> Hello everyone,
>
> is there such a unicode block for programming related codepoints?
>
> Conventional search engines as well as wolfram alpha can't answer that, with
> the former one leading to all the programming problems that occur...
>
> If such a block doesn't exist, I'd like to make a proposal - if possible -
> to add one with at least the following codepoints/characters:

Once upon a time I would have said that this is out of scope for
Unicode. But now anything goes, so who knows.

> - Indentation codepoint, with no fixed defined graphical representation. For
> indentation based programming languages.
> Because:
> -- specific clients may want to show it differently (for example as arrows,
> lines etc., using another color):
> --- browsers could let the web page creator decide the visual
> representation (character and size) via CSS
> --- the same with editors, independent from the actual font
> --- in case of visual impairment, the user could even change the acoustic
> representation if the editor allows it
> -- unlike a space symbol, it wouldn't need more than one character per
> indentation
> -- unlike tabs or space, it wouldn't be whitespace
> -- unlike normal arrow characters, one could customize the length in an
> editor and wouldn't have to insert extra spaces for a better visual imagery

a Tab is exactly what you described.

> - A codepoint for string literal quotes, that would spare one the escaping.

How would this work exactly?

> - A statement separator symbol.

What's wrong with ; , . : # % ^ & and other hundreds of punctuation
symbols?

Cheers
P.

From jf at colson.eu Sun Feb 8 15:51:58 2015
From: jf at colson.eu (Jean-François Colson)
Date: Sun, 08 Feb 2015 22:51:58 +0100
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To: <54D7C3EA.6080000@web.de>
References: <54D7C3EA.6080000@web.de>
Message-ID: <54D7DA7E.6010009@colson.eu>

On 08/02/15 21:15, Alfred Zett wrote:
> Hello everyone,
>
> is there such a unicode block for programming related codepoints?
>
> Conventional search engines as well as wolfram alpha can't answer
> that, with the former one leading to all the programming problems that
> occur...
>
> If such a block doesn't exist, I'd like to make a proposal - if
> possible - to add one with at least the following codepoints/characters:
>
> - Indentation codepoint, with no fixed defined graphical
> representation. For indentation based programming languages.

That wouldn’t be compliant with existing languages and future languages
might use any existing character.
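
To make the "a Tab is exactly what you described" and "any existing
character" remarks concrete, here is a minimal sketch (not from any
poster; the function names indent_depth() and block_events() are
hypothetical) of how an indentation-based language could already use one
TAB per level. It assumes a toy convention in which only leading TAB
characters count as indentation.

```python
def indent_depth(line):
    """Count leading TAB characters; each one is one indentation level."""
    return len(line) - len(line.lstrip("\t"))


def block_events(lines):
    """Yield ("INDENT" | "DEDENT" | "LINE", text) events for the given lines."""
    depth = 0
    for raw in lines:
        if not raw.strip():
            continue  # blank lines do not change the block structure
        new_depth = indent_depth(raw)
        while depth < new_depth:  # one INDENT per level opened
            depth += 1
            yield ("INDENT", "")
        while depth > new_depth:  # one DEDENT per level closed
            depth -= 1
            yield ("DEDENT", "")
        yield ("LINE", raw.strip())
    while depth > 0:  # close blocks still open at end of input
        depth -= 1
        yield ("DEDENT", "")


if __name__ == "__main__":
    for event in block_events(["if x:", "\tdo_a()", "\t\tdo_b()", "done()"]):
        print(event)
```

How the TAB is drawn (width, colour, arrow glyph) is then left to the
editor, which is the customisation the original proposal asks for; the
sketch says nothing about whether a separate code point would be better.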

> Because:
> -- specific clients may want to show it differently (for example as
> arrows, lines etc., using another color):

Can’t good editors display tabs in a different color when required?

> --- browsers could let the web page creator decide the visual
> representation (character and size) via CSS
> --- the same with editors, independent from the actual font
> --- in case of visual impairment, the user could even change the
> acoustic representation if the editor allows it
> -- unlike a space symbol, it wouldn't need more than one character per
> indentation
> -- unlike tabs or space, it wouldn't be whitespace
> -- unlike normal arrow characters, one could customize the length in
> an editor and wouldn't have to insert extra spaces for a better visual
> imagery
>
> - A codepoint for string literal quotes, that would spare one the
> escaping.

I rarely escape quotes.
In a text, I use ’ (U+2019) as an apostrophe and ?????? as quotes, so I
don’t need to escape them.
When I use PHP to generate some HTML code, I try to alternate single and
double quotes as much as possible. That way I rarely need to escape
them.

> - A statement separator symbol.

To replace the semicolon in C and the languages based on its syntax?

> - Other ideas?

Aren’t you trying to reinvent APL?

From jf at colson.eu Sun Feb 8 16:02:05 2015
From: jf at colson.eu (Jean-François Colson)
Date: Sun, 08 Feb 2015 23:02:05 +0100
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To:
References: <54D7C3EA.6080000@web.de>
Message-ID: <54D7DCDD.9060003@colson.eu>

On 08/02/15 22:32, Pierpaolo Bernardi wrote:
> On Sun, Feb 8, 2015 at 9:15 PM, Alfred Zett wrote:
> […]
> > -- unlike tabs or space, it wouldn't be whitespace
> […]
>
> a Tab is exactly what you described.

Not exactly: a tab IS whitespace. It may sometimes be displayed in a
different color or with a special symbol on request if the editor allows
it, but in most cases it is whitespace.

From alfred_z at web.de Sun Feb 8 16:07:52 2015
From: alfred_z at web.de (Alfred Zett)
Date: Sun, 08 Feb 2015 23:07:52 +0100
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To: <54D7D37B.8090900@colson.eu>
References: <54D7C3EA.6080000@web.de> <54D7D37B.8090900@colson.eu>
Message-ID: <54D7DE38.7090300@web.de>

Hi Jean-Francois Colson,

I hope this doesn't mess up the mailing list.

>>
>> - Indentation codepoint, with no fixed defined graphical
>> representation. For indentation based programming languages.
>
> That wouldn’t be compliant with existing languages and future
> languages might use any existing character.

This was for new languages. Creators of future languages mostly orient
themselves by whatever is available and makes sense, so I may make this
proposal as well, so they don't have to choose the half-assed
workarounds they use now.
Also, as long as there is stuff like
https://github.com/sferik/active_emoji it still makes more sense.
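
On the "a tab IS whitespace" point made earlier in this thread, here is
a minimal sketch (not from any poster) that shows how a program sees
TAB, SPACE, and a visible arrow. It assumes Python's standard
unicodedata module and Python's own str.isspace() rule, which is close
to but not identical with the Unicode White_Space property; classify()
is a hypothetical helper and U+2192 is only an illustrative choice of
arrow.

```python
import unicodedata


def classify(ch):
    """Return (code point label, general category, str.isspace() result)."""
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), ch.isspace())


if __name__ == "__main__":
    # TAB and SPACE are the characters debated in the thread; the arrow is
    # an example of a visible, non-whitespace symbol.
    for ch in ["\t", " ", "\u2192"]:
        print(classify(ch))
```

TAB is a control character that tools treat as whitespace, which is the
property the proposed indentation character would deliberately not have.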
>> Because:
>> -- specific clients may want to show it different (for example as arrows, lines etc., using another color):
>
> Can’t good editors display tabs in a different color when required?

Not as reliable and customizable as a special codepoint. For example

>> --- browsers could let the web page creator let decide the visual representation (character and size) via CSS

can't be done and on-the-fly copy and paste conversion with JavaScript is horrid and broken for security reasons.
But it's an issue even in good editors as well. You need a lexing plugin that may work or not. And the size and other factors are still fixed. After all, tabs have whitespace semantics that may appear everywhere in the text.

>> --- the same with editors, independent from the actual font
>> --- in case of visual impairment, the user could even change the accoustical representation if the editor allows it
>> -- unlike a space symbol, it wouldn't need more than one character per indentation
>> -- unlike tabs or space, it wouldn't be whitespace
>> -- unlike normal arrow characters, one could customize the length in an editor and wouldn't have to insert extra spaces for a better visual imagery
>>
>> - A codepoint for string literal quotes, that would spare one the escaping.
>
> I rarely escape quotes.
> In a text, I use ’ (U+2019) as an apostrophe and ?????? as quotes, so I don’t need to escape them.
> When I use PHP to generate some HTML code, I try to alternate simple and double quotes as much as possible. That way I rarely need to escape them.

OK, but that's just your scenario. With a language design from the past. With probably an editor from the past that allows non-unicode encodings. In a better world, manual code point inserting was a last resort.

Imagine someone wants to make his text look like written with a typewriter. Or something else.

>> - A statement separator symbol.
>
> To replace the semicolon in C and the languages based on its syntax?
Again, for future uses. To be honest, this might sound questionable, but this could blur the line between visual line breaks and visual characters like semicolons.
Line-break ended comments are separator ended comments.
Of course, that's the least required part of those three proposed characters, but I thought for the sake and completeness that shouldn't miss.

Come to think of it, two sets of opening and closing block symbols couldn't harm either. And a continue-after-linebreak symbol as well.

>> - Other ideas?
>
> Aren’t you trying to reinvent APL?

No. APL places a lot of alien-looking, annoying characters to anyone except mathematicians into your code that are hard to input. In particular from the context.

My proposal on the other hand - if implemented right - introduces some really intuitive looking and easy to input characters, because a bold arrow at the left doesn't need further explanation and your IDE of the future can easily place them when pressing tab in the right position.

From alfred_z at web.de Sun Feb 8 16:27:46 2015
From: alfred_z at web.de (Alfred Zett)
Date: Sun, 08 Feb 2015 23:27:46 +0100
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To:
References: <54D7C3EA.6080000@web.de>
Message-ID: <54D7E2E2.6080705@web.de>

Hi Pierpaolo Bernardi,

given that you did include my address as well as the unicode address I'm doing the same.

> On Sun, Feb 8, 2015 at 9:15 PM, Alfred Zett wrote:
>> Hello everyone,
>>
>> is there such a unicode block for programming related codepoints?
>>
>> Conventional search engines as well as wolfram alpha can't answer that, with the former one leading to all the programming problems that occur...
>>
>> If such a block doesn't exist, I'd like to make a proposal - if possible - to add one with at least the following codepoints/characters:
> Once upon a time I would have said that this is out of scope for Unicode. But now anything goes, so who knows.
That was exactly my thought, so I figured it couldn't harm to have these comfy special characters in there :)

>> - Indentation codepoint, with no fixed defined graphical representation. For indentation based programming languages.
>> Because:
>> -- specific clients may want to show it different (for example as arrows, lines etc., using another color):
>> --- browsers could let the web page creator let decide the visual representation (character and size) via CSS
>> --- the same with editors, independent from the actual font
>> --- in case of visual impairment, the user could even change the accoustical representation if the editor allows it
>> -- unlike a space symbol, it wouldn't need more than one character per indentation
>> -- unlike tabs or space, it wouldn't be whitespace
>> -- unlike normal arrow characters, one could customize the length in an editor and wouldn't have to insert extra spaces for a better visual imagery
> a Tab is exactly what you described.

No. It's only half of what I described.
It's still a typographical character that implies whitespace and may appear everywhere in the text. Custom size behavior (but not too custom) is the only similarity to that indentation character.

>> - A codepoint for string literal quotes, that would spare one the escaping.
> How would this work exactly?

Imagine you type " in your IDE, but because your IDE does know that this new programming language requires this special character as literal token, it replaces it with a special looking quotation mark. Now you are free to type any type of quotation mark until you hit ESC or something which places a closing special quotation mark and your caret right to it.
Of course, IDEs could render this without special marks and a different background colour instead; or whatever floats the IDE creators' boat.

>> - A statement separator symbol.
> What's wrong with ; , . : # % ^ & and other hundreds of punctuation symbols?
Nothing, they are just semantically not as nice and customizable.

Best regards
A.Z.

From shervinafshar at gmail.com Sun Feb 8 16:36:14 2015
From: shervinafshar at gmail.com (Shervin Afshar)
Date: Sun, 8 Feb 2015 14:36:14 -0800
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To: <54D7C3EA.6080000@web.de>
References: <54D7C3EA.6080000@web.de>
Message-ID:

All of the requirements mentioned here can be (and are) implemented in higher levels of software (like IDEs). IMO, there isn't any need for adding new characters to Unicode to address these issues.

Additionally, people tend to forget that simply because Unicode is doing emoji out of compatibility (or other) requirements, it does not mean that "now anything goes". I refer folks to TR51[1] (specifically sections 1.3, 8, and Annex C).

[1]: http://www.unicode.org/reports/tr51

Shervin

From jf at colson.eu Sun Feb 8 16:45:27 2015
From: jf at colson.eu (Jean-François Colson)
Date: Sun, 08 Feb 2015 23:45:27 +0100
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To: <54D7DE38.7090300@web.de>
References: <54D7C3EA.6080000@web.de> <54D7D37B.8090900@colson.eu> <54D7DE38.7090300@web.de>
Message-ID: <54D7E707.10104@colson.eu>

On 08/02/15 23:07, Alfred Zett wrote:
> Hi Jean-Francois Colson,
>
> I hope this doesn't mess up the mailing list.
>
>>> - Indentation codepoint, with no fixed defined graphical representation. For indentation based programming languages.
>>
>> That wouldn’t be compliant with existing languages and future languages might use any existing character.
>
> This was for new languages. Creators of future languages mostly orient on whatever is available and make sense, so I may make this proposal as well, so they don't have to choose the half-assed workarounds they use now.

I need a few tens of characters for a conlang I’m developing. ?

The problem is that Unicode only encodes characters which are effectively used today or which have been used in the past. It doesn’t encode characters which could perhaps be used in a hypothetical new programming language in the future.

> Also, as long as there is stuff like https://github.com/sferik/active_emoji it still makes more sense.
>
>>> Because:
>>> -- specific clients may want to show it different (for example as arrows, lines etc., using another color):
>>
>> Can’t good editors display tabs in a different color when required?
>
> Not as reliable and customizable as a special codepoint. For example
>
>>> --- browsers could let the web page creator let decide the visual representation (character and size) via CSS
>
> can't be done and on-the-fly copy and paste conversion with JavaScript is horrid and broken for security reasons.
> But it's an issue even in good editors as well. You need a lexing plugin that may work or not. And the size and other factors are still fixed. After all, tabs have whitespace semantics that may appear everywhere in the text.
From olopierpa at gmail.com Sun Feb 8 16:54:11 2015
From: olopierpa at gmail.com (Pierpaolo Bernardi)
Date: Sun, 8 Feb 2015 23:54:11 +0100
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To: <54D7E2E2.6080705@web.de>
References: <54D7C3EA.6080000@web.de> <54D7E2E2.6080705@web.de>
Message-ID:

On Sun, Feb 8, 2015 at 11:27 PM, Alfred Zett wrote:
> That was exactly my thought, so I figured it couldn't harm to have these

>> a Tab is exactly what you described.
>
> No. It's only half of what I described.
> It's still a typographical character that implies whitespace and may appear everywhere in the text.

How would your proposed character be displayed as plain text?

>>> - A codepoint for string literal quotes, that would spare one the escaping.
>>
>> How would this work exactly?
>
> Imagine you type " in your IDE, but because your IDE does know that this new programming language requires this special character as literal token, it replaces it with a special looking quotation mark.

Unicode is a standard for plain text. If you require a special IDE for your programming language then why use plain text at all?

From ritt.ks at gmail.com Sun Feb 8 17:27:59 2015
From: ritt.ks at gmail.com (Konstantin Ritt)
Date: Mon, 9 Feb 2015 03:27:59 +0400
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To:
References: <54D7C3EA.6080000@web.de> <54D7E2E2.6080705@web.de>
Message-ID:

> My proposal on the other hand - if implemented right - introduces some really intuitive looking and easy to input characters,

Easier than latin1, a layout one could find on [almost] every keyboard? Good luck.

Konstantin

From jf at colson.eu Sun Feb 8 17:36:19 2015
From: jf at colson.eu (Jean-François Colson)
Date: Mon, 09 Feb 2015 00:36:19 +0100
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To: <54D7DE38.7090300@web.de>
References: <54D7C3EA.6080000@web.de> <54D7D37B.8090900@colson.eu> <54D7DE38.7090300@web.de>
Message-ID: <54D7F2F3.5080102@colson.eu>

On 08/02/15 23:07, Alfred Zett wrote:
> Hi Jean-Francois Colson,

>>> - A codepoint for string literal quotes, that would spare one the escaping.
>>
>> I rarely escape quotes.
>> In a text, I use ’ (U+2019) as an apostrophe and ?????? as quotes, so I don’t need to escape them.
>> When I use PHP to generate some HTML code, I try to alternate simple and double quotes as much as possible. That way I rarely need to escape them.
>
> OK, but that's just your scenario. With a language design from the past. With probably an editor from the past that allows non-unicode encodings.

????? That’s mainly with gcc on GNU/Linux with a UTF-8 locale or with PHP with a in the XHTML document.

> In a better world, manual code point inserting was a last resort.

What do you call “manual inserting”?

> Imagine someone wants to make his text look like written with a typewriter.

That’s a very special case and a few \ are not a big problem.

You could use existing characters as “string literal quotes”.
I’ve never used APL so I don’t remember the meanings of its symbols, but couldn’t ⍘ U+2358 APL FUNCTIONAL SYMBOL QUOTE UNDERBAR or ⍞ U+235E APL FUNCTIONAL SYMBOL QUOTE QUAD work as “string literal quotes” in a new programming language?

>> Aren’t you trying to reinvent APL?
>
> No. APL places a lot of alien-looking, annoying characters to anyone except mathematicians into your code that are hard to input.

Hard to input? Not harder than the new symbols you’d like to propose. That’s only a matter of keyboard layout and input method.

> In particular from the context.
>
> My proposal on the other hand - if implemented right - introduces some really intuitive looking and easy to input characters,

In what way would they be easier to input?
> because a bold arrow at the left doesn't need further explanation and your IDE of the future can easily place them when pressing tab in the right position.

If the IDE inputs your new character when you press tab, then your new character is a tab?

From jf at colson.eu Sun Feb 8 18:04:10 2015
From: jf at colson.eu (Jean-François Colson)
Date: Mon, 09 Feb 2015 01:04:10 +0100
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To:
References: <54D7C3EA.6080000@web.de> <54D7E2E2.6080705@web.de>
Message-ID: <54D7F97A.1060909@colson.eu>

On 09/02/15 00:27, Konstantin Ritt wrote:
> > My proposal on the other hand - if implemented right - introduces some really intuitive looking and easy to input characters,
>
> Easier than latin1, a layout one could find on [almost] every keyboard? Good luck.

Latin-1 is not a keyboard layout, it’s a character set: ISO/IEC 8859-1.

Latin-1 is not available on almost every keyboard:
It is not available on most US keyboards except for the minority who uses a US international driver;
It is not available on most Russian keyboards which only provide Cyrillic letters and ASCII (unaccented) Latin letters;
It is not fully available on many Western European keyboards (With a French azerty keyboard on M$ Windows, using the default driver, you have no way to type a capital É or a capital Ç except by typing Alt + 0 2 0 1 or Alt + 0 1 9 9.);
It is not available on Central and Eastern European keyboards (to the East of Germany, Latin-2);
It is not available on Maltese or Turkish keyboards (Latin-3);
It is not available on keyboards of the Baltic countries (Latin-4);
Etc.

From alfred_z at web.de Mon Feb 9 06:55:02 2015
From: alfred_z at web.de (Alfred Zett)
Date: Mon, 09 Feb 2015 13:55:02 +0100
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To:
References: <54D7C3EA.6080000@web.de>
Message-ID: <54D8AE26.3030409@web.de>

OK, I will now try to answer all of you in one mail, otherwise it gets hard to keep an overview...

Shervin Afshar:
> All of the requirements mentioned here can be (and are) implemented in higher levels of software (like IDEs). IMO, there isn't any need for adding new characters to Unicode to address these issues.

But then it would be incompatible from IDE to IDE, like Python is incompatible using 2 spaces, 4 spaces and tabs. It's the data that is important, not the software.

> Additionally, people tend to forget that simply because Unicode is doing emoji out of compatibility (or other) requirements, it does not mean that "now anything goes".
> I refer folks to TR51[1] (specifically sections 1.3, 8, and Annex C).
>
> [1]: http://www.unicode.org/reports/tr51

You know, the fact that this consortium ever took emoji into consideration immediately justifies to include everything everyone ever wanted. There is no such thing as important data including emoji. :)

Jean-Francois Colson:
> I need a few tens of characters for a conlang I’m developing. ?

Except two or three control characters don't make a con language.
Also, if you don't like con languages in Unicode, what's this: http://unicode.org/charts/PDF/U1F700.pdf

> The problem is that Unicode only encodes characters which are effectively used today or which have been used in the past. It doesn’t encode characters which could perhaps be used in a hypothetical new programming language in the future.

So you want the font encoding scheme to be a limiting factor for new things?

Pierpaolo Bernardi:
> How would your proposed character be displayed as plain text?

There is no such thing as plain text. Even line breaks and tabs are a matter of interpretation. It's just that they usually have typographic semantics, even in programming editors, with all the side effects.

In very simple (and with that I mean shitty or not even remotely programming oriented) editors, it may show like a control character, like ?.

Browsers and any editor passing the "based on scintilla" complexity mark of course should display something that makes more sense, like an arrow or ? plus surrounding space.

> Unicode is a standard for plain text. If you require a special IDE for your programming language then why use plain text at all?

Because binary custom encoded databases or blob files are the death of interoperability.

Konstantin Ritt:
> Easier than latin1, a layout one could find on [almost] every keyboard? Good luck.

Also:

Jean-Francois Colson:
> Hard to input? Not harder than the new symbols you’d like to propose. That’s only a matter of keyboard layout and input method.
Indent by pressing tab and insert the literal thing by pressing ". Nothing changes, the IDE/editor does the work on the fly. Just that you have clean semantics, interoperability and customizability.
Beat that, APL. Where you would need >10 key bindings or an annoying software keyboard.

> I’ve never used APL so I don’t remember the meanings of its symbols, but couldn’t ⍘ U+2358 APL FUNCTIONAL SYMBOL QUOTE UNDERBAR or ⍞ U+235E APL FUNCTIONAL SYMBOL QUOTE QUAD work as “string literal quotes” in a new programming language?

That's a good idea. That still leaves the indentation character, which is harder than that, because one would want a control character with certain semantics. E.g.: for programming editors it would make sense to only allow it after line breaks and convert other occurrences into tabs.

> If the IDE inputs your new character when you press tab, then your new character is a tab?

Not if it detects the beginning of a line.

Best regards
A. Z.

From frederic.grosshans at gmail.com Mon Feb 9 08:08:39 2015
From: frederic.grosshans at gmail.com (Frédéric Grosshans)
Date: Mon, 09 Feb 2015 15:08:39 +0100
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To: <54D8AE26.3030409@web.de>
References: <54D7C3EA.6080000@web.de> <54D8AE26.3030409@web.de>
Message-ID: <54D8BF67.3050100@gmail.com>

On 09/02/2015 13:55, Alfred Zett wrote:
>> Additionally, people tend to forget that simply because Unicode is doing emoji out of compatibility (or other) requirements, it does not mean that "now anything goes". I refer folks to TR51[1] (specifically sections 1.3, 8, and Annex C).
>>
>> [1]: http://www.unicode.org/reports/tr51
>
> You know, the fact that this consortium ever took emoji into consideration immediately justifies to include everything everyone ever wanted. There is no such thing as important data including emoji.
> :)

The inclusion of emoji was the subject of considerable debate here, with people strongly against and strongly for. The trick is that they were already used as digital characters by Japanese Telcos and their millions of customers. They were de facto encoded as characters in Japanese text messages. At the time of encoding, the spread of smartphones made them appear in other places (emails, web forums, etc.)

> Jean-Francois Colson:
>> I need a few tens of characters for a conlang I’m developing. ?
> Except two or three control characters don't make a con language.
> Also, if you don't like con languages in Unicode, what's this: http://unicode.org/charts/PDF/U1F700.pdf

I doubt that “not liking con languages” is a faithful description of Jean-François ;-)

On a more serious note, this block is actually a set of “scientific” (at the time) notations used by Isaac Newton in his time. They were encoded in Unicode following an academic project to digitize his manuscripts. So here, you have characters used 3 centuries ago by no less than Isaac Newton, most of them having a much longer history, and useful for science historians. See http://www.unicode.org/L2/L2009/09037r2-alchemy.pdf for details.

This does not compare with a few characters invented for a conlang invented by an amateur and used by no one but himself. I think that is the point Jean-François wanted to make. A closer counter-example to Jean-François's “wish” would be Shavian (10450..1047F), but this alphabet has shown some use, and I guess that its encoding would have been much harder without its association with someone as famous as George Bernard Shaw or without the existing publication of a full text in Shavian.

>> The problem is that Unicode only encodes characters which are effectively used today or which have been used in the past. It doesn’t encode characters which could perhaps be used in a hypothetical new programming language in the future.
> So you want the font encoding scheme to be a limiting factor for new things?

It is more or less the rule, except that it is not a font encoding, but a standard encoding. Once something is encoded, it can never be unencoded. And the Unicode standard is built to stay relevant as long as possible (decades or centuries). So you ask for your character to be encoded in billions of devices for decades. It is more than a mere font encoding. There are a few exceptions, but only when a widespread use is really expected, like for monetary symbols (it was the case for the Euro).

What you are asking for is a character for an untested idea. You are convinced it is useful, but cannot prove anyone beyond yourself will use it, hence Jean-François’s parallel with conlangs. In order to have a chance of success, design a language using existing characters (e.g. some APL + ? for TAB) and/or private use codepoints. Once your language starts gathering steam, come back and argue that using an arrow or a tab is awkward, and that U+XXXX SHINY TAB FOR PROGRAMMERS would be an improvement for a significant community. I know it is a lot of work, but that is probably what it takes.

> Pierpaolo Bernardi:
>> How would your proposed character be displayed as plain text?
> There is no such thing as plain text.

When you say that, you don’t accept the premise of Unicode encoding. Unicode’s goal is to encode all plain text characters, but only plain text characters.

> Even line breaks and tabs are a matter of interpretation. It's just that they usually have typographic semantics, even in programming editors, with all the side effects.
>
> In very simple (and with that I mean shitty or not even remotely programming oriented) editors, it may show like a control character, like ?.
>
> Browsers and any editor passing the "based on scintilla" complexity mark of course should display something that makes more sense, like an arrow or ? plus surrounding space.
I think everyone here knows what you are saying, and that the notion of plain text is a bit fuzzy. But if you cannot argue that your character has a meaning in plain text, for some value of “plain text”, then you cannot hope for an encoding in Unicode.

From alfred_z at web.de Mon Feb 9 08:57:15 2015
From: alfred_z at web.de (Alfred Zett)
Date: Mon, 09 Feb 2015 15:57:15 +0100
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To: <54D8BF67.3050100@gmail.com>
References: <54D7C3EA.6080000@web.de> <54D8AE26.3030409@web.de> <54D8BF67.3050100@gmail.com>
Message-ID: <54D8CACB.1060308@web.de>

Frédéric Grosshans:
> On 09/02/2015 13:55, Alfred Zett wrote:
>>> Additionally, people tend to forget that simply because Unicode is doing emoji out of compatibility (or other) requirements, it does not mean that "now anything goes". I refer folks to TR51[1] (specifically sections 1.3, 8, and Annex C).
>>>
>>> [1]: http://www.unicode.org/reports/tr51
>>
>> You know, the fact that this consortium ever took emoji into consideration immediately justifies to include everything everyone ever wanted. There is no such thing as important data including emoji. :)
> The inclusion of emoji was the subject of considerable debate here, with people strongly against and strongly for. The trick is that they were already used as digital characters by Japanese Telcos and their millions of customers. They were de facto encoded as characters in Japanese text messages. At the time of encoding, the spread of smartphones made them appear in other places (emails, web forums, etc.)

The trick is that one doesn't bargain with Telcos and similar criminals. Gotta drop them hard and the pest will go away by itself after five years or so.

>> Jean-Francois Colson:
>>> I need a few tens of characters for a conlang I’m developing. ?
>> Except two or three control characters don't make a con language.
>> Also, if you don't like con languages in Unicode, what's this:
>> http://unicode.org/charts/PDF/U1F700.pdf
> I doubt that “not liking con languages” is a faithful description of
> Jean-François ;-)
>
> On a more serious note, this block is actually a set of “scientific”
> (for his time) notations used by Isaac Newton. They were
> encoded in Unicode following an academic project to digitize his
> manuscripts. So here, you have characters used 3 centuries ago by no
> less than Isaac Newton, most of them having a much longer history, and
> useful for science historians. See
> http://www.unicode.org/L2/L2009/09037r2-alchemy.pdf for details.
>
That's actually interesting. Good to know, thanks.

> I think everyone here knows what you are saying, and that the notion of
> plain text is a bit fuzzy. But if you cannot argue that your character
> has a meaning in plain text, for some value of “plain text”, then you
> cannot hope for an encoding in Unicode.
>
OK, in this case I agree it makes little sense to hope for such characters.

Best regards,

A. Z.

From john at mitre.org Mon Feb 9 09:37:38 2015
From: john at mitre.org (John D Burger)
Date: Mon, 9 Feb 2015 10:37:38 -0500
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To: <54D7DA7E.6010009@colson.eu>
References: <54D7C3EA.6080000@web.de> <54D7DA7E.6010009@colson.eu>
Message-ID: <6EE199C2-134D-4A63-91D4-DBF75B2C85CA@mitre.org>

>> - Indentation codepoint, with no fixed defined graphical representation. For indentation based programming languages.
>
> That wouldn't be compliant with existing languages and future languages might use any existing character.
>
>> Because:
>> -- specific clients may want to show it different (for example as arrows, lines etc., using another color):
>
> Can't good editors display tabs in a different color when required ?

Lots of them already do, e.g. Emacs in various modes.
- John Burger MITRE > >> --- browsers could let the web page creator let decide the visual representation (character and size) via CSS >> --- the same with editors, independent from the actual font >> --- in case of visual impairment, the user could even change the accoustical representation if the editor allows it >> -- unlike a space symbol, it wouldn't need more than one character per indentation >> -- unlike tabs or space, it wouldn't be whitespace >> -- unlike normal arrow characters, one could customize the length in an editor and wouldn't have to insert extra spaces for a better visual imagery >> >> - A codepoint for string literal quotes, that would spare one the escaping. > > I rarely escape quotes. > In a text, I use ? (U+2019) as an apostrophe and ?????? as quotes, so I don?t need to escape them. > When I use PHP to generate some HTML code, I try to alternate simple and double quotes as much as possible. That way I rarely need to escape them. > >> - A statement separator symbol. > > To replace the semicolon in C and the languages based on its syntax? > >> - Other ideas? > > Aren?t you trying to reinvent APL? > >> >> You may now think, this is highly specific and you are right. >> However, so are EMOJI signs, in particular those like PINE DECORATION. >> >> These days, there are a lot of tools to create small embedded scripting languages and DSLs, which are used in-program in special editors. And there is a lot of people using them. >> Exactly these could really profit from such a codeblock instead of using conventional ASCII subset characters. >> Also, there is a lot of potential with really good text editors and IDEs where semantics may matter a lot. >> >> Excuse my english, I hope this was understandable. >> >> Best regards, >> >> A. Z. 
From wjgo_10009 at btinternet.com Mon Feb 9 04:48:18 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Mon, 9 Feb 2015 10:48:18 +0000 (GMT)
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To: <54D7E707.10104@colson.eu>
References: <54D7C3EA.6080000@web.de> <54D7D37B.8090900@colson.eu> <54D7DE38.7090300@web.de> <54D7E707.10104@colson.eu>
Message-ID: <30873394.16253.1423478898115.JavaMail.defaultUser@defaultHost>

> The problem is that Unicode only encodes characters which are effectively used today or which have been used in the past.

Well, that was the case, but the situation appears to be changing.

There is my feedback note that is the last item in the following linked document.

http://www.unicode.org/L2/L2015/15019-pubrev.html

> It doesn't encode characters which could perhaps be used in a hypothetical new programming language in the future.

Well, that was the case and might still be the case.

We will only find out for sure, and then only for a particular case, when a situation arises where the Unicode Technical Committee rules on a petition submitted to the committee requesting the encoding of some such characters.

The fact that the rules over what can be encoded are changing rapidly opens up great possibilities for future developments from ideas put forward by the community.

If the changes in policy continue then this will be very beneficial to progress, as a regular Unicode encoding is free and equal for all to use, with no proprietary aspect to the encoding.
William Overington

9 February 2015

From A.Schappo at lboro.ac.uk Mon Feb 9 10:41:14 2015
From: A.Schappo at lboro.ac.uk (Andre Schappo)
Date: Mon, 9 Feb 2015 16:41:14 +0000
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To: <54D7C3EA.6080000@web.de>
References: <54D7C3EA.6080000@web.de>
Message-ID: <9B99B8A6-BB16-4DC0-A193-5D5869274040@lboro.ac.uk>

I think this is a very good idea. There are so many multiple uses of ASCII characters in programming languages that this really does need sorting out.

The fundamental separation of character semantics and glyph visual representation works really well for this proposal.

Let me take as an example the use of = in programming. The = is used for test of equality and assignment in various programming languages. The equality and assignment operations should have different characters, e.g.

U+XXX1 TEST FOR EQUALITY
U+XXX2 ASSIGNMENT OPERATOR

Initially the glyphs used for these characters could be = but then this mechanism can be used to transition to a new and less ambiguous visual representation. The new visual representation could be something like

U+XXX1 TEST FOR EQUALITY =
U+XXX2 ASSIGNMENT OPERATOR ?

Such a visual and character distinction between the 2 functions must surely make it easier for those learning to program and for interpreter and compiler writers. I think it would also make program code easier to read and understand.

André

On 8 Feb 2015, at 20:15, Alfred Zett wrote:

Hello everyone,

is there such a unicode block for programming related codepoints?
Conventional search engines as well as wolfram alpha can't answer that, with the former one leading to all the programming problems that occur...

If such a block doesn't exist, I'd like to make a proposal - if possible - to add one with at least the following codepoints/characters:

- Indentation codepoint, with no fixed defined graphical representation. For indentation based programming languages.
Because: -- specific clients may want to show it different (for example as arrows, lines etc., using another color): --- browsers could let the web page creator let decide the visual representation (character and size) via CSS --- the same with editors, independent from the actual font --- in case of visual impairment, the user could even change the accoustical representation if the editor allows it -- unlike a space symbol, it wouldn't need more than one character per indentation -- unlike tabs or space, it wouldn't be whitespace -- unlike normal arrow characters, one could customize the length in an editor and wouldn't have to insert extra spaces for a better visual imagery - A codepoint for string literal quotes, that would spare one the escaping. - A statement separator symbol. - Other ideas? You may now think, this is highly specific and you are right. However, so are EMOJI signs, in particular those like PINE DECORATION. These days, there are a lot of tools to create small embedded scripting languages and DSLs, which are used in-program in special editors. And there is a lot of people using them. Exactly these could really profit from such a codeblock instead of using conventional ASCII subset characters. Also, there is a lot of potential with really good text editors and IDEs where semantics may matter a lot. Excuse my english, I hope this was understandable. Best regards, A. Z. _______________________________________________ Unicode mailing list Unicode at unicode.org http://unicode.org/mailman/listinfo/unicode ???????????????? http://twitter.com/andreschappo http://schappo.blogspot.co.uk http://weibo.com/andreschappo http://blog.sina.com.cn/andreschappo -------------- next part -------------- An HTML attachment was scrubbed... URL: From martin at v.loewis.de Mon Feb 9 11:11:47 2015 From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=) Date: Mon, 09 Feb 2015 18:11:47 +0100 Subject: Unicode block for programming related symbols and codepoints? 
In-Reply-To: <54D8CACB.1060308@web.de>
References: <54D7C3EA.6080000@web.de> <54D8AE26.3030409@web.de> <54D8BF67.3050100@gmail.com> <54D8CACB.1060308@web.de>
Message-ID: <54D8EA53.5010907@v.loewis.de>

On 09.02.15 at 15:57, Alfred Zett wrote:
> That's actually interesting. Good to know, thanks.
>> I think everyone here knows what you are saying, and that the notion of
>> plain text is a bit fuzzy. But if you cannot argue that your character
>> has a meaning in plain text, for some value of “plain text”, then you
>> cannot hope for an encoding in Unicode.
>>
> OK, in this case I agree it makes little sense to hope for such characters.

That Unicode encodes "plain text" is indeed among its fundamentals (see 2.2, Unicode Design Principles). Also, the Criteria for Encoding Symbols speak against your characters, on the grounds of Jean-François's objections:

http://www.unicode.org/pending/symbol-guidelines.html

"The fact that a symbol merely "seems to be useful or potentially useful" is precisely not a reason to code it. Demonstrated usage, or demonstrated demand, on the other hand, does constitute a good reason to encode the symbol."

So if you can't demonstrate usage, you should at least demonstrate demand (rather than just claiming that there might be demand). The canonical examples of adding symbols with no demonstrated usage are apparently the currency symbols, where it is easy to demonstrate demand (by referring to the legislation that brings the currency to life). Welcome the NEW DRACHMA SIGN :-)

Regards,
Martin

From alfred_z at web.de Mon Feb 9 11:53:43 2015
From: alfred_z at web.de (Alfred Zett)
Date: Mon, 09 Feb 2015 18:53:43 +0100
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To: <13507203.47311.1423496435218.JavaMail.defaultUser@defaultHost>
References: <54D7C3EA.6080000@web.de> <13507203.47311.1423496435218.JavaMail.defaultUser@defaultHost>
Message-ID: <54D8F427.6070106@web.de>

@ John D Burger:
And all of a sudden a war is waged over what counts as a good editor. :D

@ Andre Schappo:
That's a good idea. We need it in the name of science and education. :D

William_J_G Overington:
> Hi
>
> You might like the following post.
>
> http://www.unicode.org/mail-arch/unicode-ml/y2010-m06/0001.html
>
> William
>
Hi, I'm really not sure what this is about, but it seems like an interface to deliver instructions to the rendering VM?

Martin v. Löwis:
> So if you can't demonstrate usage, you should at least demonstrate
> demand (rather than just claiming that there might be demand).
The problem is, you can't do that with the topic at hand, because most programmers don't even see the possibilities. It's like asking a blind person what colors look like. Although that may sound kind of arrogant.

Among language designers and people interested in stuff like this, there is only a small fraction that doesn't hold the ill-minded opinion that syntax doesn't matter at all.
Among those who care for syntax there is only a small fraction that really knows enough about Unicode. And who can blame them, I still see broken characters on a weekly basis.
Among those there is only a small fraction that cares enough.
Among those there is only a small fraction that has the nerves/balls to put up with a consortium.

This small subset is a handful of people, like André, me and maybe 3 other persons.

I don't really feel comfortable sounding that elitist, but in this case I dare say that the consortium shouldn't care for established popularity, the same way they should have handled emoji characters.

Best regards

A. Z.
From andrea.giammarchi at gmail.com Mon Feb 9 11:54:18 2015
From: andrea.giammarchi at gmail.com (Andrea Giammarchi)
Date: Mon, 9 Feb 2015 18:54:18 +0100
Subject: About cultural/languages communities flags
Message-ID:

Hello everyone,

I've had an interesting request [1] that makes sense to me, but I'd like to understand Unicode's position on it.

The TL;DR version of the request is the following:

There are communities, let's take Scottish people as an example, that even have a domain but not an emoji flag. Some flag-related project has adopted more than what we have now in emoji, including 239 flags: http://www.famfamfam.com/archive/flag-icons-released/

The proposal is quite simple, and I am quoting from the request:

> if a cultural/language TLD is typed with Unicode RIS, then show the flag for these culture/language:

???????? --> it shows Scottish flag
?????????? --> it shows a Welsh flag
?????? --> it shows a Breton flag
?????? --> it shows Catalan flag
?????? --> it shows a Basque flag
?????? --> it shows a Galician flag

Thanks in advance for any sort of outcome.

Best Regards

[1] https://github.com/twitter/twemoji/issues/40

From kenwhistler at att.net Mon Feb 9 12:17:09 2015
From: kenwhistler at att.net (Ken Whistler)
Date: Mon, 09 Feb 2015 10:17:09 -0800
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To: <9B99B8A6-BB16-4DC0-A193-5D5869274040@lboro.ac.uk>
References: <54D7C3EA.6080000@web.de> <9B99B8A6-BB16-4DC0-A193-5D5869274040@lboro.ac.uk>
Message-ID: <54D8F9A5.9070302@att.net>

I think this discussion is confusing the need for separate syntactic functions in formal language definitions with the need for *encoding* of characters.

The distinction between assignment and test for equality has been around for decades in formal languages, and of course it is almost always carefully distinguished in the formal syntax:

C, C++ and kindred
Use "=" for assignment.
Use "==" for equivalence operator.

Pascal and kindred
Use ":=" for assignment.
Use "=" for equivalence operator.

Lisp
Assignment: let (a 6)
Equivalence evaluation: (= a 6)

And so on.

The fact that these formal languages do not use a *single* distinct character for each of these syntactic functions is not a formal defect -- there are many, many concepts in formal languages which are defined using sequences of characters, rather than a single character. As has already been alluded to in this thread, trying to stack all functionality into single character definitions heads back in the direction of relatively illegible APL program text. It might have its place, but isn't much of a choice for widely used general programming languages.

There are two basic issues with using sequences of (typically ASCII) characters for fundamental operators:

1. It marginally complicates parsing.
2. If chosen badly, they can confuse programmers using the syntax.

#1 is basically trivial, as long as the formal syntax passes the bar of not introducing syntactic ambiguity.

#2 is the *real* problem, imo. The use in C of "=" and "==" was badly designed from the start, and is the source of bezillions of inadvertent programming errors in practice.

But if a left arrow, for example, might be a better choice for an assignment operator in a programming language, and a two-character ASCII operator like ":=" or "<-" doesn't seem appropriate or causes other confusion, there still isn't a character *encoding* issue here. Just use "←", which already exists (U+2190), and is a fine left arrow!

What is *not* appropriate for Unicode consideration here is trying to encode programming *functions* per se. That turns the problem on its head really. There are lots and lots of symbols already defined in the standard: it is the job of formal language designers to simply pick from them and *define* their formal functions in their language design.
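To illustrate the point about picking an existing symbol and defining its function in the language itself, here is a minimal Python sketch of a tokenizer for an imaginary toy language. The token names, the choice of U+2190 for assignment with ":=" as an ASCII fallback, and "==" for equality are all assumptions made for this example; they are not taken from any language mentioned in the thread. Multi-character operators are handled by ordinary longest-match ordering, which is the "marginal" parsing cost referred to above.

```python
import re

# Hypothetical token table for an imaginary toy language.
# Nothing here is part of Unicode or of any real language definition.
TOKEN_SPEC = [
    ("ASSIGN", "\u2190|:="),      # U+2190 LEFTWARDS ARROW, or ":=" as ASCII fallback
    ("EQUALS", r"=="),            # two-character equality test
    ("NUMBER", r"\d+"),
    ("NAME",   r"[A-Za-z_]\w*"),
    ("SKIP",   r"\s+"),           # whitespace is matched but not emitted
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(source):
    """Split source text into (token_name, text) pairs."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            # A lone "=" ends up here: this toy language deliberately has no such token.
            raise SyntaxError(
                f"unexpected character {source[pos]!r} at offset {pos}"
            )
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


if __name__ == "__main__":
    print(tokenize("total \u2190 6"))
    print(tokenize("total == 6"))
```

In this sketch the distinction between assignment and equality lives entirely in the token table, so switching the assignment symbol means editing one pattern, with no new character needed.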
Just because the UTC occasionally invents new control functions and encodes them in characters -- as for the bidirectional algorithm -- does not mean that every new function conceived for a programming language is automatically a character encoding problem. Coming to the UTC looking to encode a "new functional character" on spec should be a matter of *last* resort -- not a first resort. It requires a carefully built case demonstrating a real use and showing that alternative approaches using existing characters do not (and cannot) work. --Ken P.S. Arrow symbols like U+2190 have been in the Unicode Standard since Unicode 1.0 in 1991. They are far, far more widely supported nowadays than any new, language-specific functional symbol addition would be. Even if the UTC agreed to such character additions at the next meeting in May, its earliest opportunity for publication would be Unicode 10 in June, 2017. That amounts to a 26 year impedance mismatch for implementations. Why would a designer of a new formal language syntax want to buy into that kind of grief for character availability, when there are hundreds of symbols in the standard to choose from that have been encoded for decades now? On 2/9/2015 8:41 AM, Andre Schappo wrote: > > > Let me take as an example the use of = in programming. The = is used > for test of equality and assignment in various programming languages. > The equality and assignment operations should have different > characters. e.g. > > U+XXX1 TEST FOR EQUALITY > U+XXX2 ASSIGNMENT OPERATOR > > Initially the glyphs used for these characters could be = but then > this mechanism can be used to transition to a new and less ambiguous > visual representation. The new visual representation could be > something like > > U+XXX1 TEST FOR EQUALITY = > U+XXX2 ASSIGNMENT OPERATOR ? > > Such a visual and character distinction between the 2 functions must > surely make it easier for those learning to program and for > interpreter and compiler writers. 
I think it would also make for
> easier to read/understand program code.
>
> André
>

From markus.icu at gmail.com Mon Feb 9 13:16:37 2015
From: markus.icu at gmail.com (Markus Scherer)
Date: Mon, 9 Feb 2015 11:16:37 -0800
Subject: About cultural/languages communities flags
In-Reply-To:
References:
Message-ID:

On Mon, Feb 9, 2015 at 9:54 AM, Andrea Giammarchi <andrea.giammarchi at gmail.com> wrote:

> > if a cultural/language TLD is typed with Unicode RIS, then show the flag
> for these culture/language:
>

This does not work. The "Unicode RIS" are defined to be used in pairs, with semantics according to corresponding ISO 3166 alpha2 codes. In your examples, each successive pair will encode a flag.

If you want to represent every flag of every locality, you first have to figure out how to catalog and label them. You are mentioning provinces, one level down from nation states; I guess there are thousands of them. In much of Europe, every little village has its own flag and coat of arms. Where do you want the text encoding and fonts to stop?

markus

From shervinafshar at gmail.com Mon Feb 9 13:23:15 2015
From: shervinafshar at gmail.com (Shervin Afshar)
Date: Mon, 9 Feb 2015 11:23:15 -0800
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To: <54D8AE26.3030409@web.de>
References: <54D7C3EA.6080000@web.de> <54D8AE26.3030409@web.de>
Message-ID:

> But then it would be incompatible from IDE to IDE, like Python is incompatible using 2 spaces, 4 spaces and tabs.
> It's the data that is important, not the software.

Specifically talking about Python, we should not solve in Unicode what PEP 8[1] is intended for. Pythonistas and their IDEs are encouraged to use linters to address syntactical discrepancies. This, more or less, applies to other programming languages as well.
[1]: https://www.python.org/dev/peps/pep-0008/#tabs-or-spaces > You know, the fact that this consortium ever took emoji into consideration immediately justifies to include everything everyone ever wanted. There is no such thing as important data including emoji. :) If you read the background information (in TR51 or elsewhere) on Unicode emoji, you will see how common and widespread use of PUA by Japanese providers introduced interoperability issues with the rest of the world. And no...Addressing that major compatibility/interoperability issue (and any future issue raised from address that) do not justify inclusion of "everything everyone ever wanted". ? Shervin On Mon, Feb 9, 2015 at 4:55 AM, Alfred Zett wrote: > OK, I will now try to answer all of you in one mail, otherwise it gets > hard to overlook... > > Shervin Afshar: > >> All of the requirements mentioned here can be (and are) implemented in >> higher levels of software (like IDEs). IMO, there isn't any need for adding >> new characters to Unicode to address these issues. >> > But then it would be incompatible from IDE to IDE, like Python is > incompatible using 2 spaces, 4 spaces and tabs. > It's the data that is important, not the software. > >> >> Additionally, people tend to forget that simply because Unicode is doing >> emoji out of compatibility (or other) requirements, it does not mean that >> "now anything goes". I refer folks to TR51[1] (specifically sections 1.3, >> 8, and Annex C). >> >> [1]: http://www.unicode.org/reports/tr51 >> >> You know, the fact that this consortium ever took emoji into > consideration immediately justifies to include everything everyone ever > wanted. There is no such thing as important data including emoji. :) > > Jean-Francois Colson: > >> I need a few tens of characters for a conlang I?m developping. ? >> > Except two or three control characters don't make a con language. 
> Also, if you don't like con languages in Unicode, what's this: > http://unicode.org/charts/PDF/U1F700.pdf > > The problem is that Unicode only encodes characters which are effectively >> used today or which have been used in the past. It doesn?t encode >> characters which could perhaps be used in a hypothetical new programing >> language in the future. >> > So you want the font encoding scheme to be a limitating factor for new > things? > > Pierpaolo Bernardi: > >> How would your proposed character be displayed as plain text? >> > There is no such thing as plain text. > Even line breaks and tabs are a matter of interpretation. It's just that > they usually have typographic semantics, even in programming editors, with > all the side effects. > > In very simple (and with that I mean shitty or not even remotely > programming oriented) editors, it may show like a control character, like ?. > > Browsers and any editor passing the "based on scintilla" complexity mark > of course should display something that makes more sense, like an arrow or > ? plus surrounding space. > > Unicode is a standard for plain text. If you require a special IDE >> for your programming language then why use plain text at all? >> > Because binary custom encoded databases or blob files are the death of > interoperability. > > Konstantin Ritt: > >> Easier than latin1, a layout one could find on [almost] every keyboard? >> Good luck. >> > Also: > > Jean-Francois Colson: > >> Hard to input? Not harder than the new symbols you?d like to propose. >> That?s only a matter of keyboard layout and input method. >> > > Indent by pressing tab and insert the literal thing by pressing ". Nothing > changes, the IDE/editor does the work on the fly. > Just that you have clean semantics, interoperability and customizability. > > Beat that, APL. Where you would >10 key bindings or an annoying software > keyboard. > > I?ve never used APL so I don?t remember the meanings of its symbols, but >> couldn?t ? 
U+2358 APL FUNCTIONAL SYMBOL QUOTE UNDERBAR or ⍞ U+235E APL
>> FUNCTIONAL SYMBOL QUOTE QUAD work as “string literal quotes” in a new
>> programming language?
>>
> That's a good idea.
>
> That still leaves the indentation character, which is harder than that,
> because one would want a control character with certain semantics.
> E.G.: For programming editors it would make sense to only allow it after
> line breaks and convert other occurrences into tabs.
>
> If the IDE inputs your new character when you press tab, then your new
>> character is a tab?
>>
> Not if it detects the beginning of a line.
>
> Best regards
>
>
> A. Z.
>

From doug at ewellic.org Mon Feb 9 13:25:30 2015
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 09 Feb 2015 12:25:30 -0700
Subject: Unicode block for programming related symbols and codepoints?
Message-ID: <20150209122530.665a7a7059d7ee80bb4d670165c8327d.98030114b2.wbe@email03.secureserver.net>

Frédéric Grosshans wrote:

> The including of emoji was a considerable debate here, with people
> strongly against and strongly for. The trick is that they were already
> used as digital characters by Japanese Telcos and their millions of
> customers. They were de facto encoded as characters in Japanese text
> messages. At the time of encoding, the spread of smartphones made them
> appear in other places (emails, web forums, etc.)

Sorry, I can't let the "compatibility" argument go unchallenged again.

It can be argued – and was, repeatedly and persuasively – that the initial collection of emoji in Unicode 6.1 [1] were added for compatibility with Japanese telco extensions to JIS.
But the additional emoji added to Unicode 6.2 and 7.0, and planned for 8.0, do not have even this provenance; they were added on foot of novel proposals sent directly to Unicode, or (more recently) by "popular request." There is no longer any requirement that the robot faces and burritos appear first in any sort of industry character set extension, with which Unicode is then obliged to maintain compatibility.

[1] No, I am not counting the ARIB symbols or any other long-encoded symbols that have been retroactively defined as emoji, to help legitimize the latter.

Alfred Zett wrote:

> The trick is that one doesn't bargain with Telcos and similar
> criminals. Gotta drop them hard and the pest will go away by itself
> after five years or so.

This does not help to make a case for or against encoding of anything.

--
Doug Ewell | Thornton, CO, USA | http://ewellic.org

From doug at ewellic.org Mon Feb 9 13:28:44 2015
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 09 Feb 2015 12:28:44 -0700
Subject: Unicode block for programming related symbols and codepoints?
Message-ID: <20150209122844.665a7a7059d7ee80bb4d670165c8327d.241b973136.wbe@email03.secureserver.net>

I can't count:

> It can be argued – and was, repeatedly and persuasively – that
> the initial collection of emoji in Unicode 6.1

6.0

> But the additional emoji added to Unicode 6.2 and 7.0

6.1 and 7.0

--
Doug Ewell | Thornton, CO, USA | http://ewellic.org

From shervinafshar at gmail.com Mon Feb 9 13:44:54 2015
From: shervinafshar at gmail.com (Shervin Afshar)
Date: Mon, 9 Feb 2015 11:44:54 -0800
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To: <20150209122530.665a7a7059d7ee80bb4d670165c8327d.98030114b2.wbe@email03.secureserver.net> References: <20150209122530.665a7a7059d7ee80bb4d670165c8327d.98030114b2.wbe@email03.secureserver.net> Message-ID: > > There is no longer any requirement that the robot faces and > burritos appear first in any sort of industry character set extension, > with which Unicode is then obliged to maintain compatibility. Only if you don't consider existing usage and popular requests as requirement and precedence; for example Gmail had Robot Face for a long time. ? Shervin On Mon, Feb 9, 2015 at 11:25 AM, Doug Ewell wrote: > Fr?d?ric Grosshans wrote: > > > The including of emoji was a considerable debate here, with people > > strongly against and strongly for. The trick is that they were already > > used as digital characters by Japanese Telcos and their millions of > > customers. They were de facto encoded as characters in Japanese text > > messages. At the time of encoding, the spread of smartphones made them > > appear in other places (emails, web forums, etc.) > > Sorry, I can't let the "compatibility" argument go unchallenged again. > > It can be argued ? and was, repeatedly and persuasively ? that the > initial collection of emoji in Unicode 6.1 [1] were added for > compatibility with Japanese telco extensions to JIS. > > But the additional emoji added to Unicode 6.2 and 7.0, and planned for > 8.0, do not have even this provenance; they were added on foot of novel > proposals sent directly to Unicode, or (more recently) by "popular > request." There is no longer any requirement that the robot faces and > burritos appear first in any sort of industry character set extension, > with which Unicode is then obliged to maintain compatibility. > > [1] No, I am not counting the ARIB symbols or any other long-encoded > symbols that have been retroactively defined as emoji, to help > legitimize the latter. 
>
> Alfred Zett
> > The trick is that one doesn't bargain with Telcos and similar
> > criminals. Gotta drop them hard and the pest will go away by itself
> > after five years or so.
>
> This does not help to make a case for or against encoding of anything.
>
> --
> Doug Ewell | Thornton, CO, USA | http://ewellic.org
>

From haberg-1 at telia.com Mon Feb 9 13:48:21 2015
From: haberg-1 at telia.com (Hans Aberg)
Date: Mon, 9 Feb 2015 20:48:21 +0100
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To: <54D8F9A5.9070302@att.net>
References: <54D7C3EA.6080000@web.de> <9B99B8A6-BB16-4DC0-A193-5D5869274040@lboro.ac.uk> <54D8F9A5.9070302@att.net>
Message-ID: <6A89E653-0EC3-4E2A-849B-53F1A54CB5B6@telia.com>

> On 9 Feb 2015, at 19:17, Ken Whistler wrote:
...
> The use in C of "=" and "==" was badly designed
> from the start, and is the source of bezillions of inadvertent programming
> errors in practice.

It is the ample oversupply of implicit conversions in combination with the lack of a proper boolean type that is causing those programming errors.

> But if a left arrow, for example, might be a better choice for an assignment
> operator in a programming language, and a two-character ASCII operator
> like ":=" or "<-" doesn't seem appropriate or causes other confusion, there
> still isn't a character *encoding* issue here. Just use "←", which already exists (U+2190),
> and is a fine left arrow!

There are also ≔ COLON EQUALS U+2254 and others. No problems using such characters in Flex: The problem is the lack of input methods.
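The two characters named in this exchange can be looked up directly in the Unicode Character Database. The short Python sketch below uses the standard unicodedata module; the helper name describe and the two constant names are made up for this illustration. It also shows one possible workaround for the input-method problem, namely writing the character as an escape sequence in source code, which may or may not suit a given language or editor.

```python
import unicodedata

# Escape sequences let these characters be entered from any keyboard.
LEFT_ARROW = "\u2190"
COLON_EQUALS = "\u2254"


def describe(ch):
    """Return the code point label and formal Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


if __name__ == "__main__":
    for ch in (LEFT_ARROW, COLON_EQUALS):
        print(describe(ch))
```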
From doug at ewellic.org Mon Feb 9 14:16:58 2015
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 09 Feb 2015 13:16:58 -0700
Subject: Emoji (was: Re: Unicode block for programming related symbols and codepoints?)
Message-ID: <20150209131658.665a7a7059d7ee80bb4d670165c8327d.de86ceee88.wbe@email03.secureserver.net>

Shervin Afshar wrote:

>> There is no longer any requirement that the robot faces and burritos appear first in any sort of industry character set extension, with which Unicode is then obliged to maintain compatibility.
>
> Only if you don't consider existing usage and popular requests as requirement and precedence; for example Gmail had Robot Face for a long time.

I said there was no longer a requirement *that the items appear first in an industry character set extension*, right?

In what character encoding standard, or extension, does ROBOT FACE appear? "Gmail has it" is not a character encoding standard. Neither is "People want to see it."

"Most popularly requested," as a criterion for adding a character, is absolutely new to Unicode. Earlier I wrote privately to a Unicode officer about whether PERSON TAKING SELFIE and GIRL TWERKING and PERSON DUMPING ICE BUCKET OVER HEAD would be ephemeral enough, and got no reply. (What, you've forgotten the ice-bucket craze already? That's exactly why "most popular at the moment" wasn't supposed to be a criterion.)

--
Doug Ewell | Thornton, CO, USA | http://ewellic.org

From frederic.grosshans at gmail.com Mon Feb 9 14:34:33 2015
From: frederic.grosshans at gmail.com (Frédéric Grosshans)
Date: Mon, 9 Feb 2015 21:34:33 +0100
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To: <20150209122530.665a7a7059d7ee80bb4d670165c8327d.98030114b2.wbe@email03.secureserver.net>
References: <20150209122530.665a7a7059d7ee80bb4d670165c8327d.98030114b2.wbe@email03.secureserver.net>
Message-ID:

On 9 Feb 2015 20:27, "Doug Ewell" wrote:
>
> Sorry, I can't let the "compatibility" argument go unchallenged again.
>

I stand corrected (and I should have known better!)

From everson at evertype.com Mon Feb 9 14:36:33 2015
From: everson at evertype.com (Michael Everson)
Date: Mon, 9 Feb 2015 20:36:33 +0000
Subject: Emoji (was: Re: Unicode block for programming related symbols and codepoints?)
In-Reply-To: <20150209131658.665a7a7059d7ee80bb4d670165c8327d.de86ceee88.wbe@email03.secureserver.net>
References: <20150209131658.665a7a7059d7ee80bb4d670165c8327d.de86ceee88.wbe@email03.secureserver.net>
Message-ID: <62EBF72F-6832-4174-946C-234508DE434D@evertype.com>

I like symbols a lot. But I know that I and a number of people have been thinking that too much emphasis is being put on emoji.

Michael Everson * http://www.evertype.com/

From andrea.giammarchi at gmail.com Mon Feb 9 15:02:54 2015
From: andrea.giammarchi at gmail.com (Andrea Giammarchi)
Date: Mon, 9 Feb 2015 22:02:54 +0100
Subject: About cultural/languages communities flags
In-Reply-To:
References:
Message-ID:

Thanks, that was somehow indeed my very first concern. Everyone could claim an emoji, at that point.

Enough info for me so far, so thanks again.

Best Regards

On Mon, Feb 9, 2015 at 8:16 PM, Markus Scherer wrote:

> On Mon, Feb 9, 2015 at 9:54 AM, Andrea Giammarchi <andrea.giammarchi at gmail.com> wrote:
>
>> > if a cultural/language TLD is typed with Unicode RIS, then show the flag for these culture/language:
>
> This does not work. The "Unicode RIS" are defined to be used in pairs, with semantics according to corresponding ISO 3166 alpha2 codes. In your examples, each successive pair will encode a flag.
>
> If you want to represent every flag of every locality, you first have to
> figure out how to catalog and label them. You are mentioning provinces, one
You are mentioning provinces, one > level down from nation states; I guess there are thousands of them. In much > of Europe, every little village > has its own flag and coat of arms. Where do you want the text encoding and > fonts to stop? > > markus > -------------- next part -------------- An HTML attachment was scrubbed... URL: From alfred_z at web.de Mon Feb 9 15:04:32 2015 From: alfred_z at web.de (Alfred Zett) Date: Mon, 09 Feb 2015 22:04:32 +0100 Subject: Emoji (was: Re: Unicode block for programming related symbols and codepoints?) In-Reply-To: <20150209131658.665a7a7059d7ee80bb4d670165c8327d.de86ceee88.wbe@email03.secureserver.net> References: <20150209131658.665a7a7059d7ee80bb4d670165c8327d.de86ceee88.wbe@email03.secureserver.net> Message-ID: <54D920E0.8020008@web.de> Doug Ewell: > "Most popularly requested," as a criterion for adding a character, is > absolutely new to Unicode. Earlier I wrote privately to a Unicode > officer about whether PERSON TAKING SELFIE and GIRL TWERKING and > PERSON DUMPING ICE BUCKET OVER HEAD would be ephemeral enough, and got > no reply. (What, you've forgotten the ice-bucket craze already? That's > exactly why "most popular at the moment" wasn't supposed to be a > criterion.) There is much truth in this. I'll now leave the discussion, because it doesn't lead anywhere. Best regards, A. Z. From shervinafshar at gmail.com Mon Feb 9 15:12:52 2015 From: shervinafshar at gmail.com (Shervin Afshar) Date: Mon, 9 Feb 2015 13:12:52 -0800 Subject: Emoji (was: Re: Unicode block for programming related symbols and codepoints?) In-Reply-To: <20150209131658.665a7a7059d7ee80bb4d670165c8327d.de86ceee88.wbe@email03.secureserver.net> References: <20150209131658.665a7a7059d7ee80bb4d670165c8327d.de86ceee88.wbe@email03.secureserver.net> Message-ID: > > I said there was no longer a requirement *that the items appear first in > an industry character set extension*, right? 
> The issue is with your very rigid interpretation of the criteria for encoding new symbols. Is "appearing in an industry character set extension" an official phrasing that you keep referring to? In what character encoding standard, or extension, does ROBOT FACE > appear? "Gmail has it" is not a character encoding standard. Neither is > "People want to see it." > Robot Face is available on Gmail (GChat), Facebook, and Twitch among others (calculating the size of user community is left as an assignment for the reader). That's enough usage for consideration by the UTC even if the symbol is not present in a character encoding standard. Also, since Unicode is an industry standard maintained by industry members (among others), then if there is enough request to these corporations from communities of users, then there might be some reason for considering those symbols. I think that's the case for the newer symbols. > "Most popularly requested," as a criterion for adding a character, is > absolutely new to Unicode. Earlier I wrote privately to a Unicode > officer about whether PERSON TAKING SELFIE and GIRL TWERKING and PERSON > DUMPING ICE BUCKET OVER HEAD would be ephemeral enough, and got no > reply. (What, you've forgotten the ice-bucket craze already? That's > exactly why "most popular at the moment" wasn't supposed to be a > criterion.) IMO, Unicode officers seems to have low patience for such sentiments. You might want to reconsider your tone. There is a time and place for sarcasm. ? Shervin On Mon, Feb 9, 2015 at 12:16 PM, Doug Ewell wrote: > Shervin Afshar wrote: > > >> There is no longer any requirement that the robot faces and > >> burritos appear first in any sort of industry character set > >> extension, with which Unicode is then obliged to maintain > >> compatibility. > > > > Only if you don't consider existing usage and popular requests as > > requirement and precedence; for example Gmail had Robot Face for a > > long time. 
> > I said there was no longer a requirement *that the items appear first in > an industry character set extension*, right? > > In what character encoding standard, or extension, does ROBOT FACE > appear? "Gmail has it" is not a character encoding standard. Neither is > "People want to see it." > > "Most popularly requested," as a criterion for adding a character, is > absolutely new to Unicode. Earlier I wrote privately to a Unicode > officer about whether PERSON TAKING SELFIE and GIRL TWERKING and PERSON > DUMPING ICE BUCKET OVER HEAD would be ephemeral enough, and got no > reply. (What, you've forgotten the ice-bucket craze already? That's > exactly why "most popular at the moment" wasn't supposed to be a > criterion.) > > -- > Doug Ewell | Thornton, CO, USA | http://ewellic.org > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From markus.icu at gmail.com Mon Feb 9 15:21:06 2015 From: markus.icu at gmail.com (Markus Scherer) Date: Mon, 9 Feb 2015 13:21:06 -0800 Subject: About cultural/languages communities flags In-Reply-To: References: Message-ID: On Mon, Feb 9, 2015 at 1:11 PM, Joan Montan? wrote: > AFAIK, this is done in font side. Emoji flags are just ligatures, so a > font can provide a ligature for 4 RIS characters. > Technically true, but a font that violates the encoding standard would cause large problems. Imagine a font that ligates letters 't' and 'h' and displays an Egyptian hieroglyph for the combination. What's the way for encoding them in Unicode standard? > In principle, the way for encoding anything in the Unicode Standard is to write a well-formed proposal, and convince the Unicode Technical Committee and ISO JTC1/SC2 that the proposal has merit. 
However, I would much prefer if everyone spent their considerable energy on upgrading protocols (e.g., IETF RFCs for email subject lines) and lobby relevant vendors (e.g., chat services & social network messages) to support images embedded in the text stream, ideally with scaling and other behavior that would make them behave somewhat text-like.

Best regards,
markus

From jf at colson.eu Mon Feb 9 15:07:29 2015
From: jf at colson.eu (Jean-François Colson)
Date: Mon, 09 Feb 2015 22:07:29 +0100
Subject: Unicode block for programming related symbols and codepoints?
Message-ID: <4rhiswo1bpjq22xajg9jw62q.1423515992998@email.android.com>

-------- Original message --------
From: Hans Aberg
Date: 09/02/2015 20:48 (GMT+01:00)
To: Ken Whistler
Cc: Unicode Mailing List
Subject: Re: Unicode block for programming related symbols and codepoints?

> On 9 Feb 2015, at 19:17, Ken Whistler wrote:
...
> But if a left arrow, for example, might be a better choice for an assignment operator in a programming language, and a two-character ASCII operator like ":=" or "<-" doesn't seem appropriate or causes other confusion, there still isn't a character *encoding* issue here. Just use "←", which already exists (U+2190), and is a fine left arrow!

There are also
≔ COLON EQUALS U+2254
and others.

No problems using such characters in Flex:

The problem is the lack of input methods.

No problem for me: I can input a ← by typing either Alt Gr + 4 (on the numeric keypad) or compose + < + -
I have no way to type "colon equals" but to type it as compose + : + = I should simply add one single line to my ~/.XCompose file:
: U2254
and restart my text editor. That isn't more difficult than that.
(I'm on my phone right now.)

-------------- next part --------------
An HTML attachment was scrubbed...
URL: From doug at ewellic.org Mon Feb 9 16:17:36 2015 From: doug at ewellic.org (Doug Ewell) Date: Mon, 09 Feb 2015 15:17:36 -0700 Subject: Emoji (was: Re: Unicode block for programming related symbols and =?UTF-8?Q?codepoints=3F=29?= Message-ID: <20150209151736.665a7a7059d7ee80bb4d670165c8327d.e90f409a66.wbe@email03.secureserver.net> Shervin Afshar wrote: > The issue is with your very rigid interpretation of the criteria for > encoding new symbols. Is "appearing in an industry character set > extension" an official phrasing that you keep referring to? It was either from the WG2 Principles and Procedures document, or some other bit of Unicode/10646 folklore that I've read over the past 22 years of keeping up with Unicode/10646. I should look up the exact wording. Of course, Unicode can encode anything they please. That's not in question. But in order to claim "compatibility" as the basis for encoding something, these specific, "rigid" definitions and criteria have historically been required. "Compatibility" with any random JPEG or meme that makes the rounds on the Internet was not enough. > Robot Face is available on Gmail (GChat), Facebook, and Twitch among > others (calculating the size of user community is left as an > assignment for the reader). That's enough usage for consideration by > the UTC even if the symbol is not present in a character encoding > standard. Also, since Unicode is an industry standard maintained by > industry members (among others), then if there is enough request to > these corporations from communities of users, then there might be some > reason for considering those symbols. I think that's the case for the > newer symbols. Great. Go ahead and encode them, UTC. But don't say it's because your hands are tied and you have no choice. > IMO, Unicode officers seems to have low patience for such sentiments. > You might want to reconsider your tone. There is a time and place for > sarcasm. I'll take my chances. 
I've been called out before for discouraging list members from requesting things that were out of scope according to the old rules. All I'm saying now is, if the old rules no longer apply, say so. -- Doug Ewell | Thornton, CO, USA | http://ewellic.org From joan at montane.cat Mon Feb 9 15:11:01 2015 From: joan at montane.cat (=?ISO-8859-1?Q?Joan_Montan=E9?=) Date: Mon, 9 Feb 2015 22:11:01 +0100 Subject: About cultural/languages communities flags In-Reply-To: References: Message-ID: Hi all, I am the one who made the request to tweemoji Github. 2015-02-09 20:16 GMT+01:00 Markus Scherer : > On Mon, Feb 9, 2015 at 9:54 AM, Andrea Giammarchi < > andrea.giammarchi at gmail.com> wrote: > >> > if a cultural/language TLD is typed with Unicode RIS, then show the >> flag for these culture/language: >> > > This does not work. The "Unicode RIS" are defined to be used in pairs, > with semantics according to corresponding ISO 3166 alpha2 codes. In your > examples, each successive pair will encode a flag. > > AFAIK, this is done in font side. Emoji flags are just ligatures, so a font can provide a ligature for 4 RIS characters. This is not an issue here. I agree some strange behaviour can appear if a 3 RIS string, take CAT, is shown in a system with only 2 RIS support (a Canadian will appear followed by a T). If you want to represent every flag of every locality, you first have to > figure out how to catalog and label them. You are mentioning provinces, one > level down from nation states; I guess there are thousands of them. In much > of Europe, every little village > has its own flag and coat of arms. Where do you want the text encoding and > fonts to stop? > > I don't request flag support for every flag in the world. I requested flags for culture/language communities *with* an approved TLD (Top Level Domain). I know flags are an issue, and I know flags represents territories, not languages, but I think some support should be done for these active communities. 
As I pointed, some country flag collections expand with a fews non-independent country. See [1], [2] and [3] (search for Scottish or Welsh flag). You can check this [4] petition requesting Catalan flag on WhatsApp. So, there is a demand and they are used in real world. What's the way for encoding them in Unicode standard? Thanks, Joan Montan? [1] http://www.famfamfam.com/lab/icons/flags/ [2] https://www.gosquared.com/resources/flag-icons/ [3] http://www.sherv.net/flag-emoticons.html [4] https://www.change.org/p/whatsapp-inc-incloure-la-senyera-de-catalunya-a-whatsapp -------------- next part -------------- An HTML attachment was scrubbed... URL: From joan at montane.cat Mon Feb 9 15:18:23 2015 From: joan at montane.cat (=?ISO-8859-1?Q?Joan_Montan=E9?=) Date: Mon, 9 Feb 2015 22:18:23 +0100 Subject: About cultural/languages communities flags In-Reply-To: References: Message-ID: Sorry, my reply was sended CC: to Unicode ML, My apologies, Joan Montan? 2015-02-09 22:11 GMT+01:00 Joan Montan? : > > Hi all, > > I am the one who made the request to tweemoji Github. > > > 2015-02-09 20:16 GMT+01:00 Markus Scherer : > >> On Mon, Feb 9, 2015 at 9:54 AM, Andrea Giammarchi < >> andrea.giammarchi at gmail.com> wrote: >> >>> > if a cultural/language TLD is typed with Unicode RIS, then show the >>> flag for these culture/language: >>> >> >> This does not work. The "Unicode RIS" are defined to be used in pairs, >> with semantics according to corresponding ISO 3166 alpha2 codes. In your >> examples, each successive pair will encode a flag. >> >> > AFAIK, this is done in font side. Emoji flags are just ligatures, so a > font can provide a ligature for 4 RIS characters. This is not an issue here. > > I agree some strange behaviour can appear if a 3 RIS string, take CAT, is > shown in a system with only 2 RIS support (a Canadian will appear followed > by a T). > > > If you want to represent every flag of every locality, you first have to >> figure out how to catalog and label them. 
You are mentioning provinces, one >> level down from nation states; I guess there are thousands of them. In much >> of Europe, every little village >> has its own flag and coat of >> arms. Where do you want the text encoding and fonts to stop? >> >> > I don't request flag support for every flag in the world. I requested > flags for culture/language communities *with* an approved TLD (Top Level > Domain). > > I know flags are an issue, and I know flags represents territories, not > languages, but I think some support should be done for these active > communities. As I pointed, some country flag collections expand with a fews > non-independent country. See [1], [2] and [3] (search for Scottish or > Welsh flag). You can check this [4] petition requesting Catalan flag on > WhatsApp. > > So, there is a demand and they are used in real world. What's the way for > encoding them in Unicode standard? > > Thanks, > > Joan Montan? > > [1] http://www.famfamfam.com/lab/icons/flags/ > [2] https://www.gosquared.com/resources/flag-icons/ > [3] http://www.sherv.net/flag-emoticons.html > [4] > https://www.change.org/p/whatsapp-inc-incloure-la-senyera-de-catalunya-a-whatsapp > -------------- next part -------------- An HTML attachment was scrubbed... URL: From haberg-1 at telia.com Mon Feb 9 16:33:49 2015 From: haberg-1 at telia.com (Hans Aberg) Date: Mon, 9 Feb 2015 23:33:49 +0100 Subject: Unicode block for programming related symbols and codepoints? 
In-Reply-To: <4rhiswo1bpjq22xajg9jw62q.1423515992998@email.android.com>
References: <4rhiswo1bpjq22xajg9jw62q.1423515992998@email.android.com>
Message-ID: <6E316B18-AB55-43B8-9441-E6E06173A31A@telia.com>

> On 9 Feb 2015, at 22:07, Jean-François Colson wrote:
>
>> > But if a left arrow, for example, might be a better choice for an assignment operator in a programming language, and a two-character ASCII operator like ":=" or "<-" doesn't seem appropriate or causes other confusion, there still isn't a character *encoding* issue here. Just use "←", which already exists (U+2190), and is a fine left arrow!
>>
>> There are also
>> ≔ COLON EQUALS U+2254
>> and others.
>>
>> No problems using such characters in Flex:
>>
>> The problem is the lack of input methods.
>
> No problem for me: I can input a ← by typing either Alt Gr + 4 (on the numeric keypad) or compose + < + -
> I have no way to type "colon equals" but to type it as compose + : + = I should simply add one single line to my ~/.XCompose file:
> : U2254
> and restart my text editor. That isn't more difficult than that.

The problem is that there are a lot of characters, and it is rather time-consuming to design one's own input methods.

From doug at ewellic.org Mon Feb 9 16:38:42 2015
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 09 Feb 2015 15:38:42 -0700
Subject: About cultural/languages communities flags
Message-ID: <20150209153842.665a7a7059d7ee80bb4d670165c8327d.766f5788f4.wbe@email03.secureserver.net>

Joan Montané wrote:

> I don't request flag support for every flag in the world. I requested flags for culture/language communities *with* an approved TLD (Top Level Domain).

Incidentally, about a year and a half ago I discussed this with another list member, on- and off-list. We agreed that some sort of text-based encoding of flags would be an interesting project, but disagreed as to whether this was a Unicode problem.
The present discussion seems to approach the issue from the other side: treat it as *only* a Unicode problem, and assume that the encoding problem has been solved by TLD registration. See also http://www.unicode.org/faq/emoji_dingbats.html#12 . This is the Unicode Consortium talking, not me. -- Doug Ewell | Thornton, CO, USA | http://ewellic.org From shervinafshar at gmail.com Mon Feb 9 17:04:30 2015 From: shervinafshar at gmail.com (Shervin Afshar) Date: Mon, 9 Feb 2015 15:04:30 -0800 Subject: Emoji (was: Re: Unicode block for programming related symbols and codepoints?) In-Reply-To: <20150209151736.665a7a7059d7ee80bb4d670165c8327d.e90f409a66.wbe@email03.secureserver.net> References: <20150209151736.665a7a7059d7ee80bb4d670165c8327d.e90f409a66.wbe@email03.secureserver.net> Message-ID: > It was either from the WG2 Principles and Procedures document, or some > other bit of Unicode/10646 folklore that I've read over the past 22 >years of keeping up with Unicode/10646. I should look up the exact > wording. Yes, please. I would like to have that policy noted for my future use. > Of course, Unicode can encode anything they please. That's not in > question. But in order to claim "compatibility" as the basis for > encoding something, these specific, "rigid" definitions and criteria > have historically been required. "Compatibility" with any random JPEG or > meme that makes the rounds on the Internet was not enough. It's not about encoding what "they" please. Compatibility was the issue with the first set of emoji symbols. The rest of symbols are being added for various other reasons; e.g. diversity, parity, requests, etc. Also, random JPEG and meme don't apply here and you're mistaken to assume that GChat and Facebook fit in this category. > Great. Go ahead and encode them, UTC. But don't say it's because your > hands are tied and you have no choice. Quoting an official UTC communication? > I'll take my chances. 
I've been called out before for discouraging list > members from requesting things that were out of scope according to the > old rules. All I'm saying now is, if the old rules no longer apply, say > so. AFAIK, rules haven't changed. Unicode didn't have a policy regarding emoji and symbols with similar usage. Now it does. For a longer while now, some folks tend to use emoji as means to an end other than what is in the scope of conversation regarding emoji. And that is not acceptable. ? Shervin On Mon, Feb 9, 2015 at 2:17 PM, Doug Ewell wrote: > Shervin Afshar wrote: > > > The issue is with your very rigid interpretation of the criteria for > > encoding new symbols. Is "appearing in an industry character set > > extension" an official phrasing that you keep referring to? > > It was either from the WG2 Principles and Procedures document, or some > other bit of Unicode/10646 folklore that I've read over the past 22 > years of keeping up with Unicode/10646. I should look up the exact > wording. > > Of course, Unicode can encode anything they please. That's not in > question. But in order to claim "compatibility" as the basis for > encoding something, these specific, "rigid" definitions and criteria > have historically been required. "Compatibility" with any random JPEG or > meme that makes the rounds on the Internet was not enough. > > > Robot Face is available on Gmail (GChat), Facebook, and Twitch among > > others (calculating the size of user community is left as an > > assignment for the reader). That's enough usage for consideration by > > the UTC even if the symbol is not present in a character encoding > > standard. Also, since Unicode is an industry standard maintained by > > industry members (among others), then if there is enough request to > > these corporations from communities of users, then there might be some > > reason for considering those symbols. I think that's the case for the > > newer symbols. > > Great. Go ahead and encode them, UTC. 
But don't say it's because your > hands are tied and you have no choice. > > > IMO, Unicode officers seems to have low patience for such sentiments. > > You might want to reconsider your tone. There is a time and place for > > sarcasm. > > I'll take my chances. I've been called out before for discouraging list > members from requesting things that were out of scope according to the > old rules. All I'm saying now is, if the old rules no longer apply, say > so. > > -- > Doug Ewell | Thornton, CO, USA | http://ewellic.org > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenwhistler at att.net Mon Feb 9 17:11:39 2015 From: kenwhistler at att.net (Ken Whistler) Date: Mon, 09 Feb 2015 15:11:39 -0800 Subject: About cultural/languages communities flags In-Reply-To: <20150209153842.665a7a7059d7ee80bb4d670165c8327d.766f5788f4.wbe@email03.secureserver.net> References: <20150209153842.665a7a7059d7ee80bb4d670165c8327d.766f5788f4.wbe@email03.secureserver.net> Message-ID: <54D93EAB.7010409@att.net> To follow up on Doug Ewell's response, the mechanism currently standardized in the Unicode Standard for "regional indicator codes" has an interpretation tied to the two-letter codes of ISO 3166-1, and *not* to TLD's. The two are not directly connected. If anyone really wants to pursue getting a Scots flag into general implementation via Unicode regional indicator codes, the correct way to make that happen is for somebody to get off their duff and convince the BSI (British Standards Institute) to put in for an exceptional reservation of a two-letter code for Scotland in ISO 3166-1 by petitioning the ISO 3166/MA. See: http://en.wikipedia.org/wiki/ISO_3166-1_alpha-2 for the full context, and for the current 26x26 letter matrix which is the basis for the flag glyph implementations of regional indicator code pairs on smartphones. SC, SO, ST are already taken, but might I suggest putting in for registering "AB" for Alba? 
That one is currently unassigned. Yeah, yeah, what is the likelihood of BSI pushing for a Scots two-letter code?! But seriously, if folks are planning ahead for Scots independence or even some kind of greater autonomy, this is an issue that needs to be worked, anyway. In the meantime, let me reiterate that there is *no* formal relationship between TLD's and the regional indicator codes in Unicode (or the implementations built upon them). Well, yes, a bunch of registered TLD's do match the country codes, but there is no two-letter constraint on TLD's. This should already be apparent, as Scotland has registered ".scot" At this point there isn't even a limitation of TLD's to ASCII letters, so there is no way to map them to the limited set of regional indicator codes in the Unicode Standard. Not having a two letter country code for Scotland that matches the four letter TLD for Scotland might indeed be a problem for someone, but I don't see *this* as a problem that the Unicode Standard needs to solve. --Ken On 2/9/2015 2:38 PM, Doug Ewell wrote: > Joan Montan? wrote: > >> I don't request flag support for every flag in the world. I requested >> flags for culture/language communities *with* an approved TLD (Top >> Level Domain). > Incidentally, about a year and a half ago I discussed this with another > list member, on- and off-list. We agreed that some sort of text-based > encoding of flags would be an interesting project, but disagreed as to > whether this was a Unicode problem. > > The present discussion seems to approach the issue from the other side: > treat it as *only* a Unicode problem, and assume that the encoding > problem has been solved by TLD registration. 
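[Editorial illustration, not part of the original messages.] The pairing rule described above can be sketched in a few lines of Python. This is a minimal sketch, assuming only that the regional indicator symbols occupy U+1F1E6 through U+1F1FF, one per letter A to Z. The helper names `to_ris` and `split_ris_pairs` are made up for illustration, and the code does not consult the ISO 3166-1 registry, so it cannot say whether a given pair is actually assigned.

```python
RIS_BASE = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def to_ris(letters):
    """Map ASCII letters A-Z to the corresponding regional indicator symbols."""
    return "".join(
        chr(RIS_BASE + ord(c) - ord("A"))
        for c in letters.upper()
        if "A" <= c <= "Z"
    )


def split_ris_pairs(text):
    """Read a run of regional indicators left to right, two symbols at a time.

    Returns the letter codes; a trailing unpaired symbol comes back as a
    single letter.
    """
    letters = [
        chr(ord(c) - RIS_BASE + ord("A"))
        for c in text
        if RIS_BASE <= ord(c) <= RIS_BASE + 25
    ]
    return ["".join(letters[i:i + 2]) for i in range(0, len(letters), 2)]


print(split_ris_pairs(to_ris("CAT")))   # ['CA', 'T']
print(split_ris_pairs(to_ris("SCOT")))  # ['SC', 'OT']
```

Read this way, a three-symbol run spelling CAT yields the pair CA plus a stray T, and a four-symbol run spelling SCOT yields SC and OT. Neither result has any connection to a TLD such as ".scot", which is the mismatch the message above points out.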
> > From doug at ewellic.org Mon Feb 9 17:53:44 2015 From: doug at ewellic.org (Doug Ewell) Date: Mon, 09 Feb 2015 16:53:44 -0700 Subject: About cultural/languages communities flags Message-ID: <20150209165344.665a7a7059d7ee80bb4d670165c8327d.8fe2797a38.wbe@email03.secureserver.net> And just another follow-up, to try to explain *why* the mechanism for Regional Indicator Codes might be so closely tied to ISO 3166-1 alpha-2 code elements: ISO 3166-1 codes are derived from code elements published by the United Nations Statistics Division. This is the group that ultimately decides "what is and isn't a country" for the purposes of these codes. While there is inevitably some political influence in the UN, many organizations and projects that use ISO 3166-1 codes do so to avoid getting embroiled in their own debate over "what is a country." The IETF language-tagging project (BCP 47, RFC 5646; see "IETF language tag" in Wikipedia for more information) is one example. Conversely, it is sometimes the case that groups which seek to extend the set of ISO 3166-1 codes unilaterally, or to establish a competing or supplemental coding system, might do so in order to gain acceptance or establish credibility for a nation or territory that is not recognized as such by UNSD. It is entirely reasonable (IMHO) to suggest that if Unicode were to attempt, by whatever means, to enable encoding of flags for entities beyond those encoded in ISO 3166-1, that the door would be opened wide for unrecognized nations and separatist groups to claim that the Unicode Consortium "supports" their cause by supporting display of their flag. It's very possible that Unicode has thought of this and does not want to put itself in that position. 
-- Doug Ewell | Thornton, CO, USA | http://ewellic.org From chris.fynn at gmail.com Mon Feb 9 22:37:01 2015 From: chris.fynn at gmail.com (Christopher Fynn) Date: Tue, 10 Feb 2015 10:37:01 +0600 Subject: About cultural/languages communities flags In-Reply-To: References: Message-ID: Using flags to indicate particular languages on websites has plenty of problems - languages need a better indicator. Scripts could be indicated by a representative glyph. From mark at macchiato.com Tue Feb 10 00:10:56 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Tue, 10 Feb 2015 07:10:56 +0100 Subject: About cultural/languages communities flags In-Reply-To: <54D93EAB.7010409@att.net> References: <20150209153842.665a7a7059d7ee80bb4d670165c8327d.766f5788f4.wbe@email03.secureserver.net> <54D93EAB.7010409@att.net> Message-ID: On Tue, Feb 10, 2015 at 12:11 AM, Ken Whistler wrote: > for the full context, and for the current 26x26 letter matrix which is > the basis for the flag glyph implementations of regional indicator > code pairs on smartphones. > > SC, SO, ST are already taken, but might I suggest putting in for > registering > "AB" for Alba? That one is currently unassigned. > > Yeah, yeah, what is the likelihood of BSI pushing for a Scots two-letter > code?! But seriously, if folks are planning ahead for Scots independence > or even some kind of greater autonomy, this is an issue that needs to > be worked, anyway. > > In the meantime, let me reiterate that there is *no* formal relationship > between TLD's and the regional indicator codes in Unicode (or the > implementations > built upon them). Well, yes, a bunch of registered TLD's do match the > country > codes, but there is no two-letter constraint on TLD's. 
> This should already be apparent, as Scotland has registered ".scot". At this point there isn't even a limitation of TLD's to ASCII letters, so there is no way to map them to the limited set of regional indicator codes in the Unicode Standard.
>
> Not having a two letter country code for Scotland that matches the four letter TLD for Scotland might indeed be a problem for someone, but I don't see *this* as a problem that the Unicode Standard needs to solve.

I want to add to that that there are already a fair number of ISO 2-letter codes for regions that are administered as part of another country, like Hong Kong. There are also codes for crown possessions like Guernsey. So having a code for Scotland (and Wales, and N. Ireland) does not really break precedent. But as Ken says, the best mechanism is for the UK to push for a code in ISO and the UN.

Mark
*Il meglio è l'inimico del bene*

From joan at montane.cat Tue Feb 10 01:32:17 2015
From: joan at montane.cat (Joan Montané)
Date: Tue, 10 Feb 2015 08:32:17 +0100
Subject: About cultural/languages communities flags
In-Reply-To:
References: <20150209153842.665a7a7059d7ee80bb4d670165c8327d.766f5788f4.wbe@email03.secureserver.net> <54D93EAB.7010409@att.net>
Message-ID:

Thanks for your replies,

As far as I see, my informal request for expanding current RIS design hasn't had a good response. I understand it. Flags are a cause of disputes, and it isn't an issue for Unicode to encode them.

IMHO, keeping it tied to 2-alpha codes is a poor choice for users. Maybe industry manufacturers could find a better approach.

Best regards,
Joan Montané

-------------- next part --------------
An HTML attachment was scrubbed...

From doug at ewellic.org Tue Feb 10 10:16:14 2015
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 10 Feb 2015 09:16:14 -0700
Subject: About cultural/languages communities flags
Message-ID: <20150210091614.665a7a7059d7ee80bb4d670165c8327d.a342055f13.wbe@email03.secureserver.net>

Joan Montané wrote:

> As far as I can see, my informal request for expanding the current RIS design hasn't had a good response. I understand that. Flags are a cause of disputes, and encoding them isn't an issue for Unicode.

There are technical limitations as well. Because the mechanism is already defined on pairs of symbols, it's not trivial to expand it to three or more symbols.

Earlier, you had written:

> I agree some strange behaviour can appear if a 3 RIS string, take CAT, is shown in a system with only 2 RIS support (a Canadian flag will appear followed by a T).

but in fact, every one of the combinations in the original post will generate incorrect output (if any):

> [S][C][O][T] --> it shows Scottish flag

Seychelles, "undefined"

> [C][Y][M][R][U] --> it shows a Welsh flag

Cyprus, Mauritania, unpaired symbol

> [B][Z][H] --> it shows a Breton flag

Belize, unpaired symbol

> [C][A][T] --> it shows Catalan flag

Canada, unpaired symbol

> [E][U][S] --> it shows a Basque flag

"Undefined" (or European Union if the implementation happens to include an extension to ISO 3166 exceptionally reserved code elements), unpaired symbol

> [G][A][L] --> it shows a Galician flag

Gabon, unpaired symbol

In order to make a system like this work with an arbitrary number of symbols, a terminating symbol would have to be defined. Finding the longest match between a string of symbols and a TLD wouldn't work; someone might really want to encode "Brazil, United States, Sweden, Lesotho" consecutively, and would not want this converted to "Brussels."

And as Ken pointed out, TLDs are TLDs; they are not a general-purpose geographic coding system. They don't include every sub-national region or separatist group, only the ones that Donuts and similar companies chose to register. There's no TLD for Abkhazia, for example, or for ISIS.

> IMHO staying tied to 2-alpha codes is a poor choice for users. Maybe industry manufacturers could find a better approach.

Let's hope that industry manufacturers adhere to the standard instead of going off on their own. I thought that was the idea when all these cell-phone symbols were added to Unicode in the first place.

-- Doug Ewell | Thornton, CO, USA | http://ewellic.org

From mark at macchiato.com Tue Feb 10 10:45:24 2015
From: mark at macchiato.com (Mark Davis ☕)
Date: Tue, 10 Feb 2015 17:45:24 +0100
Subject: Emoji (was: Re: Unicode block for programming related symbols and codepoints?)
In-Reply-To: <62EBF72F-6832-4174-946C-234508DE434D@evertype.com>
References: <20150209131658.665a7a7059d7ee80bb4d670165c8327d.de86ceee88.wbe@email03.secureserver.net> <62EBF72F-6832-4174-946C-234508DE434D@evertype.com>
Message-ID:

We are being pretty conservative about what we add. There are approximately 1,200 emoji characters now (see tr51), and we're anticipating adding perhaps 50 per release. And we are encouraging a "sticker" approach for the longer term.

On the other hand, I wouldn't be surprised if the 41 emoji characters that we are planning on for Unicode 8.0 end up having a higher frequency of use than the other 7K characters in the release.

Mark
*Il meglio è l'inimico del bene*

On Mon, Feb 9, 2015 at 9:36 PM, Michael Everson wrote:

> I like symbols a lot. But I know that I and a number of people have been thinking that too much emphasis is being put on emoji.
>
> Michael Everson * http://www.evertype.com/
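
For readers following the flags thread: the pairing behaviour Doug Ewell describes in his 10 Feb message (regional indicator symbols are consumed two at a time, so a longer run decodes as a series of two-letter codes plus possibly one unpaired symbol) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `to_ris` and `pair_ris` are hypothetical, and real implementations typically do the pairing in text segmentation and font ligature tables rather than in code like this. Whether a given pair maps to an assigned ISO 3166 code (and therefore to a flag) is not checked here.

```python
# Regional indicator symbols occupy U+1F1E6..U+1F1FF, one per letter A..Z.
RIS_FIRST = 0x1F1E6
RIS_LAST = 0x1F1FF


def to_ris(letters):
    """Hypothetical helper: map ASCII letters A..Z to regional indicator symbols."""
    return "".join(chr(RIS_FIRST + ord(c) - ord("A")) for c in letters.upper())


def pair_ris(text):
    """Hypothetical helper: greedy left-to-right pairing of a run of RIS characters.

    Returns (pairs, leftover): the two-letter codes formed by successive
    pairs, and the single unpaired letter if the run has odd length.
    """
    letters = [
        chr(ord(ch) - RIS_FIRST + ord("A"))
        for ch in text
        if RIS_FIRST <= ord(ch) <= RIS_LAST
    ]
    pairs = ["".join(letters[i:i + 2]) for i in range(0, len(letters) - 1, 2)]
    leftover = letters[-1] if len(letters) % 2 else None
    return pairs, leftover


for word in ("SCOT", "CYMRU", "CAT", "BRUSSELS"):
    print(word, pair_ris(to_ris(word)))
```

With this pairing, "SCOT" yields the codes SC and OT, "CAT" yields CA plus an unpaired T, and "BRUSSELS" yields BR, US, SE, LS, which is the "Brazil, United States, Sweden, Lesotho" reading mentioned in the thread.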

From mark at macchiato.com Tue Feb 10 10:48:34 2015
From: mark at macchiato.com (Mark Davis ☕)
Date: Tue, 10 Feb 2015 17:48:34 +0100
Subject: Emoji (was: Re: Unicode block for programming related symbols and codepoints?)
In-Reply-To: <20150209131658.665a7a7059d7ee80bb4d670165c8327d.de86ceee88.wbe@email03.secureserver.net>
References: <20150209131658.665a7a7059d7ee80bb4d670165c8327d.de86ceee88.wbe@email03.secureserver.net>
Message-ID:

> In what character encoding standard, or extension, does ROBOT FACE appear?

Unicode has never been limited to what is in other character encoding standards or extensions, "official" or de facto.

Mark
*Il meglio è l'inimico del bene*

On Mon, Feb 9, 2015 at 9:16 PM, Doug Ewell wrote:

> Shervin Afshar wrote:
>
>>> There is no longer any requirement that the robot faces and burritos appear first in any sort of industry character set extension, with which Unicode is then obliged to maintain compatibility.
>>
>> Only if you don't consider existing usage and popular requests as requirement and precedence; for example Gmail had Robot Face for a long time.
>
> I said there was no longer a requirement *that the items appear first in an industry character set extension*, right?
>
> In what character encoding standard, or extension, does ROBOT FACE appear? "Gmail has it" is not a character encoding standard. Neither is "People want to see it."
>
> "Most popularly requested," as a criterion for adding a character, is absolutely new to Unicode. Earlier I wrote privately to a Unicode officer about whether PERSON TAKING SELFIE and GIRL TWERKING and PERSON DUMPING ICE BUCKET OVER HEAD would be ephemeral enough, and got no reply. (What, you've forgotten the ice-bucket craze already? That's exactly why "most popular at the moment" wasn't supposed to be a criterion.)
>
> --
> Doug Ewell | Thornton, CO, USA | http://ewellic.org

From doug at ewellic.org Tue Feb 10 11:00:17 2015
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 10 Feb 2015 10:00:17 -0700
Subject: Emoji (was: Re: Unicode block for programming related symbols and codepoints?)
Message-ID: <20150210100017.665a7a7059d7ee80bb4d670165c8327d.729df8c844.wbe@email03.secureserver.net>

Shervin Afshar wrote:

>>> The issue is with your very rigid interpretation of the criteria for encoding new symbols. Is "appearing in an industry character set extension" an official phrasing that you keep referring to?
>>
>> It was either from the WG2 Principles and Procedures document, or some other bit of Unicode/10646 folklore that I've read over the past 22 years of keeping up with Unicode/10646. I should look up the exact wording.
>
> Yes, please. I would like to have that policy noted for my future use.

I hadn't said, of course, that no new symbols could ever be encoded unless they appeared in an industry character set or extension. I was responding to a point that Frédéric Grosshans made [1] about these symbols being added for compatibility with Japanese telco usage. That argument could be used for the original emoji set, but not for new emoji; those are supposed to follow the regular criteria.

[1] http://unicode.org/pipermail/unicode/2015-February/001246.html

Here is a passage from TUS 7.0, Section 2.3 that may shed light:

"Conceptually, compatibility characters are characters that would not have been encoded in the Unicode Standard except for compatibility and round-trip convertibility with other standards. Such standards include international, national, and vendor character encoding standards.
For the most part, these are widely used standards that pre-dated Unicode, but because continued interoperability with new standards and data sources is one of the primary design goals of the Unicode Standard, additional compatibility characters are added as the situation warrants.

"Compatibility characters can be contrasted with ordinary (or non-compatibility) characters in the standard--ones that are generally consistent with the Unicode text model and which would have been accepted for encoding to represent various scripts and sets of symbols, regardless of whether those characters also existed in other character encoding standards."

> It's not about encoding what "they" please. Compatibility was the issue with the first set of emoji symbols. The rest of symbols are being added for various other reasons; e.g. diversity, parity, requests, etc.

Right. So the "compatibility with Japanese telcos" argument cannot be used here.

> Also, random JPEG and meme don't apply here and you're mistaken to assume that GChat and Facebook fit in this category.

If you look at the set of new emoji proposed in L2/15-054 [2], you'll see that quite a few of them are justified by their current popularity on the Web. ("Selfie are very popular" was kind of striking. I guess at least one of my predictions was right.)

[2] http://www.unicode.org/L2/L2015/15054r-emoji-tranche5.pdf

>> Great. Go ahead and encode them, UTC. But don't say it's because your hands are tied and you have no choice.
>
> Quoting an official UTC communication?

Quoting an off-list remark.

> For a longer while now, some folks tend to use emoji as means to an end other than what is in the scope of conversation regarding emoji. And that is not acceptable.

Sorry, I don't understand this.

-- Doug Ewell | Thornton, CO, USA | http://ewellic.org

From doug at ewellic.org Tue Feb 10 11:03:06 2015
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 10 Feb 2015 10:03:06 -0700
Subject: Emoji (was: Re: Unicode block for programming related symbols and codepoints?)
Message-ID: <20150210100306.665a7a7059d7ee80bb4d670165c8327d.cd03b4ab63.wbe@email03.secureserver.net>

Mark Davis ☕ wrote:

>> In what character encoding standard, or extension, does ROBOT FACE appear?
>
> Unicode has never been limited to what is in other character encoding standard or extensions, "official" or de facto.

Of course not. But that's been a stated condition for labeling something as "compatibility."

-- Doug Ewell | Thornton, CO, USA | http://ewellic.org

From shervinafshar at gmail.com Tue Feb 10 12:07:17 2015
From: shervinafshar at gmail.com (Shervin Afshar)
Date: Tue, 10 Feb 2015 10:07:17 -0800
Subject: Emoji (was: Re: Unicode block for programming related symbols and codepoints?)
In-Reply-To: <20150210100306.665a7a7059d7ee80bb4d670165c8327d.cd03b4ab63.wbe@email03.secureserver.net>
References: <20150210100306.665a7a7059d7ee80bb4d670165c8327d.cd03b4ab63.wbe@email03.secureserver.net>
Message-ID:

> Of course not. But that's been a stated condition for labeling something as "compatibility."

It *is* compatibility; go back and read my email where I mentioned exactly where it was used.

Shervin

On Tue, Feb 10, 2015 at 9:03 AM, Doug Ewell wrote:

> Mark Davis ☕ wrote:
>
>>> In what character encoding standard, or extension, does ROBOT FACE appear?
>>
>> Unicode has never been limited to what is in other character encoding standard or extensions, "official" or de facto.
>
> Of course not. But that's been a stated condition for labeling something as "compatibility."
>
> --
> Doug Ewell | Thornton, CO, USA | http://ewellic.org

From doug at ewellic.org Tue Feb 10 12:27:32 2015
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 10 Feb 2015 11:27:32 -0700
Subject: Emoji (was: Re: Unicode block for programming related symbols and codepoints?)
Message-ID: <20150210112732.665a7a7059d7ee80bb4d670165c8327d.7b9a049cd6.wbe@email03.secureserver.net>

Shervin Afshar wrote:

>> Of course not. But that's been a stated condition for labeling something as "compatibility."
>
> It *is* compatibility; go back and read my email where I mentioned exactly where it was used.

You mean the one where you said that Gmail has had ROBOT FACE for a long time?

You mean to say that any time Gmail or someone adds a private-use character or embeddable graphic for TOILET PAPER or TIRE IRON or BEER KEG, that Unicode is essentially obliged to add an emoji to maintain compatibility with it?

Well, perhaps that's how it is now. But that isn't the way Unicode used to be.

Fuddily-duddily,

-- Doug Ewell | Thornton, CO, USA | http://ewellic.org

From shervinafshar at gmail.com Tue Feb 10 12:29:43 2015
From: shervinafshar at gmail.com (Shervin Afshar)
Date: Tue, 10 Feb 2015 10:29:43 -0800
Subject: Emoji (was: Re: Unicode block for programming related symbols and codepoints?)
In-Reply-To:
References: <20150210100306.665a7a7059d7ee80bb4d670165c8327d.cd03b4ab63.wbe@email03.secureserver.net>
Message-ID:

> I was responding to a point that Frédéric Grosshans made [1] about these symbols being added for compatibility with Japanese telco usage. That argument could be used for the original emoji set, but not for new emoji; those are supposed to follow the regular criteria.

The compatibility argument can also be applied to major vendors who are using emoji other than Japanese vendors; you can find a list of 20-30 of them here[3]. Add to that list, Facebook and Google. If it is commonly in use, there is a precedent for proposing it for addition to Unicode.

To have an informed, objective conversation, people should first look at the actual criteria[4] (as well as the criteria for encoding symbols[5]) and see if what they are claiming is actually according to the criteria or not.

[3]: http://www.emoji-cheat-sheet.com/
[4]: http://www.unicode.org/reports/tr51/#Selection_Factors
[5]: http://unicode.org/pending/symbol-guidelines.html

> If you look at the set of new emoji proposed in L2/15-054 [2], you'll see that quite a few of them are justified by their current popularity on the Web. ("Selfie are very popular" was kind of striking. I guess at least one of my predictions was right.)
>
> [2] http://www.unicode.org/L2/L2015/15054r-emoji-tranche5.pdf

First of all, these are just proposed and not accepted. Secondly, requests by online communities (either directly to UTC or through corp members) create a precedent for UTC to consider the symbol for encoding.

>> For a longer while now, some folks tend to use emoji as means to an end other than what is in the scope of conversation regarding emoji. And that is not acceptable.
>
> Sorry, I don't understand this.

No worries. I don't blame you. It's just the good ol' circular logic.

Shervin

On Tue, Feb 10, 2015 at 10:07 AM, Shervin Afshar wrote:

>> Of course not. But that's been a stated condition for labeling something as "compatibility."
>
> It *is* compatibility; go back and read my email where I mentioned exactly where it was used.
>
> Shervin
>
> On Tue, Feb 10, 2015 at 9:03 AM, Doug Ewell wrote:
>
>> Mark Davis ☕ wrote:
>>
>>>> In what character encoding standard, or extension, does ROBOT FACE appear?
>>>
>>> Unicode has never been limited to what is in other character encoding standard or extensions, "official" or de facto.
>>
>> Of course not. But that's been a stated condition for labeling something as "compatibility."
>>
>> --
>> Doug Ewell | Thornton, CO, USA | http://ewellic.org

From chris.fynn at gmail.com Tue Feb 10 12:41:23 2015
From: chris.fynn at gmail.com (Christopher Fynn)
Date: Wed, 11 Feb 2015 00:41:23 +0600
Subject: About cultural/languages communities flags
In-Reply-To:
References: <20150209153842.665a7a7059d7ee80bb4d670165c8327d.766f5788f4.wbe@email03.secureserver.net> <54D93EAB.7010409@att.net>
Message-ID:

One area where this would be useful is for indicating national teams in football (soccer), rugby and other sports where England, Scotland, Wales and N. Ireland play separately internationally.

On 10 February 2015 at 12:10, Mark Davis ☕ wrote:

> On Tue, Feb 10, 2015 at 12:11 AM, Ken Whistler wrote:
>
>> for the full context, and for the current 26x26 letter matrix which is the basis for the flag glyph implementations of regional indicator code pairs on smartphones.
>>
>> SC, SO, ST are already taken, but might I suggest putting in for registering "AB" for Alba? That one is currently unassigned.
>>
>> Yeah, yeah, what is the likelihood of BSI pushing for a Scots two-letter code?! But seriously, if folks are planning ahead for Scots independence or even some kind of greater autonomy, this is an issue that needs to be worked, anyway.
>>
>> In the meantime, let me reiterate that there is *no* formal relationship between TLD's and the regional indicator codes in Unicode (or the implementations built upon them). Well, yes, a bunch of registered TLD's do match the country codes, but there is no two-letter constraint on TLD's. This should already be apparent, as Scotland has registered ".scot". At this point there isn't even a limitation of TLD's to ASCII letters, so there is no way to map them to the limited set of regional indicator codes in the Unicode Standard.
>>
>> Not having a two letter country code for Scotland that matches the four letter TLD for Scotland might indeed be a problem for someone, but I don't see *this* as a problem that the Unicode Standard needs to solve.
>
> I want to add to that that there are already a fair number of ISO 2-letter codes for regions that are administered as part of another country, like Hong Kong. There are also codes for crown possessions like Guernsey. So having a code for Scotland (and Wales, and N. Ireland) does not really break precedent. But as Ken says, the best mechanism is for the UK to push for a code in ISO and the UN.
>
> Mark
>
> Il meglio è l'inimico del bene

From shervinafshar at gmail.com Tue Feb 10 12:48:20 2015
From: shervinafshar at gmail.com (Shervin Afshar)
Date: Tue, 10 Feb 2015 10:48:20 -0800
Subject: Emoji (was: Re: Unicode block for programming related symbols and codepoints?)
In-Reply-To: <20150210112732.665a7a7059d7ee80bb4d670165c8327d.7b9a049cd6.wbe@email03.secureserver.net>
References: <20150210112732.665a7a7059d7ee80bb4d670165c8327d.7b9a049cd6.wbe@email03.secureserver.net>
Message-ID:

This thread turns more and more absurd by the email! I apologize to people on the list who have to tolerate this; it might be noisy and annoying, but it is important.

Doug Ewell asked:

> You mean the one where you said that Gmail has had ROBOT FACE for a long time?

Let me use copy-paste for your convenience:

> Robot Face is available on Gmail (GChat), Facebook, and Twitch among others (calculating the size of user community is left as an assignment for the reader). That's enough usage for consideration by the UTC even if the symbol is not present in a character encoding standard.

and then, Doug Ewell wondered:

> You mean to say that any time Gmail or someone adds a private-use character or embeddable graphic for TOILET PAPER or TIRE IRON or BEER KEG, that Unicode is essentially obliged to add an emoji to maintain compatibility with it?

Yes, but the industry is already moving away from character-based solutions and towards sticker-based solutions as we speak. Right now, Facebook is moving in this direction, as well as Line, Trello, and many others. But for things which were added beforehand, there is a precedent for proposing them to Unicode.

> Well, perhaps that's how it is now. But that isn't the way Unicode used to be.

Well...Since you seem to be so keen on Internet memes, here's one[6] for you.

[6]: http://www.quickmeme.com/img/2a/2ab86791fe23ec5c73dc6d46c2cc5bef14e5ca47ba9208571b79c078fb2af561.jpg

Shervin

On Tue, Feb 10, 2015 at 10:27 AM, Doug Ewell wrote:

> Shervin Afshar wrote:
>
>>> Of course not. But that's been a stated condition for labeling something as "compatibility."
>>
>> It *is* compatibility; go back and read my email where I mentioned exactly where it was used.
>
> You mean the one where you said that Gmail has had ROBOT FACE for a long time?
>
> You mean to say that any time Gmail or someone adds a private-use character or embeddable graphic for TOILET PAPER or TIRE IRON or BEER KEG, that Unicode is essentially obliged to add an emoji to maintain compatibility with it?
>
> Well, perhaps that's how it is now. But that isn't the way Unicode used to be.
>
> Fuddily-duddily,
>
> --
> Doug Ewell | Thornton, CO, USA | http://ewellic.org

From joan at montane.cat Tue Feb 10 14:28:19 2015
From: joan at montane.cat (Joan Montané)
Date: Tue, 10 Feb 2015 21:28:19 +0100
Subject: About cultural/languages communities flags
In-Reply-To: <20150210091614.665a7a7059d7ee80bb4d670165c8327d.a342055f13.wbe@email03.secureserver.net>
References: <20150210091614.665a7a7059d7ee80bb4d670165c8327d.a342055f13.wbe@email03.secureserver.net>
Message-ID:

2015-02-10 17:16 GMT+01:00 Doug Ewell :

> In order to make a system like this work with an arbitrary number of symbols, a terminating symbol would have to be defined. Finding the longest match between a string of symbols and a TLD wouldn't work; someone might really want to encode "Brazil, United States, Sweden, Lesotho" consecutively, and would not want this converted to "Brussels."
>
> And as Ken pointed out, TLDs are TLDs; they are not a general-purpose geographic coding system. They don't include every sub-national region or separatist group, only the ones that Donuts and similar companies chose to register. There's no TLD for Abkhazia, for example, or for ISIS.

Well, my proposal to use GeoTLDs is an answer to the question "where do you put the line?" I agree a terminating symbol would help in expanding the RIS system.

>> IMHO staying tied to 2-alpha codes is a poor choice for users. Maybe industry manufacturers could find a better approach.
>
> Let's hope that industry manufacturers adhere to the standard instead of going off on their own. I thought that was the idea when all these cell-phone symbols were added to Unicode in the first place.

I fully agree. Manufacturers must follow standards. I support the standard, but IMHO the RIS design is very strict.

Unicode doesn't define flags.
Unicode doesn't define country flags.
Unicode defines a mechanism to represent flags of ISO countries (and dependent territories).

But manufacturers don't follow ISO country codes 100%; for instance, dependent territories' codes are usually mapped to the country flag [1]. This is a choice made by industry manufacturers, but it's not in ISO. Another choice made by industry is using a private code, like XK for Kosovo; that's good!

The issue with Scotland, Wales, Catalonia and similar flags is a chicken-and-egg situation. If a manufacturer wants to add such flags, the standard doesn't allow it!!! (PUA can be used, of course.) And Unicode doesn't expand RIS because manufacturers don't use these flags.

IMHO the RIS mechanism should be expanded to be more flexible, beyond 2-char RIS. Unicode doesn't define flags, it defines a mechanism. Manufacturers will choose the supported flags, just like they are doing now!

So, the real question here is: where do you put the line? Currently it's put on ISO 3166-1 + some customizations made by industry, but it's always tied to 2-char RIS. IMHO this is too poor for covering real-world use/requests. I suggested using the current ISO country codes + cultural/language TLDs. Maybe there is a better approach.

Best regards,
Joan Montané

[1] https://github.com/googlei18n/region-flags/blob/master/ALIASES

From derhoermi at gmx.net Wed Feb 11 14:39:22 2015
From: derhoermi at gmx.net (Bjoern Hoehrmann)
Date: Wed, 11 Feb 2015 21:39:22 +0100
Subject: Unicode IDNA Compatibility Processing Proposed Update
In-Reply-To: <54DBB951.7040108@unicode.org>
References: <54DBB951.7040108@unicode.org>
Message-ID:

* announcements at unicode.org wrote:
> Oh my...

--
Björn Höhrmann · mailto:bjoern at hoehrmann.de · http://bjoern.hoehrmann.de
D-10243 Berlin · PGP Pub. KeyID: 0xA4357E78 · http://www.bjoernsworld.de
Available for hire in Berlin (early 2015) ·
http://www.websitedev.de/

From asmusf at ix.netcom.com Thu Feb 12 15:47:27 2015
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Thu, 12 Feb 2015 13:47:27 -0800
Subject: sex and emoji
Message-ID: <54DD1F6F.8090002@ix.netcom.com>

An HTML attachment was scrubbed...

From verdy_p at wanadoo.fr Thu Feb 12 22:15:56 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 13 Feb 2015 05:15:56 +0100
Subject: About cultural/languages communities flags
In-Reply-To:
References:
Message-ID:

RIS could represent languages as well, using the BCP47 principle, except that RIS sequences start with an ISO 3166 code (as there's no territory, you'd normally use a 3166 code for an undetermined region, but there's no 3166 code that starts with a hyphen). So to use a BCP47 language tag you could use the hyphen re-encoded to RIS as the first character.

The problem is that language codes in BCP47 have variable sizes. Even if you limit yourself to the ISO639-compatible repertoire (3-letter codes) you'd need to use 4 RIS codes, and the language flags would be represented as RIS(HYPHEN)+RIS(ISO639-3 code).

4 codes would work with font rendering engines that can build 3 successive ligatures from left to right.

If there's no match for a known flag (or if there's an exact multiple of 4 RIS codes), the default glyphs would just show a blank flag frame showing the RIS code converted back to ASCII letters, rendered in a small-capitals style: the first glyph shows the flag's hoist and the first RIS code, i.e. the hyphen; the 2nd and 3rd glyphs show the top/bottom part of the blank frame and the ASCII character; the 4th glyph is similar but adds the flying end of the flag, possibly decorated with a non-rectangular frame. If fewer than 4 RIS codes remain, the flag frame would add the flying end of the flag, with no letter (or just the SPACE). The whole would be in a large dotted frame to exhibit the special format.

These default glyphs are easy to produce in the font. Then to support more languages (7000 languages: 7000 flags? Certainly not so many exist...), you just have to map new ligatures to replace the default ligatures with more accurate "flags".

But my opinion is that "flags" (even if shown generically) are not the right concept for languages: I would highly prefer a "speech bubble frame" like in comics (even if some applications could render a colorful regional flag in it), or the letter code within the "sound waves" of an audio speaker device.

2015-02-09 22:11 GMT+01:00 Joan Montané :

> Hi all,
>
> I am the one who made the request to twemoji Github.
>
> 2015-02-09 20:16 GMT+01:00 Markus Scherer :
>
>> On Mon, Feb 9, 2015 at 9:54 AM, Andrea Giammarchi <andrea.giammarchi at gmail.com> wrote:
>>
>>> > if a cultural/language TLD is typed with Unicode RIS, then show the flag for these culture/language:
>>
>> This does not work. The "Unicode RIS" are defined to be used in pairs, with semantics according to corresponding ISO 3166 alpha2 codes. In your examples, each successive pair will encode a flag.
>
> AFAIK, this is done on the font side. Emoji flags are just ligatures, so a font can provide a ligature for 4 RIS characters. This is not an issue here.
>
> I agree some strange behaviour can appear if a 3 RIS string, take CAT, is shown in a system with only 2 RIS support (a Canadian flag will appear followed by a T).
>
>> If you want to represent every flag of every locality, you first have to figure out how to catalog and label them. You are mentioning provinces, one level down from nation states; I guess there are thousands of them. In much of Europe, every little village has its own flag and coat of arms. Where do you want the text encoding and fonts to stop?
>
> I don't request flag support for every flag in the world. I requested flags for culture/language communities *with* an approved TLD (Top Level Domain).
>
> I know flags are an issue, and I know flags represent territories, not languages, but I think some support should be provided for these active communities. As I pointed out, some country flag collections are expanded with a few non-independent countries. See [1], [2] and [3] (search for Scottish or Welsh flag). You can check this [4] petition requesting a Catalan flag on WhatsApp.
>
> So, there is a demand and they are used in the real world. What's the way to encode them in the Unicode standard?
>
> Thanks,
>
> Joan Montané
>
> [1] http://www.famfamfam.com/lab/icons/flags/
> [2] https://www.gosquared.com/resources/flag-icons/
> [3] http://www.sherv.net/flag-emoticons.html
> [4] https://www.change.org/p/whatsapp-inc-incloure-la-senyera-de-catalunya-a-whatsapp

From srl at icu-project.org Thu Feb 12 23:12:46 2015
From: srl at icu-project.org (Steven R. Loomis)
Date: Thu, 12 Feb 2015 21:12:46 -0800
Subject: About cultural/languages communities flags
In-Reply-To:
References:
Message-ID:

> On Feb 9, 2015, at 1:21 PM, Markus Scherer wrote:
>
> However, I would much prefer if everyone spent their considerable energy on upgrading protocols (e.g., IETF RFCs for email subject lines) and lobby relevant vendors (e.g., chat services & social network messages) to support images embedded in the text stream, ideally with scaling and other behavior that would make them behave somewhat text-like.
This is the "long term solution" listed for emoji in http://www.unicode.org/reports/tr51/#Longer_Term S From verdy_p at wanadoo.fr Thu Feb 12 23:22:42 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 13 Feb 2015 06:22:42 +0100 Subject: About cultural/languages communities flags In-Reply-To: References: Message-ID: Another solution isalso to not extend the scope of use of RIS characters (leave them as they are for ISO3166-1 based codes only), but defne a separate set with "Language Indicator Symbols" (LIS) working the same way, but based on ISO 639-2 or -3 (3-letter codes, accepting also the language family codes also encoded on 3 letters, as well as alll -3 macrolanguages such as "zho" for Chinese or "que" for Quechua). Exactly the same principle as RIS, and as easy to produce with a generic font with very few actual glyphs (on the Ligatures OpenType table may look long, but it can be generated automatically by a basic script, to integrate it in the font build project). No need of complex ligature support, all can work based with a single lookup table of pairs (of glyph ids), simply because there's no need for reordering glyphs. And the default glyph id's for indidual LIS charactes would be mapped to the default building blocks shoiowing the "speech bubble frame" (so a baisc renderer not processing the fonct SUBST tables for ligatures would still produce the basic glyphs and produce a consistant result (even if no decorated bubble would show the colorful and decorated content matching a user-expected "flag" that would be produced in font whose design is based on country/region flags. No requirement by Unicode about how the decorated glyphs will look or about their use or color. Just like fonts with various styles for emojis, the font to use could be a user preference for the reader. No requirement as well to use an OpenType renderer, applications can use icons as well in any convenient graphic format (GIF, PNG, SVG...) 
as long as they fit, in terms of dimensions, within the standard line height (not more than about 1.25 em in height, including top and bottom bearings). No requirement either about their width. Basic font styles (bold, italic) could be rendered as well by the default glyphs, either on their inner letters or on the bubble frame, including for colorful bubbles whose generic "rounded rectangle" frame can also be italicized and emboldened even when it has colorful, complex content.

None of this means that Unicode defines what is a valid language or not. All well-formed triplets are valid, and users are free to use 3-code sequences of LIS to do what they want as long as this respects the known ISO 639 standard (or its history, including retired codes). So it will be well-formed to use LIS codes to "say" yes or YES, with LIS[Y]+LIS[E]+LIS[S] (but if there's an ISO 639 language matching the code "yes", it is also valid to replace it with a bubble showing a culturally associated "flag-like" decoration inside). French users could also use LIS[O]+LIS[U]+LIS[I] to "say" "oui" or "OUI", even if there's another ISO 639 language matching the code "oui" (there's inherently no violation of the per-character identity of LIS characters, as Unicode does not encode ligatures or require them to be used for rendering).

2015-02-13 5:15 GMT+01:00 Philippe Verdy:

> RIS could represent languages as well, using the BCP 47 principle, except that they start with an ISO 3166 code (as there's no territory, you'd normally use a 3166 code for an undetermined region, but there's no 3166 code that starts with a hyphen). So to use a BCP 47 language tag you could use the hyphen re-encoded as RIS as the first character.
> The problem is that language codes in BCP 47 have variable sizes. Even if you limit this to the ISO 639 compatible repertoire (3-letter codes), you'd need to use 4 RIS codes, and the language flags would be represented as RIS(HYPHEN)+RIS(ISO 639-3 code).
>
> 4 codes would work with font rendering engines that can build 3 successive ligatures from left to right.
>
> If there's no match for a known flag (or if there's an exact multiple of 4 RIS codes), the default glyphs would just show a blank flag frame showing the RIS codes converted back to ASCII letters, rendered in a small-capitals style: the first glyph shows the flag's hoist and the first RIS code (i.e. the hyphen), the 2nd and 3rd glyphs show the top/bottom parts of the blank frame and the ASCII character, and the 4th glyph is similar but adds the flying end of the flag, possibly decorated with a non-rectangular frame. If fewer than 4 RIS codes remain, the flag frame would add the flying end of the flag, with no letter (or just the SPACE). The whole would be in a large dotted frame to exhibit the special format.
>
> These default glyphs are easy to produce in the font. Then to support more languages (7000 languages: 7000 flags? Certainly not so many exist...), you just have to map new ligatures to replace the default ligatures with more accurate "flags".
>
> But my opinion is that "flags" (even if showing them generically) are not the good concept for languages. I would highly prefer a "speech bubble frame" like in comics (even if some applications could render a colorful regional flag in them), or the letter code within the "sound waves" of an audio speaker device.
>
> 2015-02-09 22:11 GMT+01:00 Joan Montané:
>
>> Hi all,
>>
>> I am the one who made the request on the twemoji GitHub.
>>
>> 2015-02-09 20:16 GMT+01:00 Markus Scherer:
>>
>>> On Mon, Feb 9, 2015 at 9:54 AM, Andrea Giammarchi <andrea.giammarchi at gmail.com> wrote:
>>>
>>>> > if a cultural/language TLD is typed with Unicode RIS, then show the flag for these culture/language:
>>>
>>> This does not work. The "Unicode RIS" are defined to be used in pairs, with semantics according to corresponding ISO 3166 alpha2 codes.
>>> In your examples, each successive pair will encode a flag.
>>
>> AFAIK, this is done on the font side. Emoji flags are just ligatures, so a font can provide a ligature for 4 RIS characters. This is not an issue here.
>>
>> I agree some strange behaviour can appear if a 3-RIS string, take CAT, is shown on a system with only 2-RIS support (a Canadian flag will appear, followed by a T).
>>
>>> If you want to represent every flag of every locality, you first have to figure out how to catalog and label them. You are mentioning provinces, one level down from nation states; I guess there are thousands of them. In much of Europe, every little village has its own flag and coat of arms. Where do you want the text encoding and fonts to stop?
>>
>> I don't request flag support for every flag in the world. I requested flags for culture/language communities *with* an approved TLD (Top Level Domain).
>>
>> I know flags are an issue, and I know flags represent territories, not languages, but I think some support should be provided for these active communities. As I pointed out, some country flag collections are expanded with a few non-independent countries. See [1], [2] and [3] (search for the Scottish or Welsh flag). You can check this [4] petition requesting a Catalan flag on WhatsApp.
>>
>> So, there is demand, and they are used in the real world. What is the way to encode them in the Unicode Standard?
>>
>> Thanks,
>>
>> Joan Montané
>>
>> [1] http://www.famfamfam.com/lab/icons/flags/
>> [2] https://www.gosquared.com/resources/flag-icons/
>> [3] http://www.sherv.net/flag-emoticons.html
>> [4] https://www.change.org/p/whatsapp-inc-incloure-la-senyera-de-catalunya-a-whatsapp
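
The pairing behaviour that Markus and Joan describe above (Regional Indicator Symbols are read left to right in pairs, so that C-A-T comes out as a Canadian flag followed by a stray T) can be illustrated with a minimal Python sketch. The helper names `to_ris`, `from_ris` and `pair_up` are made up for this illustration; real systems get the same effect through font ligatures or text segmentation, not through code like this.

```python
from typing import List

# Regional Indicator Symbols occupy U+1F1E6..U+1F1FF, one per letter A..Z.
RIS_BASE = 0x1F1E6


def to_ris(code: str) -> str:
    """Map ASCII letters to Regional Indicator Symbols (hypothetical helper)."""
    out = []
    for ch in code.upper():
        if not "A" <= ch <= "Z":
            raise ValueError(f"not an ASCII letter: {ch!r}")
        out.append(chr(RIS_BASE + ord(ch) - ord("A")))
    return "".join(out)


def from_ris(ch: str) -> str:
    """Map one Regional Indicator Symbol back to its ASCII letter."""
    return chr(ord(ch) - RIS_BASE + ord("A"))


def pair_up(ris_run: str) -> List[str]:
    """Group a run of RIS characters into successive pairs, left to right.

    An odd trailing symbol is left on its own, which is the stray letter
    described in the thread.
    """
    letters = [from_ris(c) for c in ris_run]
    return ["".join(letters[i:i + 2]) for i in range(0, len(letters), 2)]


print(pair_up(to_ris("CAT")))   # ['CA', 'T']
print(pair_up(to_ris("CATS")))  # ['CA', 'TS']
```

Under this left-to-right pairing, a three-symbol run can never be read as a single three-letter code, which is why a separate convention (or a font that ligates longer runs, as Joan suggests) would be needed.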
From cjsvance at gmail.com Fri Feb 13 00:04:42 2015
From: cjsvance at gmail.com (Christopher Vance)
Date: Fri, 13 Feb 2015 17:04:42 +1100
Subject: About cultural/languages communities flags

With ISO 3166, there's almost always an objective answer to "what is the flag?". UA may be breaking up, but many of those opposed to the Kyiv government would prefer not to be in UA anyway. Sometimes there's a dispute as to which group is running a country, like in SY at the moment, but I'm guessing few would yet claim it's time to change the flag there. EH may be a problem.

For languages, there's often no objective answer, unless you ask "which country has the most speakers?", and then you'd have to ask about first language vs second/third/etc. What flag for English? India, UK, US, or something else? What about sub-national languages? I have been told there are more Tokelauans (and therefore, to a first approximation, speakers of Tokelauan) in Wellington, NZ, than there are in Tokelau itself. Which flag for them?

On Fri, Feb 13, 2015 at 4:22 PM, Philippe Verdy wrote:

> Another solution is to not extend the scope of use of RIS characters (leave them as they are, for ISO 3166-1 based codes only), but to define a separate set of "Language Indicator Symbols" (LIS) working the same way, but based on ISO 639-2 or -3 (3-letter codes, also accepting the language family codes encoded on 3 letters, as well as all the 639-3 macrolanguages such as "zho" for Chinese or "que" for Quechua).
>
> Exactly the same principle as RIS, and as easy to produce with a generic font with very few actual glyphs (the ligatures table in the OpenType font may look long, but it can be generated automatically by a basic script and integrated into the font build project). No need for complex ligature support: everything can work with a single lookup table of pairs (of glyph IDs), simply because there's no need for reordering glyphs.
--
Christopher Vance

From pandey at umich.edu Fri Feb 13 01:03:01 2015
From: pandey at umich.edu (Anshuman Pandey)
Date: Fri, 13 Feb 2015 02:03:01 -0500
Subject: sex and emoji
In-Reply-To: <54DD1F6F.8090002@ix.netcom.com>
References: <54DD1F6F.8090002@ix.netcom.com>

Never would have imagined 'sex' and 'Unicode' in the memetic scene, but a big ol' ?? to the UTC! Kudos, rather ??.

> On Feb 12, 2015, at 4:47 PM, Asmus Freytag wrote:
>
> To quote: "While this probably isn't news to fans of the eggplant emoji, ...."
>
> More here:
>
> http://time.com/3694763/match-com-dating-survey-emoji-sex/
>
> A./

From verdy_p at wanadoo.fr Fri Feb 13 05:04:54 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 13 Feb 2015 12:04:54 +0100
Subject: About cultural/languages communities flags

2015-02-13 7:04 GMT+01:00 Christopher Vance:

> With ISO 3166, there's almost always an objective answer to "what is the flag?". UA may be breaking up, but many of those opposed to the Kyiv government would prefer not to be in UA anyway. Sometimes there's a dispute as to which group is running a country, like in SY at the moment, but I'm guessing few would yet claim it's time to change the flag there. EH may be a problem.
>
> For languages, there's often no objective answer, unless you ask "which country has the most speakers?", and then you'd have to ask about first language vs second/third/etc. What flag for English? India, UK, US, or something else? What about sub-national languages? I have been told there are more Tokelauans (and therefore, to a first approximation, speakers of Tokelauan) in Wellington, NZ, than there are in Tokelau itself. Which flag for them?

This is completely a non-issue with the Unicode standard itself. There's ample space to use various designs that match character properties as well as user expectations *without* breaking the character identity itself.

So even if the US flag is often used for English, on British sites they will use the British flag. In the Republic of Ireland they won't use the Irish flag for the English language (preferred for the Irish language itself) and will be unlikely to use the British flag.
In South Africa or India too, they won't use their national flag for English (multiple official languages there, and English is not even the preferred language). In those last cases they will prefer a neutral flag with just the letters "en" to the alternative of the US flag, or they will use a "patchwork" flag mixing the US flag and the British flag.

It's up to applications to use the set of glyphs that is appropriate for their own users, or to offer them a choice of fonts or icon sets, either in the UI of their input method or on keyboards (even physical keyboards, if they can display icons on small displays on top of the keycaps, or on a row of virtual keys added on a touch display panel above the keyboard, with the appropriate drivers installed to support the secondary display adapter and touch device), or up to vendors to sell stickers or custom keycaps.

Applications can also offer the same choice as a preference in their text renderer (or web browser). Word processors can also offer it in their font selector, for those who want to produce preset documents with a design determined by the author, by the web designers, or by some predetermined graphic charter for collective works.

This choice can include prefilled sets matching several common cultures, or various styles (such as flat rectangular flags, free-flying flags, or basic text in a blank flag frame).

If users don't want to see the official national flags but prefer to see another icon matching their culture (including objects such as an Eiffel Tower for France, the logo of their regional council, the logo of the regional capital, or a small locator map of the region), they can do so.

All this remains valid for "flags" used to represent ISO regions, but as well will be valid for
From shervinafshar at gmail.com Fri Feb 13 09:20:49 2015
From: shervinafshar at gmail.com (Shervin Afshar)
Date: Fri, 13 Feb 2015 07:20:49 -0800
Subject: sex and emoji
In-Reply-To: <54DD1F6F.8090002@ix.netcom.com>
References: <54DD1F6F.8090002@ix.netcom.com>

Related opinion piece: "Are you a smug emoji snob? Chances are you're not getting laid" http://gu.com/p/45n8e/stw

On Feb 12, 2015 1:52 PM, "Asmus Freytag" wrote:

> To quote: "While this probably isn't news to fans of the eggplant emoji, ...."
>
> More here:
>
> http://time.com/3694763/match-com-dating-survey-emoji-sex/
>
> A./

From shervinafshar at gmail.com Fri Feb 13 09:37:13 2015
From: shervinafshar at gmail.com (Shervin Afshar)
Date: Fri, 13 Feb 2015 07:37:13 -0800
Subject: About cultural/languages communities flags

On Feb 13, 2015 3:12 AM, "Philippe Verdy" wrote:

> This is completely a non-issue with the Unicode standard itself. There's ample space to use various designs that match character properties as well as user expectations *without* breaking the character identity itself. So even if the US flag is often used for English, on British sites they will use the British flag. In the Republic of Ireland they won't use the Irish flag for the English language (preferred for the Irish language itself) and will be unlikely to use the British flag. In South Africa or India too, they won't use their national flag for English (multiple official languages there, and English is not even the preferred language).

Are these statements about the use of flags for language selectors on websites based on some UX study, survey, or commonly accepted guideline, or are they just speculation?
From kenwhistler at att.net Fri Feb 13 10:12:51 2015
From: kenwhistler at att.net (Ken Whistler)
Date: Fri, 13 Feb 2015 08:12:51 -0800
Subject: Language tags redux (was: Re: About cultural/languages communities flags)
Message-ID: <54DE2283.8090407@att.net>

Philippe may have overlooked the fact that this has been tried (years ago) in the Unicode Standard. See: language tags.

http://www.unicode.org/versions/Unicode7.0.0/ch23.pdf#G26419

The syntax for those even goes beyond just ISO 639-2/3 to incorporate the full range of BCP 47 tags, in principle.

But the catch is that the language tag characters ended up *deprecated*, precisely because attempting to do this kind of thing in plain text is the wrong thing to do -- it interferes with the level-appropriate language tagging mechanisms available in markup.

I see no point in speculating about reinventing this particular broken wheel one more time for the Unicode Standard.

--Ken

On 2/12/2015 9:22 PM, Philippe Verdy wrote:

> Another solution is to not extend the scope of use of RIS characters (leave them as they are, for ISO 3166-1 based codes only), but to define a separate set of "Language Indicator Symbols" (LIS) working the same way, but based on ISO 639-2 or -3 (3-letter codes, also accepting the language family codes encoded on 3 letters, as well as all the 639-3 macrolanguages such as "zho" for Chinese or "que" for Quechua).
>
> None of this means that Unicode defines what is a valid language or not. All well-formed triplets are valid, and users are free to use 3-code sequences of LIS to do what they want as long as this respects the known ISO 639 standard (or its history, including retired codes). ...
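
For readers who have not met the mechanism Ken refers to: the language tag scheme spelled out a language tag (for example "fr") in invisible tag characters, starting with U+E0001 LANGUAGE TAG and followed by one character from U+E0020..U+E007E for each ASCII character of the tag. A minimal Python sketch of that mapping follows. The function names are hypothetical, and the sketch only illustrates the deprecated scheme; it is not a recommendation to use it.

```python
LANGUAGE_TAG = 0xE0001  # U+E0001 LANGUAGE TAG introduces the tag
TAG_BASE = 0xE0000      # each tag character is TAG_BASE + an ASCII code point


def encode_language_tag(tag: str) -> str:
    """Spell an ASCII language tag in tag characters (hypothetical helper)."""
    for ch in tag:
        if not 0x20 <= ord(ch) <= 0x7E:
            raise ValueError(f"not a printable ASCII character: {ch!r}")
    return chr(LANGUAGE_TAG) + "".join(chr(TAG_BASE + ord(ch)) for ch in tag)


def decode_language_tag(text: str) -> str:
    """Read back the tag at the start of text; stop at the first non-tag character."""
    if not text or ord(text[0]) != LANGUAGE_TAG:
        raise ValueError("text does not start with U+E0001 LANGUAGE TAG")
    chars = []
    for ch in text[1:]:
        cp = ord(ch)
        if not 0xE0020 <= cp <= 0xE007E:
            break
        chars.append(chr(cp - TAG_BASE))
    return "".join(chars)


encoded = encode_language_tag("fr")
print([f"U+{ord(c):04X}" for c in encoded])      # ['U+E0001', 'U+E0066', 'U+E0072']
print(decode_language_tag(encoded + "bonjour"))  # fr
```

Because these characters have no visible glyph, the tag travels inside the plain text itself, which is the overlap with markup-level language tagging that Ken objects to.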

From verdy_p at wanadoo.fr Fri Feb 13 11:09:52 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 13 Feb 2015 18:09:52 +0100
Subject: Language tags redux (was: Re: About cultural/languages communities flags)
In-Reply-To: <54DE2283.8090407@att.net>
References: <54DE2283.8090407@att.net>

I do not propose it as "language markup" but only as *visible* icons (independent of the language markup used in the text), similar to the RIS icons in the emoji set. This is *not* the same usage. In other words, these icons may be rendered with *translated* labels inside, or localized to the appropriate culture (just like flag icons), to represent the same "referenced language" (not necessarily the same as the "used language" of the document, which is indicated by the language markup).

2015-02-13 17:12 GMT+01:00 Ken Whistler:

> Philippe may have overlooked the fact that this has been tried (years ago) in the Unicode Standard. See: language tags.
>
> http://www.unicode.org/versions/Unicode7.0.0/ch23.pdf#G26419
>
> The syntax for those even goes beyond just ISO 639-2/3 to incorporate the full range of BCP 47 tags, in principle.
>
> But the catch is that the language tag characters ended up *deprecated*, precisely because attempting to do this kind of thing in plain text is the wrong thing to do -- it interferes with the level-appropriate language tagging mechanisms available in markup.
>
> I see no point in speculating about reinventing this particular broken wheel one more time for the Unicode Standard.
>
> --Ken

From verdy_p at wanadoo.fr Fri Feb 13 11:13:10 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 13 Feb 2015 18:13:10 +0100
Subject: About cultural/languages communities flags

This is just the experience of visiting sites that commonly use these flags to represent languages (inappropriately) *visually*. And even if it is not the best way to represent languages, this is what happens (Unicode cannot interfere with freedom of speech and the choice of authors if they prefer visual icons to plain words).

2015-02-13 16:37 GMT+01:00 Shervin Afshar:

> On Feb 13, 2015 3:12 AM, "Philippe Verdy" wrote:
>
>> This is completely a non-issue with the Unicode standard itself. There's ample space to use various designs that match character properties as well as user expectations *without* breaking the character identity itself. So even if the US flag is often used for English, on British sites they will use the British flag.
>> In the Republic of Ireland they won't use the Irish flag for the English language (preferred for the Irish language itself) and will be unlikely to use the British flag. In South Africa or India too, they won't use their national flag for English (multiple official languages there, and English is not even the preferred language).
>
> Are these statements about the use of flags for language selectors on websites based on some UX study, survey, or commonly accepted guideline, or are they just speculation?

From shervinafshar at gmail.com Fri Feb 13 11:41:15 2015
From: shervinafshar at gmail.com (Shervin Afshar)
Date: Fri, 13 Feb 2015 09:41:15 -0800
Subject: Use of Flags as Language Identifier on the Web (was: About cultural/languages communities flags)

I'm neither proposing nor implying what should or should not be done, or whether Unicode can or cannot interfere with anything anywhere. I'm just curious about the use of flags in language selectors or as visual language identifiers on websites, which you wrote about.

I know of some organizations that strictly avoid using flags altogether to represent languages. Did you encounter that during your research?

Also, do you have your research on this matter documented somewhere else, so I can refer my colleagues in i18n to it?

-- Shervin

On Fri, Feb 13, 2015 at 9:13 AM, Philippe Verdy wrote:

> This is just the experience of visiting sites that commonly use these flags to represent languages (inappropriately) *visually*. And even if it is not the best way to represent languages, this is what happens (Unicode cannot interfere with freedom of speech and the choice of authors if they prefer visual icons to plain words).
>
> 2015-02-13 16:37 GMT+01:00 Shervin Afshar:
>
>> On Feb 13, 2015 3:12 AM, "Philippe Verdy" wrote:
>>
>>> This is completely a non-issue with the Unicode standard itself.

From verdy_p at wanadoo.fr Fri Feb 13 12:20:23 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 13 Feb 2015 19:20:23 +0100
Subject: Use of Flags as Language Identifier on the Web (was: About cultural/languages communities flags)

There are many examples, notably on the home pages of a lot of commercial sites (in their top bar), in the startup selectors of many mobile apps or popular games, and in various other places, including translation tools or catalogues of dictionaries. Many printed dictionaries show these flags on their cover, including well-known ones from famous brands such as Harraps or Larousse. They also appear on the official sites of various tourism information offices and museums, and on their printed leaflets. They do not support all languages with accurate translations, but they give a visual choice or indicator of the language this way.

Many physical products use these flags on their printed labels or boxes and on enclosed leaflets, to list the components used or describe their use, as this saves space on the limited area of the label or box.
Most people cannot identify standard language codes correctly, but they recognize the flag commonly used to designate their language. These icons also replace bullet separators because of their visual impact; they are true symbols acting like punctuation, but more visible, so they allow saving newlines as well.

Even if country flags are not culturally neutral for those languages, they are very often sufficient for the few listed languages. And with the same frequency we see packaging showing country codes instead of language codes. When producers realize that country flags are too culturally or politically oriented and do not want to show them, they will just use region codes, more or less decorated (not always standard ISO codes, but like those on car plates).

These uses are in fact very old, predating the standardisation of language codes; they have not disappeared and likely will not in any expected short time frame.

Now, with the internet available around the world, massively advertised and used daily at multiple times or for multiple activities, people know their country code but still not their language code...

On 13 Feb 2015 18:42, "Shervin Afshar" wrote:

> I'm neither proposing nor implying what should or should not be done, or whether Unicode can or cannot interfere with anything anywhere. I'm just curious about the use of flags in language selectors or as visual language identifiers on websites, which you wrote about.
>
> I know of some organizations that strictly avoid using flags altogether to represent languages. Did you encounter that during your research?
>
> Also, do you have your research on this matter documented somewhere else, so I can refer my colleagues in i18n to it?
>
> -- Shervin
>
> On Fri, Feb 13, 2015 at 9:13 AM, Philippe Verdy wrote:
>
>> This is just the experience of visiting sites that commonly use these flags to represent languages (inappropriately) *visually*.
>> And even if it is not the best way to represent languages, this is what
>> happens (Unicode cannot interfere with the freedom of speech and the
>> choice of authors if they prefer visual icons to plain words).
>>
>> 2015-02-13 16:37 GMT+01:00 Shervin Afshar :
>>
>>> On Feb 13, 2015 3:12 AM, "Philippe Verdy" wrote:
>>>
>>> > This is completely a non-issue with the Unicode standard itself.
>>> There's ample space to use various designs that match character
>>> properties as well as user expectations *without* breaking the character
>>> identity itself. So even if the US flag is often used for English,
>>> British sites will use the British flag. In the Republic of Ireland
>>> they won't use the Irish flag for the English language (it is preferred
>>> for the Irish language itself) and are unlikely to use the British flag.
>>> In South Africa or India too, they won't use their national flag for
>>> English (multiple official languages there, and English is not even the
>>> preferred language).
>>>
>>> Are these statements about use of flags for language selectors on
>>> websites, based on some UX study, survey, or commonly accepted guideline,
>>> or are they just speculations?

From shervinafshar at gmail.com Fri Feb 13 13:37:32 2015
From: shervinafshar at gmail.com (Shervin Afshar)
Date: Fri, 13 Feb 2015 11:37:32 -0800
Subject: Use of Flags as Language Identifier on the Web (was: About cultural/languages communities flags)
In-Reply-To:
References:
Message-ID:

Some of what you mentioned is relevant to the general topic in a very
broad sense, but not relevant to the focus of the conversation we're having
here; e.g. saving space in package design, replacing bullet separators,
etc. Although not relevant to the conversation, still as an i18n
practitioner, I'd like to see them in a document with some figures and some
references.
See this[1] as an exquisite example.

> These uses are in fact very old, predating the standardisation of language
> codes; they have not disappeared and likely will not in the near future.

Is there an example of a multilingual document pre-dating ISO/TC 37 and
ISO/R 639 which uses flags to distinguish text in different languages?

> Most people cannot identify standard language codes correctly but recognize
> the flag commonly used to designate their language.
> [...]
> Even if country flags are not culturally neutral for those languages they
> are very often sufficient for the few listed languages.

I agree with what you're saying about language codes being sometimes
obscure to the common user. I also agree with what you said yesterday in
the other thread about flags not being good to visually represent
languages:

On Thu, Feb 12, 2015 at 8:15 PM, Philippe Verdy wrote:

> But my opinion is that "flags" (even if showing them generically) are not
> the good concept for languages

All said and done, it seems to me there are always better ways to represent
languages in software UIs. A very large-scale and illustrative example is
the Wikimedia Foundation's Universal Language Selector[2]. It is used on
most WMF projects to switch between hundreds of languages, and it uses
neither flags nor language codes in its UI. See the design notes[3].

[1]: http://www.w3.org/TR/jlreq/
[2]: https://www.mediawiki.org/wiki/Universal_Language_Selector
[3]: https://www.mediawiki.org/w/index.php?title=Universal_Language_Selector/Interaction_Design_Framework#Iconography_to_represent_languages
-- Shervin

From verdy_p at wanadoo.fr Fri Feb 13 16:33:17 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 13 Feb 2015 23:33:17 +0100
Subject: Use of Flags as Language Identifier on the Web (was: About cultural/languages communities flags)
In-Reply-To:
References:
Message-ID:

2015-02-13 20:37 GMT+01:00 Shervin Afshar :

> Some of what you mentioned is relevant to the general topic in a very
> broad sense, but not relevant to the focus of the conversation we're having
> here; e.g. saving space in package design, replacing bullet separators,
> etc. Although not relevant to the conversation, still as an i18n
> practitioner, I'd like to see them in a document with some figures and some
> references. See this[1] as an exquisite example.
>
>> These uses are in fact very old, predating the standardisation of language
>> codes; they have not disappeared and likely will not in the near future.
>
> Is there an example of a multilingual document pre-dating ISO/TC 37 and
> ISO/R 639 which uses flags to distinguish text in different languages?

My sentence was more generic than that. It was about the old practice of
using things that identify countries/regions when the real intent is to
represent languages, independently of the regions where the language is
supposed to be "mostly" spoken (an assumption that is false for languages
spoken far more in other places than in their native region).
So various things associated with places (rather than languages) have been
used and continue to be used:

* more or less abbreviated country/region names (often altered locally, or
  using imaginative/poetic descriptions at best, or frequently using
  insulting slang words for these regions' names)
* the standard name of these regions (even if the language is no longer
  spoken there; this has the side effect that those who speak the language
  today are considered "strangers" within their current country)
* the new name of the region once it has become occupied by another ruler
  (the old name, used when that region was still self-governing, being
  prohibited)
* iconic representations of various objects typical of the region (e.g.
  an icon of the Eiffel Tower to designate Paris or France, an iconic
  representation of the Colosseum to represent Rome or Italy, or the Tower
  of Pisa as well, or a pyramid to represent Egypt) as a way to designate
  the language that is mostly spoken there or originates from there;
  well-known monuments in the region are the most used
* but you'll also see (notably in sports) a frog or a peacock to represent
  France, and other natural elements symbolizing historical events in the
  nations of the UK. Frequently these elements may also be part of today's
  flags (e.g. the maple leaf for Canada, the ermine for Brittany)
* flags **of course** for these regions (but there are disagreements about
  the choice of flag, as well as about the geographical border of the
  region where that language is spoken or originates)
* coats of arms
* national colors in some arrangement (far from the actual form of the
  flag, even if the flag includes these colors)
* iconic representations of the region's borders (often only the borders
  remaining in today's countries)
* religious and esoteric symbols
* other non-iconic symbols of these regions (flags are not the only
  official symbols of today's countries): it could be some notes of an
  anthem, or a famous song or piece of music from a musician of that region
  (which European country do you think three apples may mean in Romance
  countries? You have to think about it phonetically; and then with which
  European language will you associate these three apples?)
* photos, portraits, or sculptures of famous persons from that region,
  notably the most famous artists (e.g. look into the per-language
  categories of the "Languages" category on several editions of
  Wiktionary); frequently these are poets, writers, playwrights
* common phrases attributing an object to the country or region (a
  convention used in East Asian regions, replacing country names without
  relying on any phonological similarity). Those phrases are also depicted
  iconically on their flags (e.g. Japan).
...

In all those cases, there's a common confusion between designating regions
and designating languages. Politically, it seems that most countries want
to tie their concept of nation, and the associated territory, to a
language, and want that language to be named the same way they name the
region. So most frequently, the demonyms ("gentilés") derived from the
region name to designate the people of that region are used as adjectives
qualifying everything used by people of or from that region, and this
includes their language.

Human history has, over many centuries, a huge record of dramatic events
caused by this confusion of cultures/languages/peoples with regions, made
by their current winning rulers as well as by occupiers and occupied
countries. This is still the case today, and new events come almost every
day to remind us of it. This contaminates the basic concept of "nation" and
even the way we write and pronounce languages.
From shervinafshar at gmail.com Fri Feb 13 16:46:05 2015
From: shervinafshar at gmail.com (Shervin Afshar)
Date: Fri, 13 Feb 2015 14:46:05 -0800
Subject: Use of Flags as Language Identifier on the Web (was: About cultural/languages communities flags)
In-Reply-To:
References:
Message-ID:

I see. It all makes sense to me now. For some reason, I was under the
impression that we were talking about flags and language codes here.

-- Shervin

From verdy_p at wanadoo.fr Sat Feb 14 07:53:16 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sat, 14 Feb 2015 14:53:16 +0100
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To: <54D7DCDD.9060003@colson.eu>
References: <54D7C3EA.6080000@web.de> <54D7DCDD.9060003@colson.eu>
Message-ID:

But the TAB is still the whitespace character you describe, and it is the
one accepted by the programming languages that use it. Defining a new code
point would require the lexical analyzers of these languages to be modified
(that is, you would be modifying the languages themselves). Since the
lexical items these languages accept for the functions you describe form a
very closed set, you cannot simply substitute another character for them.

All you describe is a matter of UI design for code editors, which will
still scan the edited sources for TABs, and not for your custom character,
in order to display them in a custom way according to the programmer's
preferences.
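To make the point above concrete (an editor changes only how a TAB is
shown, never the character stored in the source), here is a minimal Python
sketch. The function name render_tabs, the default 4-column tab stop and
the U+2192 arrow used as a marker are all assumptions made for this
illustration; nothing like them was proposed in the thread.

```python
def render_tabs(source: str, tab_width: int = 4, marker: str = "\u2192") -> str:
    """Return a display-only copy of `source` with each TAB drawn as a marker.

    The input string is never modified: the TAB characters stay in the
    source text, and only the rendered copy shows a visible symbol.
    (Hypothetical helper, written for illustration only.)
    """
    rendered_lines = []
    for line in source.splitlines():
        column = 0
        cells = []
        for ch in line:
            if ch == "\t":
                # Advance to the next tab stop; draw the marker in the
                # first cell and pad the remaining cells with spaces.
                width = tab_width - (column % tab_width)
                cells.append(marker + " " * (width - 1))
                column += width
            else:
                cells.append(ch)
                column += 1
        rendered_lines.append("".join(cells))
    return "\n".join(rendered_lines)
```

The design choice matters here: because the marker exists only in the
rendered copy, compilers and lexers keep seeing an ordinary U+0009, and no
new code point is involved.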
We are in fact not talking about character identities. The only
significant identity here is that of the original characters in the source
text, and a code editor will not alter it even if it displays them
differently: it only "displays" them, it does not replace them, unless the
programmer actually makes a change to the source code (such as reindenting,
compressing whitespace, or running a source code beautifier/reformatter).
Such a reformatter is safe to use ONLY if the editor recognizes not just
the source characters but also the syntax of the source language. So it
must not only be able to read and scan the source, it must also know which
programming language you are using. Generally it uses the file extension of
the source file, but if you have not yet given your source a filename by
saving it (or by adding a node to the source tree in your IDE), you can
still select the programming language in the editor's menu.

The same editor can then present the source program in any convenient
presentation that matches the expectations and needs of the programmers
using it: it will typically provide syntax coloring, and it will
group/ungroup blocks of source lines by detecting the syntax used to
delimit blocks (punctuation, begin/end keywords, indentation, statement
separators or operators, operator precedence...).

The presentation will never depend on your new "character". A new symbolic
character is not the only, or the best, way to present the structure of a
program, because what programmers need sits at a higher level than isolated
characters: it is based on the upper-level syntax of programming blocks,
statements and operations. The program can then be presented in a tree view
listing nodes with sorted lists of properties, where a property value can
itself be another tree.
A tree is also not the only option: you could just as well have
rectangular blocks that you can expand or collapse, appearing as multiline
blocks of rich text containing other blocks. Additionally, there could be
several superposed structures that are not hierarchically nested (e.g. one
for a line-based preprocessor, another for the code as it would be
understood by the next layer, after the preprocessing layer).

And even in programming languages, there exist structures that do not obey
a strict hierarchy, e.g. SGML and HTML, where elements can freely close the
scope of /many/ previously opened /blocks/, and not just the one at the top
of the stack. When you close an element that is not at the top of the
stack, the existing top of stack /may/ remain at the top, or could be
closed implicitly, according to complex matching rules (which depend on the
properties of all elements in the stack between the element you are
explicitly closing and the element at the top of the stack).

2015-02-08 23:02 GMT+01:00 Jean-François Colson :

> On 08/02/15 22:32, Pierpaolo Bernardi wrote:
> > On Sun, Feb 8, 2015 at 9:15 PM, Alfred Zett wrote:
> > [...]
> >
> > -- unlike tabs or space, it wouldn't be whitespace
> > [...]
> >
> > a Tab is exactly what you described.
>
> Not exactly: a tab IS whitespace.
> It may sometimes be displayed in a different color or with a special
> symbol on request if the editor allows it, but in most cases it is
> whitespace.

From verdy_p at wanadoo.fr Sat Feb 14 08:23:10 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sat, 14 Feb 2015 15:23:10 +0100
Subject: Unicode block for programming related symbols and codepoints?
In-Reply-To:
References: <54D7C3EA.6080000@web.de> <54D7E2E2.6080705@web.de>
Message-ID:

2015-02-08 23:54 GMT+01:00 Pierpaolo Bernardi :

> On Sun, Feb 8, 2015 at 11:27 PM, Alfred Zett wrote:
> > That was exactly my thought, so I figured it couldn't harm to have these
> >
> >> a Tab is exactly what you described.
> >
> > No. It's only half of what I described.
> > It's still a typographical character that implies whitespace and may
> > appear everywhere in the text.
>
> How would your proposed character be displayed as plain text?

Your new language will have to invent another syntax for exporting and
serializing its native source into a plain text file. It will certainly use
an escaping syntax (such as the common use of backslashes), but that syntax
will be a traditional syntax for traditional programs, and standard ASCII
or UTF-8 encodings using standard characters will be quite sufficient.

Your programming tool will need a separate serializer and a separate
parser for that alternate syntax, or it could reuse existing ones (such as
XML and JSON serializers and parsers, or existing generic libraries
handling rich text documents containing embedded collections, with an API
more or less like the DOM APIs, offered with an adapting layer of
"bindings" for lots of other languages, with a binary interface, an
SQL-like interface, or other convenient interfaces such as common
collections and associative arrays, or containers like ZIP/JAR files).

The programmer will in fact not have to edit these complex source files,
but may look inside them with tricky tools that could corrupt their
internal structure of references. They will just use the specific IDE made
for your language, will select a file or resource (e.g.
a network service) using that custom syntax; it will be loaded (or queried)
so that they can edit the viewable and editable parts of the program, while
much of the internal data used in the native format (notably the purely
internal references and pointers) will be hidden from them and will change
without notice, preserving the intended structure of your language.

In many modern environments, in fact, a single programmer cannot reprogram
the whole project but can only edit some parts of it. There are privileged
operations (reserved to some groups of users), and some parts change in
parallel and are edited by teams of programmers/designers/proofreaders,
which requires another system to coordinate work and resolve edit
conflicts, or to create alternate branches that someone else will merge
into the common trunk: programmers create their own branches, not seen by
others, until they submit a proposed branch for review by more privileged
users. Even if that branch is rejected for merging into the trunk, it will
not necessarily be deleted: the programmer/designer can still use their own
branch without affecting other users who use the common trunk, who design
or use their own branches, or who want to keep an older version of the
trunk and ignore new versions.

We are clearly out of the scope of Unicode, because we are not speaking
about text but about programming tools and services, and about models of
operation for cooperating teams. Those teams will include various types of
people: not just designers and programmers but also end users and customers
creating their own customizations, adding their own features and data, and
interoperating through various "programming languages" and tools with
various UIs that are friendlier than traditional linear, text-based
programming languages.
From 0.le.phare.ouest at gmail.com Sat Feb 14 12:12:35 2015
From: 0.le.phare.ouest at gmail.com (Antoine Méric)
Date: Sat, 14 Feb 2015 19:12:35 +0100
Subject: sex and emoji
In-Reply-To:
References: <54DD1F6F.8090002@ix.netcom.com>
Message-ID: <54DF9013.2020406@gmail.com>

I was wondering, has the question "Given the massive usage over time of
the glyph, and the number of academic papers about it, should we consider
adding a PHALLIC REPRESENTATION to the Unicode standard?" ever been asked?

Seriously,
Antoine MÉRIC

On 13/02/2015 08:03, Anshuman Pandey wrote:
> Never would have imagined 'sex' and 'Unicode' in the memetic scene,
> but a big ol' ?? to the UTC! Kudos, rather ??.
>
> On Feb 12, 2015, at 4:47 PM, Asmus Freytag wrote:
>
>> To quote: "While this probably isn't news to fans of the eggplant
>> emoji, ...."
>>
>> More here:
>>
>> http://time.com/3694763/match-com-dating-survey-emoji-sex/
>>
>> A./

From timpart at perdix.demon.co.uk Mon Feb 16 01:25:12 2015
From: timpart at perdix.demon.co.uk (Tim Partridge)
Date: Mon, 16 Feb 2015 07:25:12 +0000
Subject: sex and emoji
In-Reply-To: <54DF9013.2020406@gmail.com>
References: <54DD1F6F.8090002@ix.netcom.com>, <54DF9013.2020406@gmail.com>
Message-ID: <8C324C32065663409974565298FC6EC52D1BACB8@exmbx04.thus.corp>

Antoine Méric said

> I was wondering, has the question "Given the massive usage over time
> of the glyph, and the number of academic papers about it, should we
> consider adding a PHALLIC REPRESENTATION to the Unicode standard?" ever
> been asked?
The Ancient Egyptians had some glyphs to act as determinatives for words
that relate to that topic, and they are encoded in the standard. See
U+130B8 to U+130BA.

Tim

From ian.clifton at chem.ox.ac.uk Mon Feb 16 05:48:07 2015
From: ian.clifton at chem.ox.ac.uk (Ian Clifton)
Date: Mon, 16 Feb 2015 11:48:07 +0000
Subject: sex and emoji
References: <54DD1F6F.8090002@ix.netcom.com> <54DF9013.2020406@gmail.com> <8C324C32065663409974565298FC6EC52D1BACB8@exmbx04.thus.corp>
Message-ID: <4qd25a2i2w.fsf@chem-arachne.chem.ox.ac.uk>

Tim Partridge writes:

> Antoine Méric said
>> I was wondering, has the question "Given the massive usage over time
>> of the glyph, and the number of academic papers about it, should we
>> consider adding a PHALLIC REPRESENTATION to the Unicode standard?" ever
>> been asked?
>
> The Ancient Egyptians had some glyphs to act as determinatives for
> words that relate to that topic, and they are encoded in the standard.
> See U+130B8 to U+130BA.

Good grief, I don't like the look of U+130B9. Maybe I don't want to know
what's going on ??.

-- 
Ian

From timpart at perdix.demon.co.uk Mon Feb 16 13:42:17 2015
From: timpart at perdix.demon.co.uk (Tim Partridge)
Date: Mon, 16 Feb 2015 19:42:17 +0000
Subject: sex and emoji
In-Reply-To: <4qd25a2i2w.fsf@chem-arachne.chem.ox.ac.uk>
References: <54DD1F6F.8090002@ix.netcom.com> <54DF9013.2020406@gmail.com> <8C324C32065663409974565298FC6EC52D1BACB8@exmbx04.thus.corp>, <4qd25a2i2w.fsf@chem-arachne.chem.ox.ac.uk>
Message-ID: <8C324C32065663409974565298FC6EC52D1BAD13@exmbx04.thus.corp>

Ian Clifton said:

> Good grief, I don't like the look of U+130B9. Maybe I don't want to know
> what's going on ??.

I'm not an Egyptologist, but I think it's just a scribal ligature between
U+132F4 and U+130B8. The former is just a folded piece of cloth
representing the sound /s/. Usually a long thin sign is combined with a
following sign to save space.
These two wouldn't fit together well, so I guess the scribes decided to
just put one on top of the other. There are other similar examples in the
code charts.

Tim

From eliz at gnu.org Thu Feb 19 04:55:20 2015
From: eliz at gnu.org (Eli Zaretskii)
Date: Thu, 19 Feb 2015 12:55:20 +0200
Subject: Compatibility decomposition for Hebrew and Greek final letters
Message-ID: <83d25641d3.fsf@gnu.org>

Does anyone know why the UCD defines compatibility decompositions
for Arabic initial, medial, and final forms, but doesn't do the same
for Hebrew final letters, like U+05DD HEBREW LETTER FINAL MEM? Or for
that matter, for U+03C2 GREEK SMALL LETTER FINAL SIGMA?

The relevant application where this would matter is text search, where
these letters might be folded to the same code point for the purposes
of comparison.

TIA

From everson at evertype.com Thu Feb 19 05:21:19 2015
From: everson at evertype.com (Michael Everson)
Date: Thu, 19 Feb 2015 11:21:19 +0000
Subject: Compatibility decomposition for Hebrew and Greek final letters
In-Reply-To: <83d25641d3.fsf@gnu.org>
References: <83d25641d3.fsf@gnu.org>
Message-ID: <03FFAE2B-CD9A-470C-BCCB-62001B95F4CF@evertype.com>

On 19 Feb 2015, at 10:55, Eli Zaretskii wrote:

> Does anyone know why the UCD defines compatibility decompositions
> for Arabic initial, medial, and final forms, but doesn't do the same
> for Hebrew final letters, like U+05DD HEBREW LETTER FINAL MEM? Or for
> that matter, for U+03C2 GREEK SMALL LETTER FINAL SIGMA?
>
> The relevant application where this would matter is text search, where
> these letters might be folded to the same code point for the purposes
> of comparison.

Such comparisons happen at a different level, I think.
Michael Everson * http://www.evertype.com/ From eliz at gnu.org Thu Feb 19 05:30:22 2015 From: eliz at gnu.org (Eli Zaretskii) Date: Thu, 19 Feb 2015 13:30:22 +0200 Subject: Compatibility decomposition for Hebrew and Greek final letters In-Reply-To: <03FFAE2B-CD9A-470C-BCCB-62001B95F4CF@evertype.com> References: <83d25641d3.fsf@gnu.org> <03FFAE2B-CD9A-470C-BCCB-62001B95F4CF@evertype.com> Message-ID: <837fve3zqp.fsf@gnu.org> > From: Michael Everson > Date: Thu, 19 Feb 2015 11:21:19 +0000 > > On 19 Feb 2015, at 10:55, Eli Zaretskii wrote: > > > Does anyone know why does the UCD define compatibility decompositions > > for Arabic initial, medial, and final forms, but doesn't do the same > > for Hebrew final letters, like U+05DD HEBREW LETTER FINAL MEM? Or for > > that matter, for U+03C2 GREEK SMALL LETTER FINAL SIGMA? > > > > The relevant application where this would matter is text search, where > > these letters might be folded to the same code point for the purposes > > of comparison. > > Such comparisons happen at a different level, I think. Sorry, I'm not sure I follow: different from what? In any case, regardless of the level, if there's no data to support such "folding", how can applications implement it (except by inventing its own data)? Also, perhaps there are some deep linguistic reasons why such folding might be inappropriate, and that's why the UCD doesn't define such decompositions? Thanks. From jcb+unicode at inf.ed.ac.uk Thu Feb 19 05:47:24 2015 From: jcb+unicode at inf.ed.ac.uk (Julian Bradfield) Date: Thu, 19 Feb 2015 11:47:24 GMT Subject: Compatibility decomposition for Hebrew and Greek final letters References: <83d25641d3.fsf@gnu.org> Message-ID: On 2015-02-19, Eli Zaretskii wrote: > Does anyone know why does the UCD define compatibility decompositions > for Arabic initial, medial, and final forms, but doesn't do the same > for Hebrew final letters, like U+05DD HEBREW LETTER FINAL MEM? Or for > that matter, for U+03C2 GREEK SMALL LETTER FINAL SIGMA? 
As far as I understand it: In Arabic, the variant of a letter is determined entirely by its position, so there is no compelling need to represent the forms separately (as characters rather than glyphs) save for the existence of legacy standards (and if there is, you can use the ZWJ/ZWNJ hacks). Thus the forms would not have been encoded but for the legacy standards. Whereas in Hebrew, non-final forms appear finally in certain contexts in normal text; and in Greek, while Greek text may have a determinate choice between ? and ?, there are many contexts where the two symbols are distinguished (not least maths). -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From eliz at gnu.org Thu Feb 19 05:59:44 2015 From: eliz at gnu.org (Eli Zaretskii) Date: Thu, 19 Feb 2015 13:59:44 +0200 Subject: Compatibility decomposition for Hebrew and Greek final letters In-Reply-To: References: <83d25641d3.fsf@gnu.org> Message-ID: <834mqi3ydr.fsf@gnu.org> > Date: Thu, 19 Feb 2015 11:47:24 GMT > From: Julian Bradfield > > In Arabic, the variant of a letter is determined entirely by its > position, so there is no compelling need to represent the forms separately > (as characters rather than glyphs) save for the existence of legacy > standards (and if there is, you can use the ZWJ/ZWNJ hacks). Thus the > forms would not have been encoded but for the legacy standards. > Whereas in Hebrew, non-final forms appear finally in certain contexts > in normal text; and in Greek, while Greek text may have a determinate > choice between ? and ?, there are many contexts where the two symbols > are distinguished (not least maths). Got it, thanks. 
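The asymmetry discussed above can be checked against the UCD directly. What follows is a minimal sketch, assuming Python's standard `unicodedata` module (not something mentioned in the thread) and taking U+FEE2 ARABIC LETTER MEEM FINAL FORM as an illustrative Arabic presentation form; the helper name `compat_info` is hypothetical.

```python
import unicodedata

def compat_info(ch):
    """Return (name, raw UCD decomposition field, NFKD form) for one character."""
    return (
        unicodedata.name(ch),
        unicodedata.decomposition(ch),      # empty string when the UCD defines none
        unicodedata.normalize("NFKD", ch),
    )

# An Arabic presentation form, Hebrew final mem, and Greek final sigma.
for ch in ["\uFEE2", "\u05DD", "\u03C2"]:
    name, decomp, nfkd = compat_info(ch)
    folded = " ".join(f"U+{ord(c):04X}" for c in nfkd)
    print(f"U+{ord(ch):04X} {name}: decomposition={decomp!r}, NFKD={folded}")
```

Only the Arabic presentation form carries a `<final>` compatibility decomposition; the Hebrew and Greek final letters normalize to themselves, which is the behaviour the original question asks about.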
From verdy_p at wanadoo.fr Thu Feb 19 13:31:07 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 19 Feb 2015 20:31:07 +0100 Subject: Compatibility decomposition for Hebrew and Greek final letters In-Reply-To: <834mqi3ydr.fsf@gnu.org> References: <83d25641d3.fsf@gnu.org> <834mqi3ydr.fsf@gnu.org> Message-ID: The decompositions are not needed for plain text searches, that can use the collation data (with the collation data, you can unify at the primary level differences such as capitalisation and ignore diacritics, or transform some base groups of letters into a single entry, or make some significant primary difference when there are diacritics (for example in German equating 'ae' and '?' at the primary level). Yes, collation must use the canonical decompositions, but does not need to follow the compatibility decompositions for all locales (even if this is done for the root locale and the DUCET... with some exceptions considering the rules for the most important language using an encoded letter and all its *canonical* equivalents). Compatibility decompositions in the UCD have little use, they should be preserved in encoded texts and transformations of text, they are just suggestions which *may* be useful: - for rendering text (the most important use is in character mappings within fonts, or in fallback mappings implemented in the rendering engine), - or for mappings to legacy encodings (e.g. when converting to GSM for SMS services, or converting for display in text-only devices and terminals using a limited OEM charset) 2015-02-19 12:59 GMT+01:00 Eli Zaretskii : > > Date: Thu, 19 Feb 2015 11:47:24 GMT > > From: Julian Bradfield > > > > In Arabic, the variant of a letter is determined entirely by its > > position, so there is no compelling need to represent the forms > separately > > (as characters rather than glyphs) save for the existence of legacy > > standards (and if there is, you can use the ZWJ/ZWNJ hacks). 
Thus the > > forms would not have been encoded but for the legacy standards. > > Whereas in Hebrew, non-final forms appear finally in certain contexts > > in normal text; and in Greek, while Greek text may have a determinate > > choice between ? and ?, there are many contexts where the two symbols > > are distinguished (not least maths). > > Got it, thanks. > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eliz at gnu.org Thu Feb 19 14:17:30 2015 From: eliz at gnu.org (Eli Zaretskii) Date: Thu, 19 Feb 2015 22:17:30 +0200 Subject: Compatibility decomposition for Hebrew and Greek final letters In-Reply-To: References: <83d25641d3.fsf@gnu.org> <834mqi3ydr.fsf@gnu.org> Message-ID: <83zj89lkpx.fsf@gnu.org> > From: Philippe Verdy > Date: Thu, 19 Feb 2015 20:31:07 +0100 > Cc: Julian Bradfield , > unicode Unicode Discussion > > The decompositions are not needed for plain text searches, that can use the > collation data (with the collation data, you can unify at the primary level > differences such as capitalisation and ignore diacritics, or transform some > base groups of letters into a single entry, or make some significant primary > difference when there are diacritics (for example in German equating 'ae' and > '?' at the primary level). Sorry, I disagree. First, collation data is overkill for search, since the order information is not required, so the weights are simply wasting storage. Second, people do want to find, e.g., "?" when they search for "2" etc. I'm not saying that they _always_ want that, but sometimes they do. There's no reason a sophisticated text editor shouldn't support such a feature, under user control. 
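The kind of search described above can be sketched as a comparison of NFKD forms. This is only an illustration, assuming Python's standard `unicodedata` module; the function name `nfkd_find` is hypothetical, and U+00B2 SUPERSCRIPT TWO is used as a stand-in because the character in the archived message is garbled.

```python
import unicodedata

def nfkd_find(haystack, needle):
    """Find NFKD(needle) inside NFKD(haystack).

    The returned index refers to the normalized haystack, not the original
    text; a real editor would have to map it back to a buffer position.
    Returns -1 when there is no match.
    """
    norm_haystack = unicodedata.normalize("NFKD", haystack)
    norm_needle = unicodedata.normalize("NFKD", needle)
    return norm_haystack.find(norm_needle)

# Searching for "2" matches SUPERSCRIPT TWO after compatibility folding.
print(nfkd_find("x\u00B2 + 1", "2"))
# Searching for MEM does not match FINAL MEM, since no decomposition links them.
print(nfkd_find("\u05E9\u05DC\u05D5\u05DD", "\u05DE"))
```

The second call shows the gap raised at the start of the thread: NFKD alone does not fold Hebrew final letters onto their non-final counterparts.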
From markus.icu at gmail.com Thu Feb 19 15:08:57 2015 From: markus.icu at gmail.com (Markus Scherer) Date: Thu, 19 Feb 2015 13:08:57 -0800 Subject: Compatibility decomposition for Hebrew and Greek final letters In-Reply-To: <83zj89lkpx.fsf@gnu.org> References: <83d25641d3.fsf@gnu.org> <834mqi3ydr.fsf@gnu.org> <83zj89lkpx.fsf@gnu.org> Message-ID: On Thu, Feb 19, 2015 at 12:17 PM, Eli Zaretskii wrote: > Sorry, I disagree. First, collation data is overkill for search, > since the order information is not required, so the weights are simply > wasting storage. Second, people do want to find, e.g., "?" when they > search for "2" etc. > Depends on what you do. "the weights are simply wasting storage" is not really true, you do have to encode something for which characters are same or different, and it turns out that that comes close to defining a sort order. Some people also want to ignore accents, others don't. As to your original question, Unicode collation would give you primary-equal "mem" and "sigma" characters. 05DE; [63 1E, 05, 05] # Hebr Lo [1F81.0020.0002] * HEBREW LETTER MEM FB26; [63 1E, 05, 20] # Hebr Lo [1F81.0020.0005] * HEBREW LETTER WIDE FINAL MEM 05DD; [63 1E, 05, 2E] # Hebr Lo [1F81.0020.0019] * HEBREW LETTER FINAL MEM FB3E; [63 1E, 05, 05][, E5 B1, 05] # Hebr Lo [1F81.0020.0002][0000.005F.0002] * HEBREW LETTER MEM WITH DAGESH 03C3; [5F 42, 05, 05] # Grek Ll [1C95.0020.0002] * GREEK SMALL LETTER SIGMA 03F2; [5F 42, 05, 10] # Grek Ll [1C95.0020.0004] * GREEK LUNATE SIGMA SYMBOL 1D6D3; [5F 42, 05, 17] # Zyyy Ll [1C95.0020.0005] * MATHEMATICAL BOLD SMALL FINAL SIGMA ... 03C2; [5F 42, 05, 33] # Grek Ll [1C95.0020.0019] * GREEK SMALL LETTER FINAL SIGMA You can certainly simplify a few things when you don't care about the order, therefore CLDR defines "search" tailorings. Some popular browsers use collation-based search for ctrl-F in-page search, either with strength=primary (ignore accent/case/etc. variants), or with asymmetric search. 
ICU implements those algorithms and carries the CLDR tailorings. See http://www.unicode.org/reports/tr10/#Searching Best regards, markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Thu Feb 19 16:02:57 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 19 Feb 2015 22:02:57 +0000 Subject: Compatibility decomposition for Hebrew and Greek final letters In-Reply-To: <83zj89lkpx.fsf@gnu.org> References: <83d25641d3.fsf@gnu.org> <834mqi3ydr.fsf@gnu.org> <83zj89lkpx.fsf@gnu.org> Message-ID: <20150219220257.6833468f@JRWUBU2> On Thu, 19 Feb 2015 22:17:30 +0200 Eli Zaretskii wrote: > First, collation data is overkill for search, > since the order information is not required, so the weights are simply > wasting storage. The big waste is not in text-dependent storage, but in the processing for search orders that bear little relationship to alphabetical order. As Markus pointed out, most of that overhead is removed from processing by the use of special 'search' collations. > Second, people do want to find, e.g., "?" when they > search for "2" etc. I'm not saying that they _always_ want that, but > sometimes they do. There's no reason a sophisticated text editor > shouldn't support such a feature, under user control. I think one problem is disbelief in the existence of enough sophisticated users to matter. I gather it can be quite hard to obtain a Swedish interface for editing Thai. Richard. 
From duerst at it.aoyama.ac.jp Thu Feb 19 20:50:17 2015 From: duerst at it.aoyama.ac.jp (=?UTF-8?B?Ik1hcnRpbiBKLiBEw7xyc3Qi?=) Date: Fri, 20 Feb 2015 11:50:17 +0900 Subject: Compatibility decomposition for Hebrew and Greek final letters In-Reply-To: <83zj89lkpx.fsf@gnu.org> References: <83d25641d3.fsf@gnu.org> <834mqi3ydr.fsf@gnu.org> <83zj89lkpx.fsf@gnu.org> Message-ID: <54E6A0E9.50504@it.aoyama.ac.jp> On 2015/02/20 05:17, Eli Zaretskii wrote: >> From: Philippe Verdy >> Date: Thu, 19 Feb 2015 20:31:07 +0100 >> Cc: Julian Bradfield , >> unicode Unicode Discussion >> >> The decompositions are not needed for plain text searches, that can use the >> collation data (with the collation data, you can unify at the primary level >> differences such as capitalisation and ignore diacritics, or transform some >> base groups of letters into a single entry, or make some significant primary >> difference when there are diacritics (for example in German equating 'ae' and >> '?' at the primary level). > > Sorry, I disagree. First, collation data is overkill for search, > since the order information is not required, so the weights are simply > wasting storage. Second, people do want to find, e.g., "?" when they > search for "2" etc. I'm not saying that they _always_ want that, but > sometimes they do. There's no reason a sophisticated text editor > shouldn't support such a feature, under user control. Well, for cased scripts, search is usually case-insensitive, but case conversions aren't given by compatibility decompositions. If the question isn't "Why are there equivalences useful for search that are not covered by compatibility decompositions?", but "Why doesn't Unicode provide some data for final/non-final Hebrew letter correspondence?", maybe the answer is that it hasn't been seen as a need up to now because it's so easy to figure out. Regards, Martin. 
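To make the "easy to figure out" remark concrete, here is a hand-written fold table of the sort an application could maintain itself. It is a sketch under stated assumptions: the table and the names `FINAL_FORM_FOLDS` and `fold_finals` are hypothetical and are not data published in the UCD; only the code points and character names come from the standard.

```python
# Map each final form to its non-final counterpart (code point -> code point).
FINAL_FORM_FOLDS = {
    0x05DA: 0x05DB,  # HEBREW LETTER FINAL KAF   -> HEBREW LETTER KAF
    0x05DD: 0x05DE,  # HEBREW LETTER FINAL MEM   -> HEBREW LETTER MEM
    0x05DF: 0x05E0,  # HEBREW LETTER FINAL NUN   -> HEBREW LETTER NUN
    0x05E3: 0x05E4,  # HEBREW LETTER FINAL PE    -> HEBREW LETTER PE
    0x05E5: 0x05E6,  # HEBREW LETTER FINAL TSADI -> HEBREW LETTER TSADI
    0x03C2: 0x03C3,  # GREEK SMALL LETTER FINAL SIGMA -> GREEK SMALL LETTER SIGMA
}

def fold_finals(text):
    """Replace final letter forms by their non-final forms for loose matching."""
    return text.translate(FINAL_FORM_FOLDS)

def loosely_equal(a, b):
    """Compare two strings while ignoring the final/non-final distinction."""
    return fold_finals(a) == fold_finals(b)
```

Applying `fold_finals` to both the search string and the text before comparing gives the folding behaviour asked for, at the cost of maintaining the table by hand.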
From public at khwilliamson.com Thu Feb 19 20:55:20 2015 From: public at khwilliamson.com (Karl Williamson) Date: Thu, 19 Feb 2015 19:55:20 -0700 Subject: Question about the Sentence_Break property Message-ID: <54E6A218.3030500@khwilliamson.com> UAX 29 says this: Break after paragraph separators. SB4. Sep | CR | LF Why are CR and LF considered to be paragraph separators? NEL and Line Break are as well. My mental model of plain text has it containing embedded characters, which I'll call \n, to allow it to be displayed in a terminal window of a given width. Not all text is like that, of course, but there is an awful lot that is. This rule makes no sense to me. From duerst at it.aoyama.ac.jp Thu Feb 19 21:01:00 2015 From: duerst at it.aoyama.ac.jp (=?UTF-8?B?Ik1hcnRpbiBKLiBEw7xyc3Qi?=) Date: Fri, 20 Feb 2015 12:01:00 +0900 Subject: Compatibility decomposition for Hebrew and Greek final letters In-Reply-To: References: <83d25641d3.fsf@gnu.org> Message-ID: <54E6A36C.80903@it.aoyama.ac.jp> On 2015/02/19 20:47, Julian Bradfield wrote: > On 2015-02-19, Eli Zaretskii wrote: >> Does anyone know why does the UCD define compatibility decompositions >> for Arabic initial, medial, and final forms, but doesn't do the same >> for Hebrew final letters, like U+05DD HEBREW LETTER FINAL MEM? Or for >> that matter, for U+03C2 GREEK SMALL LETTER FINAL SIGMA? > > As far as I understand it: > In Arabic, the variant of a letter is determined entirely by its > position, so there is no compelling need to represent the forms separately > (as characters rather than glyphs) save for the existence of legacy > standards (and if there is, you can use the ZWJ/ZWNJ hacks). Thus the > forms would not have been encoded but for the legacy standards. > Whereas in Hebrew, non-final forms appear finally in certain contexts > in normal text; and in Greek, while Greek text may have a determinate > choice between ? and ?, there are many contexts where the two symbols > are distinguished (not least maths). 
Digging a bit deeper, the phenomenon of a letter changing shape depending on position is pervasive in Arabic, and involves complicated interdependencies across multiple characters in good-quality typography. But in Hebrew, this phenomenon is minor, and marginal in Greek, and typographic interactions are also very limited. That led to (after some initial tries with alternatives) different encoding models. In Arabic, shaping is the job of the rendering engine, whereas in Hebrew and Greek, it's part of the encoding. As for determinate choice between ? and ?, John Cowan once gave an example of a Greek word (composed of two original words) with a final sigma in the middle. Regards, Martin. From verdy_p at wanadoo.fr Thu Feb 19 21:47:52 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 20 Feb 2015 04:47:52 +0100 Subject: Compatibility decomposition for Hebrew and Greek final letters In-Reply-To: <83zj89lkpx.fsf@gnu.org> References: <83d25641d3.fsf@gnu.org> <834mqi3ydr.fsf@gnu.org> <83zj89lkpx.fsf@gnu.org> Message-ID: 2015-02-19 21:17 GMT+01:00 Eli Zaretskii : > > From: Philippe Verdy > > Date: Thu, 19 Feb 2015 20:31:07 +0100 > > Cc: Julian Bradfield , > > unicode Unicode Discussion > > > > The decompositions are not needed for plain text searches, that can use > the > > collation data (with the collation data, you can unify at the primary > level > > differences such as capitalisation and ignore diacritics, or transform > some > > base groups of letters into a single entry, or make some significant > primary > > difference when there are diacritics (for example in German equating > 'ae' and > > '?' at the primary level). > > Sorry, I disagree. First, collation data is overkill for search, > since the order information is not required, so the weights are simply > wasting storage. Second, people do want to find, e.g., "?" when they > search for "2" etc. I'm not saying that they _always_ want that, but > sometimes they do. 
There's no reason a sophisticated text editor > shouldn't support such a feature, under user control. > The weights or the collation strings do not need to be stored. Even database engines or plain-text search engines on the web now provide collation algorithms for searching or sorting data, so that you don't need to store it in your tables... It is not overkill, as good implementations of collation are effectively used in high-performance database servers (and many users of these databases do not realize that collation is effectively used). There are also good text editors implementing collation searches. From richard.wordingham at ntlworld.com Thu Feb 19 22:45:58 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 20 Feb 2015 04:45:58 +0000 Subject: Compatibility decomposition for Hebrew and Greek final letters In-Reply-To: <54E6A0E9.50504@it.aoyama.ac.jp> References: <83d25641d3.fsf@gnu.org> <834mqi3ydr.fsf@gnu.org> <83zj89lkpx.fsf@gnu.org> <54E6A0E9.50504@it.aoyama.ac.jp> Message-ID: <20150220044558.24cb0b94@JRWUBU2> On Fri, 20 Feb 2015 11:50:17 +0900 "Martin J. Dürst" wrote: > If the question isn't "Why are there equivalences useful for search > that are not covered by compatibility decompositions?", but "Why > doesn't Unicode provide some data for final/non-final Hebrew letter > correspondence?", maybe the answer is that it hasn't been seen as a > need up to now because it's so easy to figure out. But as already pointed out, Unicode does provide data for the correspondence, in the form of collation weightings in DUCET. CLDR allows degrees of sameness to be recorded differently for different contexts, as is eminently reasonable. Richard.
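The point that the collation data already records the correspondence can be illustrated with the lines Markus Scherer quoted earlier in the thread. The sketch below groups characters by the first field of the bracketed triple in the trailing comment; reading that field as the primary weight is an assumption based on how the lines were presented, and the regular expression and the name `group_by_primary` are hypothetical rather than part of any published parser.

```python
import re
from collections import defaultdict

# Four of the lines quoted in the thread, copied verbatim.
SAMPLE = """\
05DE; [63 1E, 05, 05] # Hebr Lo [1F81.0020.0002] * HEBREW LETTER MEM
05DD; [63 1E, 05, 2E] # Hebr Lo [1F81.0020.0019] * HEBREW LETTER FINAL MEM
03C3; [5F 42, 05, 05] # Grek Ll [1C95.0020.0002] * GREEK SMALL LETTER SIGMA
03C2; [5F 42, 05, 33] # Grek Ll [1C95.0020.0019] * GREEK SMALL LETTER FINAL SIGMA
"""

# Code point, then the first [pppp.ssss.tttt] triple after the '#'.
LINE = re.compile(
    r"^([0-9A-F]+);.*#.*?\[([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4})\]"
)

def group_by_primary(text):
    """Group characters whose quoted weights share the same first field."""
    groups = defaultdict(list)
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            groups[m.group(2)].append(chr(int(m.group(1), 16)))
    return dict(groups)

for primary, chars in group_by_primary(SAMPLE).items():
    print(primary, [f"U+{ord(c):04X}" for c in chars])
```

In this sample, mem and final mem fall into one group and sigma and final sigma into another, differing only in the last field, which matches the "primary-equal" description given in the thread.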
From richard.wordingham at ntlworld.com Thu Feb 19 23:14:30 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 20 Feb 2015 05:14:30 +0000 Subject: Question about the Sentence_Break property In-Reply-To: <54E6A218.3030500@khwilliamson.com> References: <54E6A218.3030500@khwilliamson.com> Message-ID: <20150220051430.1401c498@JRWUBU2> On Thu, 19 Feb 2015 19:55:20 -0700 Karl Williamson wrote: > UAX 29 says this: > > Break after paragraph separators. > SB4. Sep | CR | LF > > Why are CR and LF considered to be paragraph separators? NEL and > Line Break are as well. > > My mental model of plain text has it containing embedded characters, > which I'll call \n, to allow it to be displayed in a terminal window > of a given width. Not all text is like that, of course, but there is > an awful lot that is. This rule makes no sense to me. There are two types of plain text - that which requires explicit line-breaking, and that which does not. This is a case where a non-linguistic tailoring is required. TUS has a whole section on the issue, namely TUS 7.0.0 Section 5.8. One thing that is missing is mention of the convention that a single newline character (or CRLF pair) is a line break whereas a doubled newline character denotes a paragraph break. Richard. From eliz at gnu.org Fri Feb 20 01:51:45 2015 From: eliz at gnu.org (Eli Zaretskii) Date: Fri, 20 Feb 2015 09:51:45 +0200 Subject: Compatibility decomposition for Hebrew and Greek final letters In-Reply-To: References: <83d25641d3.fsf@gnu.org> <834mqi3ydr.fsf@gnu.org> <83zj89lkpx.fsf@gnu.org> Message-ID: <83y4ntkoku.fsf@gnu.org> > Date: Thu, 19 Feb 2015 13:08:57 -0800 > From: Markus Scherer > Cc: Philippe Verdy , Julian Bradfield , > Unicode Mailing List > > Sorry, I disagree. First, collation data is overkill for search, > since the order information is not required, so the weights are simply > wasting storage. Second, people do want to find, e.g., "?" when they > search for "2" etc. 
> > Depends on what you do. The context is text search, where the user enters the search string and specifies the strength of the required matches, and the editor then searches a (potentially very large) buffer of text. > "the weights are simply wasting storage" is not really > true, you do have to encode something for which characters are same or > different, and it turns out that that comes close to defining a sort order. > Some people also want to ignore accents, others don't. I think decomposition to NFKD solves these issues, doesn't it? > As to your original question, Unicode collation would give you primary-equal > "mem" and "sigma" characters. > 05DE; [63 1E, 05, 05] # Hebr Lo [1F81.0020.0002] * HEBREW LETTER MEM > FB26; [63 1E, 05, 20] # Hebr Lo [1F81.0020.0005] * HEBREW LETTER WIDE FINAL MEM > 05DD; [63 1E, 05, 2E] # Hebr Lo [1F81.0020.0019] * HEBREW LETTER FINAL MEM > FB3E; [63 1E, 05, 05][, E5 B1, 05] # Hebr Lo [1F81.0020.0002][0000.005F.0002] * > HEBREW LETTER MEM WITH DAGESH > > 03C3; [5F 42, 05, 05] # Grek Ll [1C95.0020.0002] * GREEK SMALL LETTER SIGMA > 03F2; [5F 42, 05, 10] # Grek Ll [1C95.0020.0004] * GREEK LUNATE SIGMA SYMBOL > 1D6D3; [5F 42, 05, 17] # Zyyy Ll [1C95.0020.0005] * MATHEMATICAL BOLD SMALL > FINAL SIGMA > ... > 03C2; [5F 42, 05, 33] # Grek Ll [1C95.0020.0019] * GREEK SMALL LETTER FINAL > SIGMA > > You can certainly simplify a few things when you don't care about the order, > therefore CLDR defines "search" tailorings. Some popular browsers use > collation-based search for ctrl-F in-page search, either with strength=primary > (ignore accent/case/etc. variants), or with asymmetric search. ICU implements > those algorithms and carries the CLDR tailorings. > > See http://www.unicode.org/reports/tr10/#Searching Thanks. I've studied that already, and I do know that collation data can be used for search. But it's still a lot of data that I'd like to avoid loading, if possible. 
From eliz at gnu.org Fri Feb 20 02:04:32 2015 From: eliz at gnu.org (Eli Zaretskii) Date: Fri, 20 Feb 2015 10:04:32 +0200 Subject: Compatibility decomposition for Hebrew and Greek final letters In-Reply-To: <20150219220257.6833468f@JRWUBU2> References: <83d25641d3.fsf@gnu.org> <834mqi3ydr.fsf@gnu.org> <83zj89lkpx.fsf@gnu.org> <20150219220257.6833468f@JRWUBU2> Message-ID: <83wq3dknzj.fsf@gnu.org> > Date: Thu, 19 Feb 2015 22:02:57 +0000 > From: Richard Wordingham > > > First, collation data is overkill for search, > > since the order information is not required, so the weights are simply > > wasting storage. > > The big waste is not in text-dependent storage, but in the > processing for search orders that bear little relationship to > alphabetical order. Sorry, I don't think I follow: what is "processing for search orders" to which you allude here? > > Second, people do want to find, e.g., "?" when they > > search for "2" etc. I'm not saying that they _always_ want that, but > > sometimes they do. There's no reason a sophisticated text editor > > shouldn't support such a feature, under user control. > > I think one problem is disbelief in the existence of enough > sophisticated users to matter. I gather it can be quite hard to obtain > a Swedish interface for editing Thai. I'm not talking about localized features, like for "?" to match "aa" in Danish locales. I'm talking about matching strings that are equivalent under canonical and compatibility decompositions. As for user sophistication, AFAIR, Microsoft Word finds "?" when you search for "2" by default, so it sounds like Word considers all users sophisticated enough for that. I think that's a solid enough precedent to follow. 
From eliz at gnu.org Fri Feb 20 02:06:37 2015 From: eliz at gnu.org (Eli Zaretskii) Date: Fri, 20 Feb 2015 10:06:37 +0200 Subject: Compatibility decomposition for Hebrew and Greek final letters In-Reply-To: <54E6A0E9.50504@it.aoyama.ac.jp> References: <83d25641d3.fsf@gnu.org> <834mqi3ydr.fsf@gnu.org> <83zj89lkpx.fsf@gnu.org> <54E6A0E9.50504@it.aoyama.ac.jp> Message-ID: <83vbixknw2.fsf@gnu.org> > Date: Fri, 20 Feb 2015 11:50:17 +0900 > From: "Martin J. D?rst" > CC: jcb+unicode at inf.ed.ac.uk, unicode at unicode.org > > Well, for cased scripts, search is usually case-insensitive, but case > conversions aren't given by compatibility decompositions. That's true, but comparing NFKD-decomposed sequences case-insensitively is not very hard, is it? > If the question isn't "Why are there equivalences useful for search that > are not covered by compatibility decompositions?", but "Why doesn't > Unicode provide some data for final/non-final Hebrew letter > correspondence?", maybe the answer is that it hasn't been seen as a need > up to now because it's so easy to figure out. It's easy to figure out if you read the script. And even if you do, you will have to prepare additional data, instead of just using UCD. But I do get the point, thanks. From eliz at gnu.org Fri Feb 20 02:13:41 2015 From: eliz at gnu.org (Eli Zaretskii) Date: Fri, 20 Feb 2015 10:13:41 +0200 Subject: Compatibility decomposition for Hebrew and Greek final letters In-Reply-To: References: <83d25641d3.fsf@gnu.org> <834mqi3ydr.fsf@gnu.org> <83zj89lkpx.fsf@gnu.org> Message-ID: <83twyhknka.fsf@gnu.org> > From: Philippe Verdy > Date: Fri, 20 Feb 2015 04:47:52 +0100 > Cc: jcb+unicode at inf.ed.ac.uk, unicode Unicode Discussion > > Sorry, I disagree. First, collation data is overkill for search, > since the order information is not required, so the weights are simply > wasting storage. Second, people do want to find, e.g., "?" when they > search for "2" etc. 
I'm not saying that they _always_ want that, but > sometimes they do. There's no reason a sophisticated text editor > shouldn't support such a feature, under user control. > > The weights or the collation strings do not need to be stored. Even database > engines or plain-text search engines on the web provide now collation > algorithms for searching or sorting data, so that you don't need to store it in > your tables... It is not overkill, as good implementations of collation are > efefctively used in high-permance database servers (and many users of these > databases do not realize that collation is effectively used. I'm talking specifically about Emacs. Emacs provides locale-dependent collation, but it relies on the underlying platform libraries to do the work, it doesn't itself load the DUCET database, or anything similar to it. By contrast, Emacs does have an efficient-storage implementation of the UCD, and by virtue of that, accessing decomposition data and performing normalization is at my fingertips. So I'd like to avoid loading DUCET, and doing so just for the sake of a few characters mentioned in this thread doesn't sound justified; it's much easier to have a small database of additional equivalences. > There are also good text editors implementing collation searches. Could you mention their names, please? Thanks. 
From richard.wordingham at ntlworld.com Fri Feb 20 09:01:34 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 20 Feb 2015 15:01:34 +0000 Subject: Compatibility decomposition for Hebrew and Greek final letters In-Reply-To: <83wq3dknzj.fsf@gnu.org> References: <83d25641d3.fsf@gnu.org> <834mqi3ydr.fsf@gnu.org> <83zj89lkpx.fsf@gnu.org> <20150219220257.6833468f@JRWUBU2> <83wq3dknzj.fsf@gnu.org> Message-ID: <20150220150134.6fc5663b@JRWUBU2> On Fri, 20 Feb 2015 10:04:32 +0200 Eli Zaretskii wrote: > > Date: Thu, 19 Feb 2015 22:02:57 +0000 > > From: Richard Wordingham > > > > > First, collation data is overkill for search, > > > since the order information is not required, so the weights are > > > simply wasting storage. > > > > The big waste is not in text-dependent storage, but in the > > processing for search orders that bear little relationship to > > alphabetical order. > > Sorry, I don't think I follow: what is "processing for search orders" > to which you allude here? The examples in the CLDR root locale and in DUCET are the massive sets of 'contractions' of consonants with vowels written before the associated consonant in the scripts where spacing characters are stored in the order written, namely Thai, Lao, Tai Viet and, soon, New Tai Lue. When customised collations are applied, there are enormous sets for Burmese (in CLDR) and New Tai Lue (not published in CLDR). The latter two have 'logical order exception' final consonants. (The exception here is that the logical order of characters in a word is not the order one wants for sorting.) > I'm not talking about localized features, like for "?" to match "aa" > in Danish locales. I'm talking about matching strings that are > equivalent under canonical and compatibility decompositions. Nor was I. I was talking about the user interface - commands, menus and messages. > As for user sophistication, AFAIR, Microsoft Word finds "?" 
when you > search for "2" by default, so it sounds like Word considers all users > sophisticated enough for that. I think that's a solid enough > precedent to follow. But what switches the match off? Richard. From eliz at gnu.org Fri Feb 20 09:28:36 2015 From: eliz at gnu.org (Eli Zaretskii) Date: Fri, 20 Feb 2015 17:28:36 +0200 Subject: Compatibility decomposition for Hebrew and Greek final letters In-Reply-To: <20150220150134.6fc5663b@JRWUBU2> References: <83d25641d3.fsf@gnu.org> <834mqi3ydr.fsf@gnu.org> <83zj89lkpx.fsf@gnu.org> <20150219220257.6833468f@JRWUBU2> <83wq3dknzj.fsf@gnu.org> <20150220150134.6fc5663b@JRWUBU2> Message-ID: <8361awlhzv.fsf@gnu.org> > Date: Fri, 20 Feb 2015 15:01:34 +0000 > From: Richard Wordingham > > > Sorry, I don't think I follow: what is "processing for search orders" > > to which you allude here? > > The examples in the CLDR root locale and in DUCET are the massive sets > of 'contractions' of consonants with vowels written before the > associated consonant in the scripts where spacing characters are stored > in the order written, namely Thai, Lao, Tai Viet and, soon, New Tai > Lue. When customised collations are applied, there are enormous sets > for Burmese (in CLDR) and New Tai Lue (not published in CLDR). The > latter two have 'logical order exception' final consonants. (The > exception here is that the logical order of characters in a word is not > the order one wants for sorting.) OK, thanks for explaining that. Still, the DUCET data is not insignificant. > > I'm not talking about localized features, like for "?" to match "aa" > > in Danish locales. I'm talking about matching strings that are > > equivalent under canonical and compatibility decompositions. > > Nor was I. I was talking about the user interface - commands, menus > and messages. Ah, that's easy (for now): Emacs doesn't have a localized UI. Everything in the UI is in US English. So this would be Someone Else's Problem. 
> > As for user sophistication, AFAIR, Microsoft Word finds "?" when you > > search for "2" by default, so it sounds like Word considers all users > > sophisticated enough for that. I think that's a solid enough > > precedent to follow. > > But what switches the match off? I'm not sure there _is_ a switch in Word. But my point is different: the above example means an editor should have the capability of matching such strings; whether it can or cannot be switched off is a separate issue (in Emacs, I don't imagine users will settle for not being able to switch it off and on as they see fit). From markus.icu at gmail.com Fri Feb 20 11:49:20 2015 From: markus.icu at gmail.com (Markus Scherer) Date: Fri, 20 Feb 2015 09:49:20 -0800 Subject: Compatibility decomposition for Hebrew and Greek final letters In-Reply-To: <83y4ntkoku.fsf@gnu.org> References: <83d25641d3.fsf@gnu.org> <834mqi3ydr.fsf@gnu.org> <83zj89lkpx.fsf@gnu.org> <83y4ntkoku.fsf@gnu.org> Message-ID: On Thu, Feb 19, 2015 at 11:51 PM, Eli Zaretskii wrote: > I think decomposition to NFKD solves these issues, doesn't it? > Not completely. Judging from your question, you expected more mappings than NFKD has. You might want to try the mappings that are used as input for deriving the DUCET (default Unicode collation): http://www.unicode.org/Public/UCA/latest/decomps.txt For a character-based search, you should still try to work with canonical equivalence, for example by applying the FCD check and normalizing when that fails. http://www.unicode.org/notes/tn5/ Thanks. I've studied that already, and I do know that collation data > can be used for search. But it's still a lot of data that I'd like to > avoid loading, if possible. > Sure, as I said, it depends on what you need and want. FYI, the ICU data file corresponding to the DUCET is about 160kB (for UCA 7.0) and could be reduced if limited to one specific use case, but the collation and string-search code is large and complex. 
Best regards,
markus

From verdy_p at wanadoo.fr Fri Feb 20 17:56:14 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sat, 21 Feb 2015 00:56:14 +0100
Subject: Question about the Sentence_Break property
In-Reply-To: <20150220051430.1401c498@JRWUBU2>
References: <54E6A218.3030500@khwilliamson.com> <20150220051430.1401c498@JRWUBU2>
Message-ID:

2015-02-20 6:14 GMT+01:00 Richard Wordingham <richard.wordingham at ntlworld.com>:

> TUS has a whole section on the issue, namely TUS 7.0.0 Section 5.8.
> One thing that is missing is mention of the convention that a single
> newline character (or CRLF pair) is a line break whereas a doubled
> newline character denotes a paragraph break.

In that case CR or LF characters alone are not "paragraph separators" by themselves unless they are grouped together. Like NEL, they should just be considered as line separators, and the terminology used in UAX 29 rule SB4 is effectively incorrect if what matters here is just the linebreak property. And also in that case, the SB4 rule should effectively include NEL (from the C1 subset).

But as SB4 is only related to sentence breaking, it would be a problem because simple linebreaks are used extremely frequently in the middle of sentences.

What the Sentence break algorithm should say is that there should first be a preprocessing step separating line breaks and paragraph breaks (creating custom entities, similar to collation elements, but encoded internally with a code point out of the standard space), that the rule SB4 would use instead of "Sep | CR | LF". That custom entity should be "Sep" but without the rule defining it, as there are various ways to represent paragraph breaks.

From ritt.ks at gmail.com Fri Feb 20 20:50:53 2015
From: ritt.ks at gmail.com (Konstantin Ritt)
Date: Sat, 21 Feb 2015 06:50:53 +0400
Subject: Question about the Sentence_Break property
In-Reply-To:
References: <54E6A218.3030500@khwilliamson.com> <20150220051430.1401c498@JRWUBU2>
Message-ID:

When UAX9 mentions a paragraph level, it says:

> Paragraphs are divided by the Paragraph Separator or appropriate Newline Function (for guidelines on the handling of CR, LF, and CRLF, see *Section 4.4, Directionality*, and *Section 5.8, Newline Guidelines* of [Unicode]). Paragraphs may also be determined by higher-level protocols: for example, the text in two different cells of a table will be in different paragraphs.

Regards,
Konstantin

From wjgo_10009 at btinternet.com Sat Feb 21 06:46:00 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Sat, 21 Feb 2015 12:46:00 +0000 (GMT)
Subject: [Probably off-topic] Mobile telephone numbers and next of kin contact
Message-ID: <3500040.16871.1424522760849.JavaMail.defaultUser@defaultHost>

[Probably off-topic] Mobile telephone numbers and next of kin contact

This is probably off-topic but in view of the fact that there may well be people on this list who are working in telecommunications companies and/or are on appropriate standards committees or have contacts who are, perhaps readers will not mind.

I have identified a problem that could perhaps be solved, or maybe just greatly reduced, with an addition to the documentation that accompanies a new mobile telephone, perhaps with wording on an industry standard basis next to where the telephone number of the device is stated.

Here is the problem.

A person H has as next of kin a person J, who lives at a different address.

J has provided H with his or her mobile telephone number and H has recorded the number into his or her own medical record.

One day, for whatever reason, J changes his or her mobile telephone and has a new number.

J is busy and it just does not occur to J to inform H of the changed number.
Time passes.

One day, H is taken ill and medical staff try to contact J using the mobile telephone number that is in the medical record of H.

The medical staff cannot contact J and maybe the call reaches someone else from another family if enough time has passed and J's old number has been reassigned from the list of discontinued-use numbers.

So, it could help if every new mobile telephone were to carry a printed message next to where the telephone number of the device is stated, such as the following.

If you are the next of kin of someone, please remember to inform him or her of this new telephone number and mention that he or she needs to have it added into his or her medical record.

Lots of people might make such a notification anyway, but some people may not, and there seems to be no way to know how big the problem of out-dated contact information in medical records is around the world.

William Overington

21 February 2015

From olopierpa at gmail.com Sat Feb 21 11:14:51 2015
From: olopierpa at gmail.com (Pierpaolo Bernardi)
Date: Sat, 21 Feb 2015 18:14:51 +0100
Subject: [Probably off-topic] Mobile telephone numbers and next of kin contact
In-Reply-To: <3500040.16871.1424522760849.JavaMail.defaultUser@defaultHost>
References: <3500040.16871.1424522760849.JavaMail.defaultUser@defaultHost>
Message-ID:

Probably?? Please don't do this.

From public at khwilliamson.com Sat Feb 21 13:10:14 2015
From: public at khwilliamson.com (Karl Williamson)
Date: Sat, 21 Feb 2015 12:10:14 -0700
Subject: Question about the Sentence_Break property
In-Reply-To:
References: <54E6A218.3030500@khwilliamson.com> <20150220051430.1401c498@JRWUBU2>
Message-ID: <54E8D816.9010606@khwilliamson.com>

On 02/20/2015 04:56 PM, Philippe Verdy wrote:
> 2015-02-20 6:14 GMT+01:00 Richard Wordingham:
>
> TUS has a whole section on the issue, namely TUS 7.0.0 Section 5.8.
> One thing that is missing is mention of the convention that a single
> newline character (or CRLF pair) is a line break whereas a doubled
> newline character denotes a paragraph break.
>
>
> In that case CR or LF characters alone are not "paragraph separators" by
> themselves unless they are grouped together. Like NEL, they should just
> be considered as line separators, and the terminology used in UAX 29 rule
> SB4 is effectively incorrect if what matters here is just the linebreak
> property. And also in that case, the SB4 rule should effectively include
> NEL (from the C1 subset).
>
> But as SB4 is only related to sentence breaking, it would be a problem
> because simple linebreaks are used extremely frequently in the middle of
> sentences.
>
> What the Sentence break algorithm should say is that there should first
> be a preprocessing step separating line breaks and paragraph breaks
> (creating custom entities, similar to collation elements, but encoded
> internally with a code point out of the standard space), that the rule
> SB4 would use instead of "Sep | CR | LF". That custom entity should be
> "Sep" but without the rule defining it, as there are various ways to
> represent paragraph breaks.
>

But isn't SB4 contradictory to this from TUS Section 5.8?
R2c In parsing, choose the safest interpretation.

For example, in recommendation R2c an implementer dealing with sentence break heuristics would reason in the following way that it is safer to interpret any NLF as LS:

• Suppose an NLF were interpreted as LS, when it was meant to be PS. Because most paragraphs are terminated with punctuation anyway, this would cause misidentification of sentence boundaries in only a few cases.

• Suppose an NLF were interpreted as PS, when it was meant to be LS. In this case, line breaks would cause sentence breaks, which would result in significant problems with the sentence break heuristics.

It seems to me SB4 is choosing the non-safer way. What am I missing?

From ishida at w3.org Mon Feb 23 13:22:53 2015
From: ishida at w3.org (Richard Ishida)
Date: Mon, 23 Feb 2015 19:22:53 +0000
Subject: Persian counter styles
Message-ID: <54EB7E0D.2090603@w3.org>

at http://www.w3.org/TR/2015/WD-predefined-counter-styles-20150203/#arabic-styles there are two fixed counter styles for Persian which use the sequence

U+0647 ARABIC LETTER HEH
U+200D ZERO WIDTH JOINER

i was wondering whether this is right, or whether that item should actually be

U+06BE ARABIC LETTER HEH DOACHASHMEE

does anyone know?

ri

From slevin at signpuddle.net Tue Feb 24 11:38:55 2015
From: slevin at signpuddle.net (Stephen E Slevinski Jr)
Date: Tue, 24 Feb 2015 11:38:55 -0600
Subject: Fixing the sort order of the SignWriting symbols in Unicode 8
Message-ID: <54ECB72F.7070506@signpuddle.net>

Hi Unicode list,

I am concerned that the SignWriting symbols as defined in Unicode 8 do not sort properly. Making "_fill 1_" and "_rotation 1_" inherent values causes sorting problems.

Without inherent values for "_fill 1_" and "_rotation 1_", the symbols sort properly. Consider these symbols in the correct sort order.
symbol - fill 1 - rotation 1
symbol - fill 1 - rotation 2
symbol - fill 2 - rotation 1
symbol - fill 2 - rotation 2

When "_fill 1_" and "_rotation 1_" are inherent, the symbols above have shorter names that sort incorrectly.

symbol
symbol - rotation 2
symbol - fill 2
symbol - fill 2 - rotation 2

With the above list, "*symbol - fill 2*" will sort before "*symbol - rotation 2*". This is incorrect.

I believe sorting would be fixed by setting the weights in the "DUCET" table so that rotations sort before fills. If this change were made, the SignWriting symbols in Unicode 8 should sort properly.

Regards,
Steve

From verdy_p at wanadoo.fr Tue Feb 24 14:38:05 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Tue, 24 Feb 2015 21:38:05 +0100
Subject: Fixing the sort order of the SignWriting symbols in Unicode 8
In-Reply-To: <54ECB72F.7070506@signpuddle.net>
References: <54ECB72F.7070506@signpuddle.net>
Message-ID:

Just an adjustment of weights, so that "rotation" weights are lower than "fill" weights. The inherent "fill 1" and "rotation 1" can be kept.

This is similar to the collation for case-insensitive sorts that preserve the difference of diacritics, or sorts that swap these levels (all you have to do is to swap arithmetically the ranges of weights by a simple offset).

The DUCET seems to have given a small level to rotation variants by assigning them higher ranges so that they take priority over fill variants, and you'd like the reverse: this is a basic tailoring, and not a problem of inherent values, where "fill 1" and "rotation 1" are made ignorable in all levels (except the last implicit level on the code points in NFD form, then the optional implicit level on original code points in any non-normalized form).

From markus.icu at gmail.com Tue Feb 24 15:20:44 2015
From: markus.icu at gmail.com (Markus Scherer)
Date: Tue, 24 Feb 2015 13:20:44 -0800
Subject: Fixing the sort order of the SignWriting symbols in Unicode 8
In-Reply-To: <54ECB72F.7070506@signpuddle.net>
References: <54ECB72F.7070506@signpuddle.net>
Message-ID:

On Tue, Feb 24, 2015 at 9:38 AM, Stephen E Slevinski Jr <slevin at signpuddle.net> wrote:

> Hi Unicode list,

This is a useful place for discussion, but once the discussion peters out please submit formal feedback: http://www.unicode.org/review/pri285/

> I am concerned that the SignWriting symbols as defined in Unicode 8 do not
> sort properly. Making "*fill 1*" and "*rotation 1*" inherent values
> causes sorting problems.

When you submit formal feedback, then please make it explicit and actionable.
For example, it is not clear to me what you mean by "inherent values". And when you say "symbol - fill 1 - rotation 1", is that a sequence of three characters?

(By the way, this sounds a bit like issues with sorting Hangul LV vs. LVT syllables.)

Best regards,
markus

From richard.wordingham at ntlworld.com Wed Feb 25 17:08:59 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 25 Feb 2015 23:08:59 +0000
Subject: Indic Syllabic Categories
In-Reply-To: <20140517115635.7e03509f@JRWUBU2>
References: <20140509235752.27e23319@JRWUBU2> <20140517115635.7e03509f@JRWUBU2>
Message-ID: <20150225230859.62b9ff2c@JRWUBU2>

On Sat, 17 May 2014 11:56:35 +0100 Richard Wordingham wrote:

> I've reviewed the application of the revised categories as set forth
> in L2/14-126
> (http://www.unicode.org/L2/L2014/14126r-indic-properties.pdf) as
> applied to the Thai, Lao and Tai Tham scripts, and noted a few other
> characters, and come up with the following proposed changes of
> syllabic category.

I've just submitted a slightly different set of changes via the Unicode report function. They were updated to take into account other proposed changes and also Microsoft's new 'Universal Shaping Engine'. The submitted comment follows.

Richard.

I've reviewed the application of the revised categories as set forth in L2/14-126 (http://www.unicode.org/L2/L2014/14126r-indic-properties.pdf) as applied to the Thai, Lao and Tai Tham scripts, and noted a few other characters, and come up with the following proposed changes of syllabic category. I have also taken into account the proposals of Roozbeh Pournader of 24 February 2015 related to work on the Universal Shaping Engine.
I've come up with 3 new characters of category Bindu:

0303 ; Bindu # Mn COMBINING TILDE
0310 ; Bindu # Mn COMBINING CANDRABINDU
1A74 ; Bindu # Mn TAI THAM SIGN MAI KANG (currently Vowel_Dependent)

Note that both U+0ECD LAO NIGGAHITA and U+1A74 function both as Bindu and as Vowel_Dependent.

U+0303 is used in Patani Malay in the Thai script - see UTC document L2/10-451.

U+0310 is used for Sanskrit in Tamil script, according to Indic list email 'Re: Tamil Punctuation', 27/7/12 9:24 +0530 from Shriramana Sharma.

I've found 4 new characters of category Visarga:

0E30 ; Visarga # Lo THAI CHARACTER SARA A
0EB0 ; Visarga # Lo LAO VOWEL SIGN A
1A61 ; Visarga # Mc TAI THAM VOWEL SIGN A
19B0 ; Visarga # Mc (to be Lo) NEW TAI LUE VOWEL SIGN VOWEL SHORTENER

Note that the tone (or voice modulation) character U+1038 MYANMAR SIGN VISARGA is currently classified as Visarga.

U+0E30 is used as visarga in Sanskrit, e.g. in the Royal Institute Dictionary.

The typical sound of the four visargas above is /?/ rather than /h/, and, through a feature of Tai (SW Tai?) phonology, they all have the additional function of shortening a vowel. As a vowel shortener, U+1A61 and U+19B0 may follow a final consonant.

These 4 characters are currently classified as Vowel_Dependent. Except for the Lao script, that usage can easily be interpreted as a modification of the implicit vowel. Modern Lao does not acknowledge the existence of an implicit vowel, so that interpretation may be harder to accept. (Vowel_Dependent U+0EB1 LAO VOWEL SIGN MAI KAN is also a vowel shortener; in the 19th century it was denied that Vowel_Dependent U+0E31 THAI CHARACTER MAI HAN-AKAT was a vowel in Thai.)

U+1A61 occasionally has the sound /k/, especially when used in conjunction with U+1A62 TAI THAM VOWEL SIGN MAI SAT. I think we should regard this as just one of the uses of visarga.

I've found 3 new nuktas, at least, so long as the application of nukta is not restricted to *foreign* consonants.
0331 ; Nukta # Mn COMBINING MACRON BELOW
0359 ; Nukta # Mn COMBINING ASTERISK BELOW
1A7F ; Nukta # Mn TAI THAM COMBINING CRYPTOGRAMMIC DOT

U+0331 is used in Patani Malay in the Thai script - see L2/10-451 and the consonant chart on p16 of http://mlenetwork.org/sites/default/files/Patani%20Malay%20Presentation%20-%20Part%202.pdf.

U+0331 and U+0359 have been used in English-Thai dictionaries to represent English sounds, very much a nukta role.

They were previously classified as 'Other', though there is a proposal to make U+1A7F 'Syllable_Modifier'.

U+0EC8 LAO TONE MAI EK functions as Nukta in Khmu as well as performing its principal rôle of Tone_Mark in Lao.

U+0E3A THAI CHARACTER PHINTHU is used both as Nukta and as Pure_Killer; the latter is its traditional rôle.

I've found 4 new pure killers, all currently classified as 'Other', though there is a proposal to classify U+0E4C (along with U+17CD) as 'Consonant_Killer'. They are:

0E4C ; Pure_Killer # Mn THAI CHARACTER THANTHAKHAT
0ECC ; Pure_Killer # Mn LAO CANCELLATION MARK
1A7C ; Pure_Killer # Mn TAI THAM SIGN KHUEN-LUE KARAN
1A7A ; Pure_Killer # Mn TAI THAM SIGN RA HAAM

U+0E4C THAI CHARACTER THANTHAKHAT and U+0E4E THAI CHARACTER YAMAKKAN once divided the role of vowel killing - U+0E4E formed clusters and U+0E4C removed final vowels. The use of U+0E4C came to be largely restricted to vowels associated with clusters of consonants. Removing the vowel made the final consonant of the cluster silent (spoken Thai does not permit final consonant clusters), and from this effect it has been reinterpreted as a consonant-killer.

U+0ECC probably had the same behaviour as U+0E4C. I don't know if it is still used in Laos - foreign loanwords often don't follow the rules.

The Tai Tham marks are still at the transitional stage - they are sometimes found on final unsubscripted consonants to indicate that they have no vowel. There is an unfortunate overlap with the final consonant mark for (pronunciation necessarily /n/).
The Khuen and Lue form of the final consonant symbol has the same shape as the Thai and Lao form of the pure killer. Consequently U+1A7A serves as Consonant_Final in Tai Khuen and Tai Lue. In Tai Khuen, at least, the use as a final consonant seems to have recently fallen into disfavour, so it seems most appropriate to classify U+1A7A as 'Pure_Killer'.

I noted above that the 'Pure_Killer' U+0E3A THAI CHARACTER PHINTHU also serves as a nukta. I have a vague recollection that U+0E4C THAI CHARACTER THANTHAKHAT serves as a register mark in an orthography for the Chong language, so that would count as an auxiliary rôle as Tone_Mark.

If 'Consonant_Killer' is to be separated from 'Pure_Killer', then we need a separate category 'Dual_Mode_Killer' for U+1A7A and U+1A7C.

It should be noted that U+1A62 TAI THAM VOWEL SIGN MAI SAT serves not only as Vowel_Dependent but also as Consonant_Final. This seems to be chiefly relevant to anyone attempting to deduce the pronunciation from the spelling.

There are 4 characters currently categorised as 'Consonant' which I think are better categorised as 'Vowel':

0E24 ; Vowel # Lo THAI CHARACTER RU
0E26 ; Vowel # Lo THAI CHARACTER LU
1A42 ; Vowel # Lo TAI THAM LETTER RUE
1A44 ; Vowel # Lo TAI THAM LETTER LUE

They serve both as independent and dependent vowels. Note that U+0E24 and U+0E26 may be followed by the length mark U+0E45 THAI CHARACTER LAKKHANGYAO, which is categorised as 'Vowel_Dependent'. I am not aware of any usage of U+0E45 as a true vowel.

The sequence occurs with the same meaning, 'elephant', as U+1AAD. I don't know whether this justifies changing U+1AAD from 'Other' to 'Consonant_Placeholder'.

I've found two new Consonants:

0EBD ; Consonant # Lo LAO SEMIVOWEL SIGN NYO (was Consonant_Medial)
0EDE ; Consonant # Lo LAO LETTER KHMU GO (was Other)

U+0EBD is used as an initial consonant in Khmu, so U+0EBD has been used in all rôles in the Lao script, like U+0EA7 LAO LETTER WO, which is of category Consonant.
For information on Khmu usage, see UTC document L2/10-335 (http://www.unicode.org/L2/L2010/10335r-n3893r-lao-hosken.pdf). The Khmu alphabet chart included backs up the text. (It also shows U+0EC8 LAO TONE MAI EK acting as a Nukta!)

If 'repha' can be used as a general category, including for example Myanmar script kinzi, then there are two arguable new examples, currently categorised as Consonant_Final:

1A58 ; Consonant_Preceding_Repha? # Mn TAI THAM SIGN MAI KANG LAI
1A5A ; Consonant_Succeeding_Repha? # Mn TAI THAM CONSONANT SIGN LOW PA

There are significant issues with U+1A58; while traditionally it behaves as repha/kinzi, some modern styles are better served by treating it as Consonant_Final. It takes some juggling for a single OTL-style rendering engine to be able to render either style depending on the lookups while oblivious to the difference, but it can be done.

I've found 5 new instances of Consonant_Subjoined:

1A57 ; Consonant_Subjoined # Mc TAI THAM CONSONANT SIGN LA TANG LAI
1A5B ; Consonant_Subjoined # Mn TAI THAM CONSONANT SIGN HIGH RATHA OR LOW PA
1A5C ; Consonant_Subjoined # Mn TAI THAM CONSONANT SIGN MA
1A5D ; Consonant_Subjoined # Mn TAI THAM CONSONANT SIGN BA
1A5E ; Consonant_Subjoined # Mn TAI THAM CONSONANT SIGN SA

They were all previously categorised as Consonant_Final.

Note that U+1A57 is an abbreviation. It is derived by the addition of a stroke to the subscript form . Abbreviations of the word _tanglaai_ 'all' using U+1A57 normally include at least , so U+1A57 is not Consonant_Final. An example, apparently spelt , is given in Table 16 at http://www.seasite.niu.edu/tai/TaiLue/graphic%20blends.htm.

The word _nippa:na_ 'nirvana' immediately demonstrates that U+1A5B is not a final consonant.

U+1A5C occurs in Pali proper names ending -mmo , so is clearly not a final consonant.

U+1A5D occurs in Northern Thai principally in one word, whose pronunciation is roughly /k?b??/. U+1A5D is not Consonant_Final in its phonetic effect.
The word is a compound word (or perhaps just a visual compound), formed by chaining two syllables and striking out the duplicated characters. I have a text in which the constituents are to be encoded and TONE-1>, so the chained word may reasonably be encoded U+1A74, U+1A5D, U+1A75> or .

While all my examples of U+1A5E are word final, it seems to differ from on the basis of the room available for it. Both forms are used as a word final consonant. The only Pali consonant cluster ending in /s/ is /ss/, and that is written using U+1A54 TAI THAM LETTER GREAT SA, so a non-final will be rare. (I'm finding /ks/ written with U+1A47 TAI THAM LETTER HIGH SSA due to the application of RUKI.) However, I feel it would be rash to presume that every example of U+1A5E will be a final consonant.

I have one new Consonant_Final:

0EDF ; Consonant_Final # Lo LAO LETTER KHMU NYO (was Consonant)

See UTC document L2/10-335 for evidence.

I have one possible new Consonant_Subjoined:

1A7B ; Consonant_Subjoined # Mn TAI THAM SIGN MAI SAM

The value of its Indic_Matra_Category, if relevant, should be recorded as Top.

U+1A7B is principally a repetition mark, indicating the repetition of a word. As extensions of this role, it can also do at least the following:

(1) Indicate a repeated (not geminate) consonant
(2) Indicate an omitted implicit vowel (one omits an implicit vowel by replacing it with U+1A60)
(3) Indicate an epenthetic vowel (extension of Role 2).

In rôle (1), it serves as a subjoined consonant. In rôles (2) and (3), it serves as a dependent vowel. For a shaper that does not constrain appearance, such as the Universal Shaping Engine, the best categorisation is probably 'Consonant_Subjoined'.

Although U+1A55 TAI THAM CONSONANT SIGN MEDIAL RA and U+1A56 TAI THAM CONSONANT SIGN MEDIAL LA are named as medial consonants, too much should not be read into such a description. Both are, very occasionally, immediately preceded by vowels, and both may be followed by and .
While the latter two sequences most commonly represent vowels, the strictly consonantal cluster starts a few words beginning with the cluster /lw/. This is a behaviour the Universal Shaping Engine of Microsoft currently disallows for medial consonants. We should therefore have:

1A55 ; Consonant_Subjoined # Mc TAI THAM CONSONANT SIGN MEDIAL RA
1A56 ; Consonant_Subjoined # Mn TAI THAM CONSONANT SIGN MEDIAL LA

I actually see no benefits for rendering engines in distinguishing Consonant_Medial and Consonant_Subjoined, though the contrast may help in locating phonetic syllable boundaries.

From shervinafshar at gmail.com Wed Feb 25 18:40:36 2015
From: shervinafshar at gmail.com (Shervin Afshar)
Date: Wed, 25 Feb 2015 16:40:36 -0800
Subject: Persian counter styles
In-Reply-To:
References: <54EB7E0D.2090603@w3.org>
Message-ID:

On Wed, Feb 25, 2015 at 3:46 PM, Behnam Rassi wrote:

> It is right in terms of what is being used. It is wrong linguistically.
> Heh Havvaz (from Abjad alphabet) is not even Heh Dochashmee although it
> looks similar. But it should be a non-joining character altogether.

The distinction made here between Abjad Heh and Arabic Heh is unknown to me. What makes Abjad Heh different from ARABIC LETTER HEH INITIAL FORM (U+FEEB)?

On Tue, Feb 24, 2015 at 11:04 AM, Khaled Hosny wrote:

> I don't know about Persian, but in Arabic isolated Heh is not used in
> math or lists as it can be confused with Arabic-Indic digit five, and
> instead it is always used in initial form in such situations.

I don't believe that the potential confusability between Arabic-Indic digit five and stand-alone Heh implies that it should not be used in writing math. See the mathematical charts (e.g. [1]) with Abjad numeral values in this manuscript in Arabic on astronomy[2] from the 18th century. A modern approach might decide to avoid that usage, but it should not be elevated to an orthographic rule.
[1]: http://pudl.princeton.edu/viewer.php?obj=r781wg07m#page/141/mode/2up [2]: http://pudl.princeton.edu/objects/r781wg07m ? Shervin On Wed, Feb 25, 2015 at 3:46 PM, Behnam Rassi wrote: > It is right in terms of what is being used. It is wrong linguistically. > Heh Havvaz (from Abjad alphabet) is not even Heh Dochashmee although it > looks similar. But it should be a non-joining character altogether. > -behnam > > > On Feb 23, 2015, at 2:22 PM, Richard Ishida wrote: > > > > at > http://www.w3.org/TR/2015/WD-predefined-counter-styles-20150203/#arabic-styles > there are two fixed counter styles for Persian which use the sequence > > > > U+0647 ARABIC LETTER HEH > > U+200D ZERO WIDTH JOINER > > > > i was wondering whether this is right, or whether that item should > actually be > > > > U+06BE ARABIC LETTER HEH DOACHASHMEE > > > > > > does anyone know? > > > > ri > > _______________________________________________ > > Unicode mailing list > > Unicode at unicode.org > > http://unicode.org/mailman/listinfo/unicode > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From dr.khaled.hosny at gmail.com Thu Feb 26 00:12:23 2015 From: dr.khaled.hosny at gmail.com (Khaled Hosny) Date: Thu, 26 Feb 2015 08:12:23 +0200 Subject: Persian counter styles In-Reply-To: References: <54EB7E0D.2090603@w3.org> Message-ID: On Feb 26, 2015 2:41 AM, "Shervin Afshar" wrote: > On Tue, Feb 24, 2015 at 11:04 AM, Khaled Hosny wrote: >> >> I don?t know about Persian, but in Arabic isolated Heh is not used in >> math or lists is it can be confused with Arabic-Indic digit five, and >> instead it is always used in initial form in such situations. > > > I don't believe that the potential confusability between Arabic-Indic digit five and stand-alone Heh implies that it should not be used in writing math. I only stated that it is not used (i.e. 
the current practice); whether it should or shouldn't be used is up to the mathematicians who write that math (and for one, the Arabic Mathematical Alphabetic Symbols block does not have an isolated Heh, though its place is reserved).

Regards,
Khaled

From shervinafshar at gmail.com Thu Feb 26 11:03:39 2015
From: shervinafshar at gmail.com (Shervin Afshar)
Date: Thu, 26 Feb 2015 09:03:39 -0800
Subject: Persian counter styles
In-Reply-To: <88367EA7-3E88-4F38-8FEC-EF4F5A791A6E@me.com>
References: <54EB7E0D.2090603@w3.org> <88367EA7-3E88-4F38-8FEC-EF4F5A791A6E@me.com>
Message-ID:

On Wed, Feb 25, 2015 at 8:58 PM, Behnam Rassi wrote:

> But Heh Havvaz is not meant to join to anything as far as I know.

It might not be commonly in use today, but Abjad numeral values (e.g. 165 = "قسه", 135 = "قله", 55 = "نه", 45 = "مه", 35 = "له", 25 = "كه", etc.) join Abjad Heh (or what you call Heh Havvaz), with a value of 5, to other Abjad letters. So the 35th item in a list would be enumerated as "له". You can also see this usage in manuscripts which number pages with Abjad numerals.

Shervin

On Wed, Feb 25, 2015 at 8:58 PM, Behnam Rassi wrote:

> On Feb 25, 2015, at 7:40 PM, Shervin Afshar wrote:
>
> > The distinction made here between Abjad Heh and Arabic Heh is unknown to
> > me. What makes Abjad Heh different from ARABIC LETTER HEH INITIAL FORM
> > (U+FEEB)?
>
> Heh Initial Form is an arbitrary invention of typesetting era, as all
> presentation forms for that matter. They do not represent any established
> character but the presentation forms of the joining letter associated with
> them. Heh Havvaz on the other hand, is a defined character used for
> enumerating and abbreviating. It has nothing to do with an arbitrary
> invention even if the appearance has some [poor] similarity. The real
> similarity as Richard noted is with Heh Dochashmee. But Heh Havvaz is not
> meant to join to anything as far as I know.
> -b
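
The Abjad numbering discussed in this thread, where Heh carries the value 5 and combines with other letters so that 35 is written as lam followed by heh, can be sketched roughly as follows. This is only an illustration, not something any poster supplied: the names `ABJAD_VALUES`, `to_abjad` and `code_points` are hypothetical, the table assumes the common eastern (Mashriqi) letter values, it uses ARABIC LETTER KAF (U+0643) rather than the Persian keheh, and it says nothing about which code point or joining behaviour is appropriate for Heh in a counter style.

```python
# Minimal sketch of additive Abjad numerals (assumed eastern/Mashriqi values).
# Each entry is (value, letter); letters are plain Arabic-block code points.
ABJAD_VALUES = [
    (1000, "\u063A"),  # GHAIN
    (900, "\u0638"),   # ZAH
    (800, "\u0636"),   # DAD
    (700, "\u0630"),   # THAL
    (600, "\u062E"),   # KHAH
    (500, "\u062B"),   # THEH
    (400, "\u062A"),   # TEH
    (300, "\u0634"),   # SHEEN
    (200, "\u0631"),   # REH
    (100, "\u0642"),   # QAF
    (90, "\u0635"),    # SAD
    (80, "\u0641"),    # FEH
    (70, "\u0639"),    # AIN
    (60, "\u0633"),    # SEEN
    (50, "\u0646"),    # NOON
    (40, "\u0645"),    # MEEM
    (30, "\u0644"),    # LAM
    (20, "\u0643"),    # KAF (assumption: Arabic kaf, not U+06A9)
    (10, "\u064A"),    # YEH
    (9, "\u0637"),     # TAH
    (8, "\u062D"),     # HAH
    (7, "\u0632"),     # ZAIN
    (6, "\u0648"),     # WAW
    (5, "\u0647"),     # HEH (the "Abjad Heh" under discussion)
    (4, "\u062F"),     # DAL
    (3, "\u062C"),     # JEEM
    (2, "\u0628"),     # BEH
    (1, "\u0627"),     # ALEF
]


def to_abjad(n: int) -> str:
    """Return the letters for n, largest value first, each used at most once."""
    if not 1 <= n <= 1999:
        raise ValueError("this sketch only covers 1..1999")
    letters = []
    for value, letter in ABJAD_VALUES:
        if n >= value:
            letters.append(letter)
            n -= value
    return "".join(letters)


def code_points(text: str) -> str:
    """Format a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


print(code_points(to_abjad(35)))   # U+0644 U+0647
print(code_points(to_abjad(165)))  # U+0642 U+0633 U+0647
```

The output is given as code points so that the result does not depend on how a font or shaping engine joins the letters, which is the point in dispute above.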