From unicode at unicode.org Sat Jul 1 01:51:00 2017
From: unicode at unicode.org (a.lukyanov via Unicode)
Date: Sat, 01 Jul 2017 09:51:00 +0300
Subject: LATIN CAPITAL LETTER SHARP S officially recognized
In-Reply-To:
References:
Message-ID: <59574654.7090409@yspu.org>

Is it possible to design fonts that will render ẞ as SS?

So we could choose between ẞ and SS by just selecting the proper font, without changing the text itself.

Or perhaps there will be a "font feature" to select this rendering within the same font.

From unicode at unicode.org Sat Jul 1 04:06:07 2017
From: unicode at unicode.org (David Faulks via Unicode)
Date: Sat, 1 Jul 2017 09:06:07 +0000 (UTC)
Subject: LATIN CAPITAL LETTER SHARP S officially recognized
References: <2089202539.2799673.1498899967933.ref@mail.yahoo.com>
Message-ID: <2089202539.2799673.1498899967933@mail.yahoo.com>

I think, and others agree, that this is a bad thing. Those who want SS can simply use 'S' and 'S', ẞ was encoded for those who wanted to use a capital form of ß. They would be annoyed if they found that the typeface they wanted subverted their intentions.

--------------------------------------------
On Sat, 7/1/17, a.lukyanov via Unicode wrote:

Subject: Re: LATIN CAPITAL LETTER SHARP S officially recognized
To: unicode at unicode.org
Received: Saturday, July 1, 2017, 2:51 AM

Is it possible to design fonts that will render ẞ as SS?

So we could choose between ẞ and SS by just selecting the proper font, without changing the text itself.

Or perhaps there will be a "font feature" to select this rendering within the same font.
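
The point made in the reply above, that the choice between SS and the capital sharp s can be made in the text rather than in the font, can be sketched in a few lines of Python. This is only a minimal illustration, not anything proposed in the thread: the helper name `upper_german` and its `capital_sharp_s` parameter are hypothetical, and the sketch relies on Python's built-in `str.upper()` and `str.lower()`, which map U+00DF to "SS" and U+1E9E to U+00DF respectively.

```python
import unicodedata

CAPITAL_SHARP_S = "\u1E9E"  # LATIN CAPITAL LETTER SHARP S
SMALL_SHARP_S = "\u00DF"    # LATIN SMALL LETTER SHARP S


def upper_german(text, capital_sharp_s=False):
    """Uppercase text, choosing how the small sharp s is capitalised.

    Hypothetical helper, for illustration only.
    """
    if capital_sharp_s:
        # Swap in U+1E9E first; str.upper() leaves that character unchanged.
        text = text.replace(SMALL_SHARP_S, CAPITAL_SHARP_S)
    # Python's default uppercasing turns any remaining U+00DF into "SS".
    return text.upper()


if __name__ == "__main__":
    print(unicodedata.name(CAPITAL_SHARP_S))
    print(upper_german("Straße"))
    print(upper_german("Straße", capital_sharp_s=True))
    # Lowercasing the capital sharp s gives back the small sharp s.
    print(upper_german("Straße", capital_sharp_s=True).lower())
```

With this approach the stored text records which capital form the writer intended, so it survives a change of typeface.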
From unicode at unicode.org Sat Jul 1 04:34:56 2017
From: unicode at unicode.org (Werner LEMBERG via Unicode)
Date: Sat, 01 Jul 2017 11:34:56 +0200 (CEST)
Subject: LATIN CAPITAL LETTER SHARP S officially recognized
In-Reply-To: <2089202539.2799673.1498899967933@mail.yahoo.com>
References: <2089202539.2799673.1498899967933.ref@mail.yahoo.com> <2089202539.2799673.1498899967933@mail.yahoo.com>
Message-ID: <20170701.113456.1081815090830751152.wl@gnu.org>

> > Is it possible to design fonts that will render ẞ as SS?
> >
> > So we could choose between ẞ and SS by just selecting the proper
> > font, without changing the text itself.
>
> I think, and others agree, that this is a bad thing. Those who want
> SS can simply use 'S' and 'S', ẞ was encoded for those who wanted to
> use a capital form of ß. They would be annoyed if they found that
> the typeface they wanted subverted their intentions.

It's even more complicated. Take for example the word `Straße' (street), which gets capitalized as `STRASSE'. In Germany and Austria this word gets hyphenated as `STRA-SSE' (since hyphenation is not influenced by the ß→SS substitution). However, in Switzerland it gets hyphenated as `STRAS-SE', since Swiss German doesn't use ß; instead, `ss' gets treated as a normal double consonant.

Werner

From unicode at unicode.org Sat Jul 1 06:49:20 2017
From: unicode at unicode.org (Michael Everson via Unicode)
Date: Sat, 1 Jul 2017 12:49:20 +0100
Subject: LATIN CAPITAL LETTER SHARP S officially recognized
In-Reply-To: <20170701.113456.1081815090830751152.wl@gnu.org>
References: <2089202539.2799673.1498899967933.ref@mail.yahoo.com> <2089202539.2799673.1498899967933@mail.yahoo.com> <20170701.113456.1081815090830751152.wl@gnu.org>
Message-ID: <963A2E34-D0C7-48A9-ACD1-A90B605E9CE6@evertype.com>

On 1 Jul 2017, at 10:34, Werner LEMBERG via Unicode wrote:

> It's even more complicated. Take for example the word `Straße'
> (street), which gets capitalized as `STRASSE'.

Or as STRAẞE.
> In Germany and Austria this word gets hyphenated as `STRA-SSE' (since hyphenation is not
> influenced by the ß→SS substitution). However, in Switzerland it gets
> hyphenated as `STRAS-SE', since Swiss German doesn't use ß; instead,
> `ss' gets treated as a normal double consonant.

It would be hyphenated STRA-ẞE in any case.

Michael Everson

From unicode at unicode.org Sat Jul 1 08:36:52 2017
From: unicode at unicode.org (Itai Berli via Unicode)
Date: Sat, 1 Jul 2017 16:36:52 +0300
Subject: Emacs' implementation of the bidirectional algorithm
Message-ID:

Emacs claims to fully conform to the Unicode Bidirectional Algorithm 8.0.0 (see sections 22.19 'Bidirectional Editing' and 37.26 'Bidirectional Display' of the Emacs manual), yet I have noticed some behavior that makes me question this claim.

I'll appreciate the opinion of others, one way or the other.

For each of the following three situations, I wish to know: Is Emacs' behavior consistent with the UBA? If it is, I'd like to know whether you find this behavior in line with the 'spirit' of the UBA, and with common sense.

1. Paragraph boundaries. According to the Emacs manual (section 22.19), "Paragraph boundaries are empty lines, i.e., lines consisting entirely of whitespace characters." The following screenshot shows this behavior in action: http://imgur.com/3eyrUfA

2. Visualization of explicit bidi characters. According to the Emacs manual (section 22.19), "In a GUI session, the lrm and rlm characters display as very thin blank characters; on text terminals they display as blanks." The following screenshot shows this behavior in action. There are three bidi marks (LRI, PDI, LRM) between the two left-most x's. http://imgur.com/VD3Lvsn

3. Line wrapping. The following screenshot shows the line-breaking algorithm in action. The paragraph starts with two Hebrew words followed by the beginning of Abraham Lincoln's Gettysburg Address. The English text flows from the bottom to the top.
http://imgur.com/Bckn7zP

Possible reasons why these behaviors are reasonable and consistent with the standard.

1. Paragraph boundaries. The UBA allows applications to employ higher-level protocols when deciding on base paragraph direction. See section 4.3 and specifically clause HL1 there.

2. Visualization of explicit bidi characters. (a) The UBA also allows displaying the bidi characters. See section 5.2. (b) This is just the default; it can be customized like every other character's glyph.

3. Line wrapping. The remedy is simple: break long lines into shorter ones by inserting newlines.

From unicode at unicode.org Sat Jul 1 11:39:47 2017
From: unicode at unicode.org (Eli Zaretskii via Unicode)
Date: Sat, 01 Jul 2017 19:39:47 +0300
Subject: Emacs' implementation of the bidirectional algorithm
In-Reply-To: (message from Itai Berli via Unicode on Sat, 1 Jul 2017 16:36:52 +0300)
References:
Message-ID: <83fuegp0vg.fsf@gnu.org>

> Date: Sat, 1 Jul 2017 16:36:52 +0300
> From: Itai Berli via Unicode
>
> Emacs claims to fully conform to the Unicode Bidirectional Algorithm
> 8.0.0 (see sections 22.19 'Bidirectional Editing' and 37.26
> 'Bidirectional Display' of the Emacs manual), yet I have noticed some
> behavior that makes me question this claim.
>
> I'll appreciate the opinion of others, one way or the other.
>
> For each of the following three situations, I wish to know: Is Emacs'
> behavior consistent with the UBA? If it is, I'd like to know whether
> you find this behavior in line with the 'spirit' of the UBA, and with
> common sense.
>
> 1. Paragraph boundaries. According to the Emacs manual (section 22.19),
> "Paragraph boundaries are empty lines, i.e., lines consisting entirely
> of whitespace characters." The following screenshot shows this
> behavior in action: http://imgur.com/3eyrUfA
>
> 2. Visualization of explicit bidi characters.
> According to the Emacs
> manual (section 22.19), "In a GUI session, the lrm and rlm characters
> display as very thin blank characters; on text terminals they display
> as blanks." The following screenshot shows this behavior in action.
> There are three bidi marks (LRI, PDI, LRM) between the two left-most
> x's. http://imgur.com/VD3Lvsn
>
> 3. Line wrapping. The following screenshot shows the line-breaking
> algorithm in action. The paragraph starts with two Hebrew words
> followed by the beginning of Abraham Lincoln's Gettysburg Address. The
> English text flows from the bottom to the top.
> http://imgur.com/Bckn7zP

Item 3 doesn't conform to what section 3.4 of the UBA says. The reason is that satisfying this requirement would require the Emacs display engine to be redesigned.

The other items don't violate the UBA, IMO. They follow the higher-level protocols clause in HL1, and section 5.2, which describes the optional retaining of directional control characters in the buffer and on display.

From unicode at unicode.org Sat Jul 1 13:19:45 2017
From: unicode at unicode.org (Eli Zaretskii via Unicode)
Date: Sat, 01 Jul 2017 21:19:45 +0300
Subject: Emacs' implementation of the bidirectional algorithm
In-Reply-To: (message from Itai Berli via Unicode on Sat, 1 Jul 2017 16:36:52 +0300)
References:
Message-ID: <83efu0ow8u.fsf@gnu.org>

> Date: Sat, 1 Jul 2017 16:36:52 +0300
> From: Itai Berli via Unicode
>
> Emacs claims to fully conform to the Unicode Bidirectional Algorithm
> 8.0.0 (see sections 22.19 'Bidirectional Editing' and 37.26
> 'Bidirectional Display' of the Emacs manual)

This is somewhat inaccurate. For the record, the actual text of section 22.19 of the Emacs User manual is:

  Emacs implements the Unicode Bidirectional Algorithm described in the Unicode Standard Annex #9, for reordering of bidirectional text for display.
The actual text of section 37.26 of the Emacs Lisp Reference manual is:

  In performing this "bidirectional reordering", Emacs follows the Unicode Bidirectional Algorithm (a.k.a. UBA), which is described in Annex #9 of the Unicode standard (). Emacs provides a "Full Bidirectionality" class implementation of the UBA, consistent with the requirements of the Unicode Standard v8.0.

The "Full Bidirectionality class" part refers to section 4.2 of UAX#9, and specifically to the fact that all of the explicit directional formatting characters are supported, including the isolates.

From unicode at unicode.org Sun Jul 2 07:49:20 2017
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 2 Jul 2017 13:49:20 +0100
Subject: LATIN CAPITAL LETTER SHARP S officially recognized
In-Reply-To: <59574654.7090409@yspu.org>
References: <59574654.7090409@yspu.org>
Message-ID: <20170702134920.5309b950@JRWUBU2>

On Sat, 01 Jul 2017 09:51:00 +0300
"a.lukyanov via Unicode" wrote:

> Is it possible to design fonts that will render ẞ as SS?
>
> So we could choose between ẞ and SS by just selecting the proper
> font, without changing the text itself.
>
> Or perhaps there will be a "font feature" to select this rendering
> within the same font.

I believe this sort of feature is being deprecated. (The deprecated features include altv, crcy, dflt, jajp, j0p03, kokr, vivn, zhcn and zntw). It belongs more in the barely tolerated realm of transliteration fonts.

There are, however, the features cv01 to cv99 that may be used on a font-by-font basis - a sort of Private Use Area for features.

Richard.

From unicode at unicode.org Sun Jul 2 10:59:12 2017
From: unicode at unicode.org (=?UTF-8?Q?=22J=C3=B6rg_Knappen=22?= via Unicode)
Date: Sun, 2 Jul 2017 17:59:12 +0200
Subject: Aw: Re: LATIN CAPITAL LETTER SHARP S officially recognized
In-Reply-To: <59574654.7090409@yspu.org>
References: <59574654.7090409@yspu.org>
Message-ID:

An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Sun Jul 2 11:07:08 2017
From: unicode at unicode.org (Michael Everson via Unicode)
Date: Sun, 2 Jul 2017 17:07:08 +0100
Subject: LATIN CAPITAL LETTER SHARP S officially recognized
In-Reply-To:
References: <59574654.7090409@yspu.org>
Message-ID: <52C029A4-D136-46ED-AB80-FBAA71D1E191@evertype.com>

Now that it has been added, however, the situation is different.

> On 2 Jul 2017, at 16:59, Jörg Knappen via Unicode wrote:
>
> > Is it possible to design fonts that will render ẞ as SS?
>
> In fact, that happened long before the capital letter sharp s was added to Unicode: the T1 encoding (aka Cork encoding) of LaTeX
> has done this since 1990. The reason for this was correct hyphenation for German words rendered in all caps.
>
> --Jörg Knappen

From unicode at unicode.org Mon Jul 3 02:43:37 2017
From: unicode at unicode.org (Alastair Houghton via Unicode)
Date: Mon, 3 Jul 2017 08:43:37 +0100
Subject: LATIN CAPITAL LETTER SHARP S officially recognized
In-Reply-To:
References: <59574654.7090409@yspu.org>
Message-ID: <6063F267-BF9A-42E3-B679-E48BACE47541@alastairs-place.net>

On 2 Jul 2017, at 16:59, Jörg Knappen via Unicode wrote:
>
> > Is it possible to design fonts that will render ẞ as SS?
>
> In fact, that happened long before the capital letter sharp s was added to Unicode: the T1 encoding (aka Cork encoding) of LaTeX
> has done this since 1990. The reason for this was correct hyphenation for German words rendered in all caps.

Wasn't there also some oddity relating to hyphenation and 'ss'/'SS' in general? I seem to recall that it used to be the case that you ended up with more 's's than you started with when hyphenating a word containing 'ss'?

Kind regards,

Alastair.
--
http://alastairs-place.net

From unicode at unicode.org Mon Jul 3 11:05:28 2017
From: unicode at unicode.org (=?UTF-8?Q?=22J=C3=B6rg_Knappen=22?= via Unicode)
Date: Mon, 3 Jul 2017 18:05:28 +0200
Subject: Aw: Re: LATIN CAPITAL LETTER SHARP S officially recognized
In-Reply-To: <6063F267-BF9A-42E3-B679-E48BACE47541@alastairs-place.net>
References: <59574654.7090409@yspu.org> <6063F267-BF9A-42E3-B679-E48BACE47541@alastairs-place.net>
Message-ID:

An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Mon Jul 3 11:31:06 2017
From: unicode at unicode.org (Otto Stolz via Unicode)
Date: Mon, 3 Jul 2017 18:31:06 +0200
Subject: LATIN CAPITAL LETTER SHARP S officially recognized
In-Reply-To: <2acb5df6-1f56-fe59-bff2-2dfb67637f3c@uni-konstanz.de>
References: <59574654.7090409@yspu.org> <6063F267-BF9A-42E3-B679-E48BACE47541@alastairs-place.net> <2acb5df6-1f56-fe59-bff2-2dfb67637f3c@uni-konstanz.de>
Message-ID:

Hello,

on 2017-07-03 at 18:16 I wrote:
> This rule did hold for all consonants, there's nothing
> particular about double-s.

On 2017-07-03 at 18:05 Jörg Knappen had written:
> the hyphenation oddity … never affected the letter s.

Jörg is right. I forgot the additional rule that you had to spell "ß" instead of "ss" at the end of every constituent of a compound word, so the rule I reported would never be applied to "ss". Also, the "ss" → "ß" rule was dropped by the spelling reform of 1996.

Btw., the dropping of said ß rule led to much controversy during the '90s. Most people were not aware that that very rule had been introduced by the penultimate spelling reform, in 1901.
Best wishes,
Otto

From unicode at unicode.org Mon Jul 3 11:49:46 2017
From: unicode at unicode.org (Werner LEMBERG via Unicode)
Date: Mon, 03 Jul 2017 18:49:46 +0200 (CEST)
Subject: Aw: Re: LATIN CAPITAL LETTER SHARP S officially recognized
In-Reply-To:
References: <6063F267-BF9A-42E3-B679-E48BACE47541@alastairs-place.net>
Message-ID: <20170703.184946.1082299263384367210.wl@gnu.org>

> No, the hyphenation oddity involving the addition of letters with
> hyphenation (or, to be more precise, to suppress letters in
> unhyphenated words) never affected the letter s.

I'm not sure that this is really true. As far as I know, `sss' in Swiss German was handled similarly to other triple consonants before the 1996 spelling reform. In other words, you would have written

  Abschlussatz

(`closing sentence') instead of

  Abschlusssatz

and it would have been hyphenated as

  Abschluss-satz

Werner

From unicode at unicode.org Mon Jul 3 12:01:38 2017
From: unicode at unicode.org (Otto Stolz via Unicode)
Date: Mon, 3 Jul 2017 19:01:38 +0200
Subject: LATIN CAPITAL LETTER SHARP S officially recognized
In-Reply-To: <665C3878-485E-4838-9706-E592AC79EBCE@evertype.com>
References: <307c150c-35b3-53ea-867e-72d6fa092184@uni-konstanz.de> <665C3878-485E-4838-9706-E592AC79EBCE@evertype.com>
Message-ID:

Hello,

on 2017-06-30 at 17:34 Michael Everson wrote:
> It would be sensible to case-map ẞ to ß however.

Since German is the only language using "ß" (if I am not mistaken), Unicode should comply with the official German orthographic rules with respect to this letter.

As I have reported to this list, § 25 E3 of the official German spelling rules clearly gives preference to "SS" as the uppercase equivalent for "ß". And before the latest (2017) update of those rules, "ẞ" was not allowed at all.
Best wishes,
Otto

From unicode at unicode.org Mon Jul 3 12:29:56 2017
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Mon, 03 Jul 2017 10:29:56 -0700
Subject: LATIN CAPITAL LETTER SHARP S officially recognized
Message-ID: <20170703102956.665a7a7059d7ee80bb4d670165c8327d.56e1df46d0.wbe@email03.godaddy.com>

a.lukyanov wrote:

> Is it possible to design fonts that will render ẞ as SS?
>
> So we could choose between ẞ and SS by just selecting the proper font,
> without changing the text itself.
>
> Or perhaps there will be a "font feature" to select this rendering
> within the same font.

I thought that was one of the main reasons we had Unicode: so we would no longer have to rely on particular fonts, or magic font behavior, to get character identities we expected and could interchange reliably.

--
Doug Ewell | Thornton, CO, US | ewellic.org

From unicode at unicode.org Mon Jul 3 13:15:20 2017
From: unicode at unicode.org (Gerrit Ansmann via Unicode)
Date: Mon, 3 Jul 2017 20:15:20 +0200
Subject: LATIN CAPITAL LETTER SHARP S officially recognized
In-Reply-To:
References: <307c150c-35b3-53ea-867e-72d6fa092184@uni-konstanz.de> <665C3878-485E-4838-9706-E592AC79EBCE@evertype.com>
Message-ID: <34fa3eed-18d3-7c2f-3e1a-828b5b7c5df2@uni-bonn.de>

On 03.07.2017 19:01, Otto Stolz via Unicode wrote:

> Since German is the only language using "ß" (if I am not mistaken), […]

Some old Sorbian (blackletter) orthographies also employed the ß. It was also used at the beginning of words where it was capitalised to S? at the beginning of sentences or similar. I am not aware of all-caps being used (which was very rare in blackletter in general).
From unicode at unicode.org Tue Jul 4 05:19:18 2017
From: unicode at unicode.org (Otto Stolz via Unicode)
Date: Tue, 4 Jul 2017 12:19:18 +0200
Subject: LATIN CAPITAL LETTER SHARP S officially recognized
In-Reply-To: <34fa3eed-18d3-7c2f-3e1a-828b5b7c5df2@uni-bonn.de>
References: <307c150c-35b3-53ea-867e-72d6fa092184@uni-konstanz.de> <665C3878-485E-4838-9706-E592AC79EBCE@evertype.com> <34fa3eed-18d3-7c2f-3e1a-828b5b7c5df2@uni-bonn.de>
Message-ID: <151d20c8-89d1-c9f9-3ffa-ad401878f985@uni-konstanz.de>

Hello,

on 03.07.2017 19:01, Otto Stolz via Unicode wrote:
> Since German is the only language using "ß" (if I am not mistaken), […]

On 2017-07-03 at 20:15 Gerrit Ansmann wrote:
> Some old Sorbian (blackletter) orthographies also employed the ß. It was
> also used at the beginning of words where it was capitalised to S? at
> the beginning of sentences or similar.

I was referring to contemporary writing systems. Indeed, several east European languages (including, e. g. Latvian) were written in blackletter, with German sound-letter correspondence, before they developed their own writing systems. Thanks for pointing to this particular uppercasing rule.

I had not thought of Yiddish, though. This used to be written with Hebrew letters (plus some particular ligatures). Usually, it is transliterated into the Latin script according to the YIVO rules of 1936. In Germany, there is an alternative transcription in use, defined by Ronald Lötzsch in 1990. The latter has the "ß" also at the beginning of words. However, there is no upper-case equivalent, as Yiddish has no case distinction, hence all Yiddish letters are transcribed to lower-case Latin, even at the beginning of a sentence.

> I am not aware of all-caps being
> used (which was very rare in blackletter in general).

The only word to be printed in blackletter all-caps was - as far as I know - "der HERR", or "der HErr", meaning "the Lord" (in texts from the bible).
In general, blackletter capitals are not designed for all-caps, so that would look disgusting. Hence the form "HErr", which is a bit more readable.

Best wishes,
Otto

From unicode at unicode.org Tue Jul 4 07:32:43 2017
From: unicode at unicode.org (Gerrit Ansmann via Unicode)
Date: Tue, 4 Jul 2017 14:32:43 +0200
Subject: LATIN CAPITAL LETTER SHARP S officially recognized
In-Reply-To: <151d20c8-89d1-c9f9-3ffa-ad401878f985@uni-konstanz.de>
References: <307c150c-35b3-53ea-867e-72d6fa092184@uni-konstanz.de> <665C3878-485E-4838-9706-E592AC79EBCE@evertype.com> <34fa3eed-18d3-7c2f-3e1a-828b5b7c5df2@uni-bonn.de> <151d20c8-89d1-c9f9-3ffa-ad401878f985@uni-konstanz.de>
Message-ID: <3b0847cc-6de1-64ba-ec10-df7a3daa41b8@uni-bonn.de>

On 04.07.2017 12:19, Otto Stolz via Unicode wrote:

> I was referring to contemporary writing systems. Indeed, several east European languages (including, e. g. Latvian) were written in blackletter, with German sound-letter correspondence, before they developed their own writing systems.

Sure. It's nothing that needs to be taken into account, if you ask me.

> The only word to be printed in blackletter all-caps was - as far as I know - "der HERR", or "der HErr", meaning "the Lord" (in texts from the bible). In general, blackletter capitals are not designed for all-caps, so that would look disgusting. Hence the form "HErr", which is a bit more readable.

You can rarely find blackletter all-caps on title pages, e.g.:

https://commons.wikimedia.org/wiki/File:Die_Lesung_derer_Romans,_als_ein_sehr_bedenkliches_Mittel_seine_Schreibart_zu_verbessern.djvu

(While the word in all-caps is "Herr", it is here used in the meaning of "mister" and not "the Lord".) Most often this happens to place names.

From unicode at unicode.org Wed Jul 5 12:01:25 2017
From: unicode at unicode.org (=?utf-8?Q?Daniel_B=C3=BCnzli?= via Unicode)
Date: Wed, 5 Jul 2017 19:01:25 +0200
Subject: emoji props in the ucdxml ?
Message-ID:

Hello,
I know the emoji properties [1] are not formally part of the UCD (not sure exactly why though), but are there any plans to integrate the data in the ucdxml [2] (possibly as separate files)?

Thanks,

Daniel

[1] http://www.unicode.org/reports/tr51/#Emoji_Properties_and_Data_Files
[2] http://www.unicode.org/reports/tr42/

From unicode at unicode.org Wed Jul 5 13:59:31 2017
From: unicode at unicode.org (Ken Whistler via Unicode)
Date: Wed, 5 Jul 2017 11:59:31 -0700
Subject: emoji props in the ucdxml ?
In-Reply-To:
References:
Message-ID: <3bd60000-4e1f-5afa-adf7-dd5a8537521b@att.net>

On 7/5/2017 10:01 AM, Daniel Bünzli via Unicode wrote:
> I know the emoji properties [1] are not formally part of the UCD (not sure exactly why though),

Because they are maintained as part of an independent standard now (UTS #51), which is still on track to have a faster turnaround -- and hence faster data updates -- not synched with the annual versions of the Unicode Standard. Hence they cannot be formally a part of the UCD -- unless the entire Unicode Standard were going to be churned on a faster cycle as well.

> but are there any plans to integrate the data in the ucdxml [2] (possibly as separate files)?

No. Not unless and until they become formally part of the UCD.

--Ken

From unicode at unicode.org Wed Jul 5 14:37:16 2017
From: unicode at unicode.org (Manuel Strehl via Unicode)
Date: Wed, 5 Jul 2017 21:37:16 +0200
Subject: emoji props in the ucdxml ?
In-Reply-To: <3bd60000-4e1f-5afa-adf7-dd5a8537521b@att.net>
References: <3bd60000-4e1f-5afa-adf7-dd5a8537521b@att.net>
Message-ID:

>> but are there any plans to integrate the data in the ucdxml [2]
>> (possibly as separate files) ?
>
> No. Not unless and until they become formally part of the UCD.

In this context: Would it be possible for the maintainers of the TR #51 data files to add a symlink "latest" under unicode.org/Public/emoji/latest like there is for the UCD?
That would be a tremendous time saver, at least for me, having a constant URL to fetch the latest Emoji data from.

Who should I ask for such a link?

Cheers,
Manuel

From unicode at unicode.org Wed Jul 5 18:43:29 2017
From: unicode at unicode.org (Simon Cozens via Unicode)
Date: Thu, 6 Jul 2017 09:43:29 +1000
Subject: Algorithms for Unicode script detection
Message-ID: <6bd1ce8d-e299-6983-93a6-62ddc19f20ee@simon-cozens.org>

I want to segment a Unicode text into runs according to their script. I've had a look through UAX#24 in the hope of finding a standard algorithm for doing this, but there isn't one specified. The implementation section gives some good pointers for what to be careful with (paired punctuation, etc.) but I can't find a step-by-step algorithm similar to the bidi algorithm or collation algorithm.

Equally, I don't see anything in ICU that segments into script-based runs. You can get script properties, but that doesn't help you resolve common characters in the context of a run.

Does anyone know of an open-source algorithm for doing this?

From unicode at unicode.org Wed Jul 5 18:59:26 2017
From: unicode at unicode.org (Khaled Hosny via Unicode)
Date: Thu, 6 Jul 2017 01:59:26 +0200
Subject: Algorithms for Unicode script detection
In-Reply-To: <6bd1ce8d-e299-6983-93a6-62ddc19f20ee@simon-cozens.org>
References: <6bd1ce8d-e299-6983-93a6-62ddc19f20ee@simon-cozens.org>
Message-ID: <20170705235926.GB1637@macbook.localdomain>

On Thu, Jul 06, 2017 at 09:43:29AM +1000, Simon Cozens via Unicode wrote:
> I want to segment a Unicode text into runs according to their script.
> I've had a look through UAX#24 in the hope of finding a standard
> algorithm for doing this, but there isn't one specified. The
> implementation section gives some good pointers for what to be careful
> with (paired punctuation, etc.) but I can't find a step-by-step
> algorithm similar to the bidi algorithm or collation algorithm.
>
> Equally, I don't see anything in ICU that segments into script-based
> runs. You can get script properties, but that doesn't help you resolve
> common characters in the context of a run.
>
> Does anyone know of an open-source algorithm for doing this?

There is source/extra/scrptrun/ in the ICU source tree (but not part of the API); apparently it is used by its ParagraphLayout library. (A copy of this code is used by Pango, and another copy is used by LibreOffice.)

Regards,
Khaled

From unicode at unicode.org Wed Jul 5 20:04:33 2017
From: unicode at unicode.org (Ken Whistler via Unicode)
Date: Wed, 5 Jul 2017 18:04:33 -0700
Subject: emoji props in the ucdxml ?
In-Reply-To:
References: <3bd60000-4e1f-5afa-adf7-dd5a8537521b@att.net>
Message-ID: <8c3bc827-19ee-7a9b-5085-4157528622d2@att.net>

Manuel,

I suspect that such a link may already be in the works for the /Public/emoji/ data directory. But if you want to make sure your suggestion is reviewed by the UTC, you should submit it via the contact form:

http://www.unicode.org/reporting.html

--Ken

On 7/5/2017 12:37 PM, Manuel Strehl via Unicode wrote:
>>> but are there any plans to integrate the data in the ucdxml [2]
>>> (possibly as separate files) ?
>> No. Not unless and until they become formally part of the UCD.
> In this context: Would it be possible for the maintainers of the TR #51
> data files to add a symlink "latest" under
> unicode.org/Public/emoji/latest like there is for the UCD? That would be
> a tremendous time saver, at least for me, having a constant URL to fetch
> the latest Emoji data from.
>
> Who should I ask for such a link?
>
> Cheers,
> Manuel
>

From unicode at unicode.org Thu Jul 6 13:05:03 2017
From: unicode at unicode.org (=?utf-8?Q?Daniel_B=C3=BCnzli?= via Unicode)
Date: Thu, 6 Jul 2017 20:05:03 +0200
Subject: emoji props in the ucdxml ?
In-Reply-To: <3bd60000-4e1f-5afa-adf7-dd5a8537521b@att.net>
References: <3bd60000-4e1f-5afa-adf7-dd5a8537521b@att.net>
Message-ID:

Ken,
Thanks for your explanations.

I would just like to note that UAX42 expresses a general xml data format to associate properties to code points. So it would be possible for the standard maintainers to publish, independently from the UCD, alongside the ad-hoc text files, xml files that have the properties.

Best,

Daniel

P.S. I don't have a particular obsession or love with XML but when I started to implement bits of the standard a few years ago I apparently made the mistake of thinking that the UTC would eventually move away from creating ad-hoc text files and favour the structured data format of the ucdxml. So most of my implementation pipeline is geared towards consuming character properties from these files.

From unicode at unicode.org Fri Jul 7 04:02:26 2017
From: unicode at unicode.org (Andre Schappo via Unicode)
Date: Fri, 7 Jul 2017 09:02:26 +0000
Subject: Unicode education in UK Schools
Message-ID: <916BBF31-95CA-43EA-9875-1809CEF99E59@lboro.ac.uk>

There is some evidence that Unicode is now being introduced to Computer Science pupils in UK Schools.

Hove Park School give a summary of their Computer Science curriculum for Years 8 and 9: http://www.hovepark.brighton-hove.sch.uk/department/computer-science

From Year 9 curriculum summary: "• Students code text into binary using ASCII and understand the limitations of this and the need for Unicode"

I think it unlikely they give much coverage of Unicode at Hove Park School but it is a promising start. Personally I am much encouraged, as Computer Science education in the UK, at all levels, continues to be dominated by ASCII.

...and... as part of my continuing endeavours to get Computer Science/IT/ICT Internationalization on the School/College/University curricula I recently set up a google discussion forum: https://groups.google.com/forum/#!forum/computer-science-curriculum-internationalization

If you know of any academics who might be interested please do let them know of this new forum.
Unicode is, of course, a fundamental building block for internationalization and so should feature prominently in Computer Science teaching, at all levels.

André Schappo
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Fri Jul 7 10:14:04 2017
From: unicode at unicode.org (William_J_G Overington via Unicode)
Date: Fri, 7 Jul 2017 16:14:04 +0100 (BST)
Subject: Unicode education in UK Schools
In-Reply-To: <916BBF31-95CA-43EA-9875-1809CEF99E59@lboro.ac.uk>
References: <916BBF31-95CA-43EA-9875-1809CEF99E59@lboro.ac.uk>
Message-ID: <23206499.39377.1499440444300.JavaMail.defaultUser@defaultHost>

Around 1991 I was shopping in a supermarket and I noticed some product that I was buying had its ingredients list in a lot of languages.

I have been interested in typography and languages since the 1960s.

During the 1960s I was given a copy of the Riscatype Accents Catalogue. A page of particular interest had a list of the accented characters needed to typeset various languages of Europe. This was only of languages that used Latin script. Esperanto was in the list.

This list fascinated me. For example, it mentioned the u diaeresis used in French, though I learned later that words that have a u diaeresis in French are rather rare. There were the accents used for various Scandinavian languages. The various languages, if I remember correctly, each had a different selection of accented characters from the other Scandinavian languages. I found that the character a tilde, as I now know it to be called, is only used in Portuguese.

Some years later, in the early 1970s, two researchers were trying to translate a research paper using a Spanish dictionary and having great problems. I glanced at the text and said that it was not Spanish, it was Portuguese. I was asked if I spoke Portuguese and I replied that I did not and mentioned my interest in typography.
As I was saying, around 1991 I was shopping in a supermarket and I noticed some product that I was buying had its ingredients list in a lot of languages.

Thinking about this, I devised a scenario that I called The Café Äpfel.

https://forum.high-logic.com/viewtopic.php?p=5311#p5311

Around the same time I set up a roomful of PCs so that the start up page of each had text at the lower edge showing the sentence Good Day. in about six or seven languages. There was Good Day, Bonjour, and German and Italian versions, Bonan Tagon which is Esperanto and one or two others. I sought advice from linguists for some of them. Fortunately the Esperanto version did not need any accented characters otherwise it would not have been possible to include it at that time.

Here is what I wrote about The Café Äpfel in 2006 in the above-linked High-Logic Forum post.

quote

Many years ago I devised a scenario to encourage people to learn how to enter words with accented characters in them even if they did not know the language. I called it The Café Äpfel and the idea was that text from ingredients lists from multilingual food packaging could be keyed. The Café Äpfel would have menus in English, French, German and the language of the musicians and singers who were performing in the café that evening. I had this idea of a television show series with each episode combining cookery, computing and music with actors playing the continuing characters and guest musicians and singers arriving as guest stars. Well, a Portuguese band and singer would be fairly straightforward. Once the musicians come from further afield the computing gets rather more complicated! :-)

end quote

So can the idea of The Café Äpfel be updated, extended so as to promote the use of Unicode, and applied to help with education?

For example, the original idea included a television series. Now there is widespread production of videos.
Previously I wrote: > Once the musicians come from further afield the computing gets rather more complicated! :-) What if the musicians are from Latvia? What if the musicians are from Bulgaria? What if the musicians are from Japan? What if the musicians are from .... well, how about dividing the class into small groups and giving each group a language to investigate. They could all use emoji as well if you like! The whole exercise could take them beyond 7-bit to 8-bit, beyond 8-bit to 16-bit, beyond 16-bit to 21-bit. Grocery packaging, yes, but today there is the PanLex database too. https://www.panlex.org/ So how about, as an exercise, having the students typeset the list of ingredients of a gluten-free vegetable stew. There could be a list of several vegetables and the students could use the PanLex database and Google Translate to look them up and then typeset the menu, making use of Unicode code charts to find the code point of each accented character and finding out about that character. For example, the reason why a number of Central European languages each have a c caron in them. Some interesting history there. The first exercises could use languages that only use 8-bit characters, so as to get started and get some printouts produced. Maybe French, German, Portuguese and Swedish. I have tried looking for carrot in the PanLex website. https://apps.panlex.org/panlinx/ https://apps.panlex.org/panlinx/gp/29 https://apps.panlex.org/panlinx/gp/29/sub/8581 https://apps.panlex.org/panlinx/ex/368537 That was fortunate, the Latvian word for carrot has an a macron in it. So if The Café Äpfel is having musicians and singers from Latvia to perform, and the vegetable stew has carrots in it, the students need to get an a macron into the computer so as to produce the menu for the event.
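The step from 7-bit to 21-bit can be made concrete with a short Python sketch. The sample characters are assumptions chosen to match the languages mentioned (é for French, ä for German, ā for the Latvian carrot example, č for the c caron, plus one emoji); they are not taken from any real menu. Only the standard `unicodedata` module is used.

```python
import unicodedata

# Illustrative sample characters (an assumption, not data from the thread).
SAMPLES = ["e", "\u00e9", "\u00e4", "\u0101", "\u010d", "\U0001F600"]

def describe(ch):
    """Return (code point label, bits needed, character name) for one character."""
    cp = ord(ch)
    return (f"U+{cp:04X}", cp.bit_length(), unicodedata.name(ch))

for ch in SAMPLES:
    label, bits, name = describe(ch)
    print(f"{label}  needs {bits} bits  {name}")
```

A class could extend the list with whatever characters turn up on their own packaging and see which ones no longer fit in 8 or 16 bits.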
The menu exercise could also be useful so that the students find that fonts get harder to find for some languages and that fancy fonts for some languages are harder still to find, if they even exist! William Overington Friday 7 July 2017 From unicode at unicode.org Fri Jul 7 10:33:32 2017 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Fri, 7 Jul 2017 16:33:32 +0100 (BST) Subject: The management of the encoding process of emoji In-Reply-To: <11080423.4556.1497686381079.JavaMail.defaultUser@defaultHost> References: <1036413.42552.1497632385613.JavaMail.root@webmail39.bt.ext.cpcloud.co.uk> <11080423.4556.1497686381079.JavaMail.defaultUser@defaultHost> Message-ID: <25058408.40732.1499441612488.JavaMail.defaultUser@defaultHost> An issue that seems to be coming into prominence is that as a result of the requirement that emoji proposals should not be overly specific, some recent proposals seem to be trying to emphasise that they are not overly specific by suggesting that the particular emoji proposed could mean various things. This seems to present increasing ambiguity of meaning. http://unicode.org/emoji/selection.html#Specific Now, the overly in overly specific is rather subjective in its interpretation. Yet is the pendulum swinging too far the other way perhaps? Some readers may already know of the following video from the Unicode 39 Conference in 2015. https://www.youtube.com/watch?v=9ldSVbXbjl4 William Overington Friday 7 July 2017 From unicode at unicode.org Fri Jul 7 12:02:35 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Fri, 07 Jul 2017 10:02:35 -0700 Subject: Unicode education in the professional world Message-ID: <20170707100235.665a7a7059d7ee80bb4d670165c8327d.5bb4e6c72b.wbe@email03.godaddy.com> Sort of along the lines of "education"... I've been helping a colleague who is using the Oracle database and trying to work through a customer's character conversion and mojibake issues. 
I started suspecting the NLS_LANG variable and looked up some references, and found the following alternative facts on the Oracle FAQ and community pages: > SQL> SELECT DUMP(col,1016)FROM table; > > Typ=1 Len=39 CharacterSet=UTF8: 227,131,143,227,131,170 > > returns the value of a column consisting of 3 Japanese characters in > UTF8 encoding . For example the 1st char is 227(*255)+131. and: > While UTF8 uses only 2 bytes to store data AL32UTF8 uses 2 or 4 bytes. Unicode and UTF-8 have been around a long time by now. The fact that there is still fake news like this out there, steering our less Unicode-aware colleagues waaay down the wrong path, is disconcerting. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Fri Jul 7 13:45:36 2017 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Fri, 7 Jul 2017 11:45:36 -0700 Subject: Unicode education in UK Schools In-Reply-To: <916BBF31-95CA-43EA-9875-1809CEF99E59@lboro.ac.uk> References: <916BBF31-95CA-43EA-9875-1809CEF99E59@lboro.ac.uk> Message-ID: <557735b0-7617-dee8-fa2c-48e685b672c2@ix.netcom.com> I performed a quick search "Informatik und Unicode" to see whether I could find documents from German academic institutions discussing Unicode in the context of computer science (Informatik). Among the first page of search results I found a number of summaries and presentations that may have been (or possibly are) usable as introductory lectures. One item looked like it could have been intended as source material for secondary schools rather than for use in the University. I also checked whether there are accessible homework assignments that mention Unicode ("Hausaufgabe Unicode"). I didn't go very deep, but it seems that it's not untypical to relegate Unicode to a sidebar, explaining the "\u" notation and mentioning that you get ASCII if you set the upper byte to 0 (in a UTF-16 string, as supported by Java etc.). 
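Both byte-level statements in this thread can be checked in a few lines of Python, offered here as a sketch rather than as anything from the Oracle documentation. The first part decodes the six byte values shown in the quoted DUMP output as UTF-8; the second shows what "upper byte set to 0" looks like for an ASCII character in UTF-16. Only built-in `bytes` and `str` methods are used.

```python
# Byte values copied from the DUMP output quoted in this thread.
dump_bytes = bytes([227, 131, 143, 227, 131, 170])

# In UTF-8 each of these characters takes three bytes, so six bytes give two characters.
text = dump_bytes.decode("utf-8")
code_points = [f"U+{ord(ch):04X}" for ch in text]
print(len(text), code_points)

# For an ASCII character, the UTF-16 code unit has a zero upper byte.
utf16_be = "A".encode("utf-16-be")  # big-endian, so the upper byte comes first
print(list(utf16_be))
```

The code points come from the UTF-8 bit layout, not from multiplying the first byte by 255 and adding the second.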
I've not (yet) located any assignments that try to address any of the "tricky" issues in the use of Unicode. A./ On 7/7/2017 2:02 AM, Andre Schappo via Unicode wrote: > > There is some evidence that Unicode is now being introduced to > Computer Science pupils in UK Schools. Hove Park School give a summary > of their Computer Science curriculum for Years 8 and 9 > http://www.hovepark.brighton-hove.sch.uk/department/computer-science > > From Year 9 curriculum summary: "• Students code text into binary > using ASCII and understand the limitations of this and the need for > Unicode" > > I think it unlikely they give much coverage of Unicode at Hove Park > School but it is a promising start. Personally I am much encouraged, > as Computer Science education in the UK, at all levels, continues to > be dominated by ASCII. > > …and… > > as part of my continuing endeavours to get Computer Science/IT/ICT > Internationalization on the School/College/University curricula I > recently setup a google discussion forum > https://groups.google.com/forum/#!forum/computer-science-curriculum-internationalization > If > you know of any academics who might be interested please do let them > know of this new forum. Unicode is, of course, a fundamental building > block for internationalization and so should feature prominently in > Computer Science teaching, at all levels. > > André Schappo > From unicode at unicode.org Fri Jul 7 13:49:08 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Fri, 7 Jul 2017 20:49:08 +0200 Subject: Unicode education in the professional world In-Reply-To: <20170707100235.665a7a7059d7ee80bb4d670165c8327d.5bb4e6c72b.wbe@email03.godaddy.com> References: <20170707100235.665a7a7059d7ee80bb4d670165c8327d.5bb4e6c72b.wbe@email03.godaddy.com> Message-ID: 2017-07-07 19:02 GMT+02:00 Doug Ewell via Unicode : > Oracle FAQ: > While UTF8 uses only 2 bytes to store data AL32UTF8 uses 2 or 4 bytes. > > Unicode and UTF-8 have been around a long time by now.
The fact that > there is still fake news like this out there, steering our less > Unicode-aware colleagues waaay down the wrong path, is disconcerting. > Well, these are old archived docs that have not been corrected in a long time. FAQs are rarely reviewed once published and frequently become obsolete when they suggest old solutions for problems that no longer exist, or old bad workarounds with their known caveats. They were designed only for specific software versions and kept as is because newer versions are documented elsewhere (but older versions may still be in use). The situation is even worse in "community pages": their interests move over time to something else, and no one in these communities has a dedicated, mandatory task of reviewing old documents made by others; no one leads them or can order them what to do on a schedule. From unicode at unicode.org Fri Jul 7 14:55:01 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Fri, 07 Jul 2017 12:55:01 -0700 Subject: Unicode education in UK Schools Message-ID: <20170707125501.665a7a7059d7ee80bb4d670165c8327d.757e01c75e.wbe@email03.godaddy.com> Asmus Freytag wrote: > I've not (yet) located any assignments that try to address any of the > "tricky" issues in the use of Unicode. That might be a good thing. Many introductory lessons or chapters or talks about Unicode dive almost immediately into the complexities and weirdnesses, much more so than with other technical topics. This scares newbies and they walk away thinking every aspect of Unicode is complex and weird.
-- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Fri Jul 7 15:53:16 2017 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Fri, 7 Jul 2017 13:53:16 -0700 Subject: Unicode education in UK Schools In-Reply-To: <20170707125501.665a7a7059d7ee80bb4d670165c8327d.757e01c75e.wbe@email03.godaddy.com> References: <20170707125501.665a7a7059d7ee80bb4d670165c8327d.757e01c75e.wbe@email03.godaddy.com> Message-ID: On 7/7/2017 12:55 PM, Doug Ewell via Unicode wrote: > Asmus Freytag wrote: > >> I've not (yet) located any assignments that try to address any of the >> "tricky" issues in the use of Unicode. > That might be a good thing. Many introductory lessons or chapters or > talks about Unicode dive almost immediately into the complexities and > weirdnesses, much more so than with other technical topics. This scares > newbies and they walk away thinking every aspect of Unicode is complex > and weird. For a CS curriculum you really want more than asking students to use Unicode to spell their name for a modified "Hello World!" program. (For a German university, this is an interesting assignment as at least half if not more of the students would be able to complete this assignment using the ASCII subset.... except for a small minority, the others would not actually need to use something like the \u syntax, as the local keyboard would work for their names). Some of the presentations I found did mention collation and similar issues (and gave non-Latin examples) but I have not located any homework assignments that cover any of these issues (and they are not corner cases, but the ordinary complexity of text data). 
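A small Python sketch shows why collation counts as ordinary complexity, not a corner case. The word list and the `crude_key` helper are made up for illustration. The key is deliberately crude and is not the Unicode Collation Algorithm or any locale's tailoring; it only shows that plain code point order is not what a reader expects.

```python
import unicodedata

# Hypothetical word list; \u00c4 is LATIN CAPITAL LETTER A WITH DIAERESIS.
words = ["Zebra", "\u00c4pfel", "apfel", "Apfel"]

# Default sorting compares code points, so U+00C4 lands after "Z" (U+005A).
by_code_point = sorted(words)

def crude_key(word):
    """Crude stand-in for a collation key: drop combining marks, ignore case.

    This is NOT the Unicode Collation Algorithm, only an illustration.
    """
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), word)

by_crude_key = sorted(words, key=crude_key)
print(by_code_point)
print(by_crude_key)
```

A homework assignment along these lines could ask students where the crude key still goes wrong for a given language.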
A./ From unicode at unicode.org Sat Jul 8 00:57:22 2017 From: unicode at unicode.org (Andre Schappo via Unicode) Date: Sat, 8 Jul 2017 05:57:22 +0000 Subject: Unicode education in UK Schools In-Reply-To: <557735b0-7617-dee8-fa2c-48e685b672c2@ix.netcom.com> References: <916BBF31-95CA-43EA-9875-1809CEF99E59@lboro.ac.uk> <557735b0-7617-dee8-fa2c-48e685b672c2@ix.netcom.com> Message-ID: <61B06B9F-2F9E-45DC-8689-1C980BFA02FC@lboro.ac.uk> Interesting. Thanks Asmus. So what of other countries? Anyone on this list from China, Japan, Korea, Russia, Thailand ...etc... What is the situation in your countries with respect to Unicode education in your country's Schools, Colleges and Universities? TIA André Schappo > On 7 Jul 2017, at 19:45, Asmus Freytag via Unicode wrote: > > I performed a quick search "Informatik und Unicode" to see whether I could find documents from German academic institutions discussing Unicode in the context of computer science (Informatik). > > Among the first page of search results I found a number of summaries and presentations that may have been (or possibly are) usable as introductory lectures. > > One item looked like it could have been intended as source material for secondary schools rather than for use in the University. > > I also checked whether there are accessible homework assignments that mention Unicode ("Hausaufgabe Unicode"). I didn't go very deep, but it seems that it's not untypical to relegate Unicode to a sidebar, explaining the "\u" notation and mentioning that you get ASCII if you set the upper byte to 0 (in a UTF-16 string, as supported by Java etc.). > > I've not (yet) located any assignments that try to address any of the "tricky" issues in the use of Unicode. > > A./ > > > On 7/7/2017 2:02 AM, Andre Schappo via Unicode wrote: >> >> There is some evidence that Unicode is now being introduced to Computer Science pupils in UK Schools.
Hove Park School give a summary of their Computer Science curriculum for Years 8 and 9 http://www.hovepark.brighton-hove.sch.uk/department/computer-science >> >> From Year 9 curriculum summary: "• Students code text into binary using ASCII and understand the limitations of this and the need for Unicode" >> >> I think it unlikely they give much coverage of Unicode at Hove Park School but it is a promising start. Personally I am much encouraged, as Computer Science education in the UK, at all levels, continues to be dominated by ASCII. >> >> …and… >> >> as part of my continuing endeavours to get Computer Science/IT/ICT Internationalization on the School/College/University curricula I recently setup a google discussion forum https://groups.google.com/forum/#!forum/computer-science-curriculum-internationalization If you know of any academics who might be interested please do let them know of this new forum. Unicode is, of course, a fundamental building block for internationalization and so should feature prominently in Computer Science teaching, at all levels. >> >> André Schappo >> > From unicode at unicode.org Sat Jul 8 01:16:02 2017 From: unicode at unicode.org (Rebecca T via Unicode) Date: Sat, 8 Jul 2017 02:16:02 -0400 Subject: Unicode education in UK Schools In-Reply-To: <20170707125501.665a7a7059d7ee80bb4d670165c8327d.757e01c75e.wbe@email03.godaddy.com> References: <20170707125501.665a7a7059d7ee80bb4d670165c8327d.757e01c75e.wbe@email03.godaddy.com> Message-ID: > That might be a good thing. Yeah. Very seriously, it’s very important to introduce Unicode early on in CS education, even in a “hey, it’s not OK to exclude people who don’t speak English or people whose names have diacritics from using the programs you create” sort of way. Ignorance and apathy for the world’s citizens is a terrible thing and I hope that every year brings more access to tech, Unicode-enabled and ready, to more of humanity.
On Fri, Jul 7, 2017 at 3:55 PM, Doug Ewell via Unicode wrote: > Asmus Freytag wrote: > > > I've not (yet) located any assignments that try to address any of the > > "tricky" issues in the use of Unicode. > > That might be a good thing. Many introductory lessons or chapters or > talks about Unicode dive almost immediately into the complexities and > weirdnesses, much more so than with other technical topics. This scares > newbies and they walk away thinking every aspect of Unicode is complex > and weird. > > -- > Doug Ewell | Thornton, CO, US | ewellic.org > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Jul 8 05:33:14 2017 From: unicode at unicode.org (Otto Stolz via Unicode) Date: Sat, 8 Jul 2017 12:33:14 +0200 Subject: Tilde (was: Unicode education in UK Schools) In-Reply-To: <23206499.39377.1499440444300.JavaMail.defaultUser@defaultHost> References: <916BBF31-95CA-43EA-9875-1809CEF99E59@lboro.ac.uk> <23206499.39377.1499440444300.JavaMail.defaultUser@defaultHost> Message-ID: <6ad0d6f5-e495-e841-7f05-1728c2961a77@uni-konstanz.de> Hello, am 2017-07-07 um 17:14 Uhr hat William_J_G Overington geschrieben: > I found that the character a tilde as I now know it to be called is only used in Portuguese. Just for the record: ??? is used in Portuguese, Kashubian; ??? is used in Galician, Spanish, Mirandese, Catalan (only for Spanish loan words), even English (for Spanish loan words), Breton (in Peurunvan spelling), Basque; ??? is used in Estonian, Livonian (extinct since 2013); ??? is used in Livonian; ??? is used in Mirandese. I have only considered European official, and regional, languages. 
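As a side note, letters of this kind can be inspected with Python's standard `unicodedata` module. The sketch below assumes three sample letters (ã, ñ, õ), since the letters in the archived message did not survive. It shows that each precomposed letter is canonically equivalent to a base letter followed by U+0303 COMBINING TILDE, which is a different character from the free-standing U+007E TILDE.

```python
import unicodedata

# Assumed sample letters: a, n and o with tilde (written as escapes to avoid
# any ambiguity about precomposed versus decomposed source text).
LETTERS = ["\u00e3", "\u00f1", "\u00f5"]

def tilde_parts(ch):
    """Return the names of the characters in the canonical decomposition (NFD)."""
    return [unicodedata.name(c) for c in unicodedata.normalize("NFD", ch)]

for ch in LETTERS:
    print(f"U+{ord(ch):04X}", tilde_parts(ch))

# The free-standing ASCII character is a different code point with a different name.
print(unicodedata.name("~"), "/", unicodedata.name("\u0303"))
```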
Cheers, Otto From unicode at unicode.org Sat Jul 8 05:36:53 2017 From: unicode at unicode.org (Otto Stolz via Unicode) Date: Sat, 8 Jul 2017 12:36:53 +0200 Subject: Unicode education in UK Schools In-Reply-To: <557735b0-7617-dee8-fa2c-48e685b672c2@ix.netcom.com> References: <916BBF31-95CA-43EA-9875-1809CEF99E59@lboro.ac.uk> <557735b0-7617-dee8-fa2c-48e685b672c2@ix.netcom.com> Message-ID: <8911b9d4-e454-dc39-3065-c009781b4df2@uni-konstanz.de> Hello, am 2017-07-07 um 20:45 Uhr hat Asmus Freytag geschrieben: > I also checked whether there are accessible homework assignments that > mention Unicode ("Hausaufgabe Unicode"). I didn't go very deep, but it > seems that it's not untypical to relegate Unicode to a sidebar, > explaining the "\u" notation and mentioning that you get ASCII if you > set the upper byte to 0 (in a UTF-16 string, as supported by Java etc.). Try also ??bung Unicode?. Best wishes, Otto From unicode at unicode.org Sat Jul 8 07:50:24 2017 From: unicode at unicode.org (Jonathan Rosenne via Unicode) Date: Sat, 8 Jul 2017 12:50:24 +0000 Subject: Tilde (was: Unicode education in UK Schools) In-Reply-To: <6ad0d6f5-e495-e841-7f05-1728c2961a77@uni-konstanz.de> References: <916BBF31-95CA-43EA-9875-1809CEF99E59@lboro.ac.uk> <23206499.39377.1499440444300.JavaMail.defaultUser@defaultHost> <6ad0d6f5-e495-e841-7f05-1728c2961a77@uni-konstanz.de> Message-ID: Hello, To be precise, this is the COMBINING TILDE not the character TILDE (U+007E). Best Regards, Jonathan Rosenne -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Otto Stolz via Unicode Sent: Saturday, July 08, 2017 1:33 PM To: unicode at unicode.org Subject: Tilde (was: Unicode education in UK Schools) Hello, am 2017-07-07 um 17:14 Uhr hat William_J_G Overington geschrieben: > I found that the character a tilde as I now know it to be called is only used in Portuguese. Just for the record: ??? is used in Portuguese, Kashubian; ??? 
is used in Galician, Spanish, Mirandese, Catalan (only for Spanish loan words), even English (for Spanish loan words), Breton (in Peurunvan spelling), Basque; ??? is used in Estonian, Livonian (extinct since 2013); ??? is used in Livonian; ??? is used in Mirandese. I have only considered European official, and regional, languages. Cheers, Otto -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Jul 8 11:04:39 2017 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Sat, 8 Jul 2017 09:04:39 -0700 Subject: Unicode education in UK Schools In-Reply-To: <8911b9d4-e454-dc39-3065-c009781b4df2@uni-konstanz.de> References: <916BBF31-95CA-43EA-9875-1809CEF99E59@lboro.ac.uk> <557735b0-7617-dee8-fa2c-48e685b672c2@ix.netcom.com> <8911b9d4-e454-dc39-3065-c009781b4df2@uni-konstanz.de> Message-ID: <43b7f530-8fde-c9ae-8c33-5fb78b4c053b@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Jul 8 14:28:24 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 8 Jul 2017 20:28:24 +0100 Subject: Unicode education in UK Schools In-Reply-To: <43b7f530-8fde-c9ae-8c33-5fb78b4c053b@ix.netcom.com> References: <916BBF31-95CA-43EA-9875-1809CEF99E59@lboro.ac.uk> <557735b0-7617-dee8-fa2c-48e685b672c2@ix.netcom.com> <8911b9d4-e454-dc39-3065-c009781b4df2@uni-konstanz.de> <43b7f530-8fde-c9ae-8c33-5fb78b4c053b@ix.netcom.com> Message-ID: <20170708202824.1f017179@JRWUBU2> On Sat, 8 Jul 2017 09:04:39 -0700 Asmus Freytag via Unicode wrote: > But some handling > of combining mark (and also the new emoji sequences) would equally > constitute "basic" knowledge, with the Unicode algorithms like > sorting, Which major applications actually use the Unicode Collation Algorithm for sorting, that is for the key comparison part? ICU doesn't. Richard. 
From unicode at unicode.org Sat Jul 8 15:41:03 2017 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Sat, 8 Jul 2017 13:41:03 -0700 Subject: Unicode education in UK Schools In-Reply-To: <20170708202824.1f017179@JRWUBU2> References: <916BBF31-95CA-43EA-9875-1809CEF99E59@lboro.ac.uk> <557735b0-7617-dee8-fa2c-48e685b672c2@ix.netcom.com> <8911b9d4-e454-dc39-3065-c009781b4df2@uni-konstanz.de> <43b7f530-8fde-c9ae-8c33-5fb78b4c053b@ix.netcom.com> <20170708202824.1f017179@JRWUBU2> Message-ID: <37e2a4be-4973-9cc8-e717-e71165b0cfbb@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jul 10 05:09:51 2017 From: unicode at unicode.org (Shriramana Sharma via Unicode) Date: Mon, 10 Jul 2017 15:39:51 +0530 Subject: Wagging finger emoji? Message-ID: Hello. Searching UnicodeData.txt for emoji-s with the word "finger" I am getting:
1F590;RAISED HAND WITH FINGERS SPLAYED;So;0;ON;;;;;N;;;;;
1F591;REVERSED RAISED HAND WITH FINGERS SPLAYED;So;0;ON;;;;;N;;;;;
1F595;REVERSED HAND WITH MIDDLE FINGER EXTENDED;So;0;ON;;;;;N;;;;;
1F596;RAISED HAND WITH PART BETWEEN MIDDLE AND RING FINGERS;So;0;ON;;;;;N;;;;;
1F834;LEFTWARDS FINGER-POST ARROW;So;0;ON;;;;;N;;;;;
1F835;UPWARDS FINGER-POST ARROW;So;0;ON;;;;;N;;;;;
1F836;RIGHTWARDS FINGER-POST ARROW;So;0;ON;;;;;N;;;;;
1F837;DOWNWARDS FINGER-POST ARROW;So;0;ON;;;;;N;;;;;
1F91E;HAND WITH INDEX AND MIDDLE FINGERS CROSSED;So;0;ON;;;;;N;;;;;
1F92B;FACE WITH FINGER COVERING CLOSED LIPS;So;0;ON;;;;;N;;;;;
Doesn't seem to be something that is equivalent to https://goo.gl/images/dWMpQd. Is there a wagging finger emoji I missed or one in the pipeline? ☝ doesn't seem to cut it. It tells me "Look up!" And sure enough:
261D;WHITE UP POINTING INDEX;So;0;ON;;;;;N;;;;;
-- Shriramana Sharma ???????????? ????????????
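A name search like this can also be done without downloading UnicodeData.txt, using Python's bundled `unicodedata` module. This is a minimal sketch with a hypothetical helper `find_by_name`. The hits depend on which Unicode version the Python build ships, and only formal character names are searched, not aliases or CLDR emoji keywords.

```python
import sys
import unicodedata

def find_by_name(word):
    """Return (code point, name) pairs whose character name contains `word`."""
    hits = []
    for cp in range(sys.maxunicode + 1):
        # Unnamed code points (unassigned, surrogates, controls) yield "".
        name = unicodedata.name(chr(cp), "")
        if word in name:
            hits.append((cp, name))
    return hits

for cp, name in find_by_name("FINGER"):
    print(f"{cp:04X};{name}")
```

As the message observes, U+261D WHITE UP POINTING INDEX does not turn up in such a search because its name lacks the word.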
From unicode at unicode.org Wed Jul 12 08:35:02 2017 From: unicode at unicode.org (J Decker via Unicode) Date: Wed, 12 Jul 2017 06:35:02 -0700 Subject: Database missing/erroneous information Message-ID: I started looking more deeply at the javascript specification. Identifiers are defined as starting with characters with ID_Start and continued with ID_Continue attributes. I grabbed the xml database (ucd.all.grouped.xml ) in which I was able to find IDS, IDC flags ( also OIDS,OIDC, XIDS,XIDC of which meaning I'm not entirely sure of) but I started filtering out to find characters that are NOT IDS|IDC.... Something simple like numbers 0x30-0x39 are marked with IDS='N' but have no [ OX]IDC flags specified. Is a lack of flag assumed N or Y? www.unicode.org/reports/tr42/ documentation on the XML file format doesn't specify. http://www.unicode.org/reports/tr31/ I see 'ID_Continue characters include ID_Start characters, plus characters ' most languages do support identifiers like a1, a2, etc as valid identifiers, so certainly numbers should have IDC even though they're not IDS. Are there characters that are IDS without being IDC? There are certainly characters that are IDC without IDS. some examples..... found char { cp: '0034', na: 'DIGIT FOUR', gc: 'Nd', nt: 'De', nv: '4', bc: 'EN', lb: 'NU', sc: 'Zyyy', scx: 'Zyyy', Alpha: 'N', Hex: 'Y', AHex: 'Y', IDS: 'N', XIDS: 'N', WB: 'NU', SB: 'NU', Cased: 'N', CWCM: 'N', InSC: 'Number' } (this has IDC notation but not IDS; since it says 'digit' I assume this is a number type, and should not be IDS.) found char { cp: '0F32', na: 'TIBETAN DIGIT HALF NINE', gc: 'No', nt: 'Nu', nv: '17/2', Alpha: 'N', IDC: 'N', XIDC: 'N', SB: 'XX', InSC: 'Number' } This might be not IDS but is IDC? found char { cp: '203F', na: 'UNDERTIE', gc: 'Pc', IDC: 'Y', XIDC: 'Y', Pat_Syn: 'N', WB: 'EX' } this is sort of IDS but not IDC? 
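The start-versus-continue distinction can be probed from Python, whose `str.isidentifier()` is defined in terms of XID_Start and XID_Continue (with "_" additionally allowed as a start character). This is only a sketch: Python's identifier rules are not identical to JavaScript's, and the helper names `can_start` and `can_continue` are made up for illustration.

```python
def can_start(ch):
    """True if `ch` alone is a valid identifier, i.e. it can start one."""
    return ch.isidentifier()

def can_continue(ch):
    """True if `ch` is accepted after a known start character."""
    return ("a" + ch).isidentifier()

# DIGIT FOUR, TIBETAN DIGIT HALF NINE, UNDERTIE (the first three examples above).
for ch in ["4", "\u0f32", "\u203f"]:
    print(f"U+{ord(ch):04X} start={can_start(ch)} continue={can_continue(ch)}")
```

In this check, DIGIT FOUR and UNDERTIE can continue an identifier but not start one, while TIBETAN DIGIT HALF NINE can do neither.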
found char { cp: '309B', na: 'KATAKANA-HIRAGANA VOICED SOUND MARK', gc: 'Sk', dt: 'com', dm: '0020 3099', bc: 'ON', lb: 'NS', sc: 'Zyyy', scx: 'Hira Kana', Alpha: 'N', Dia: 'Y', OIDS: 'Y', XIDS: 'N', XIDC: 'N', WB: 'KA', SB: 'XX', NFKC_QC: 'N', NFKD_QC: 'N', XO_NFKC: 'Y', XO_NFKD: 'Y', CI: 'Y', CWKCF: 'Y', NFKC_CF: '0020 3099', vo: 'Tu' } From unicode at unicode.org Wed Jul 12 09:22:28 2017 From: unicode at unicode.org (Eric Muller via Unicode) Date: Wed, 12 Jul 2017 07:22:28 -0700 Subject: Database missing/erroneous information In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jul 14 19:32:44 2017 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Sat, 15 Jul 2017 02:32:44 +0200 (CEST) Subject: Unicode education in UK Schools In-Reply-To: <23206499.39377.1499440444300.JavaMail.defaultUser@defaultHost> References: <916BBF31-95CA-43EA-9875-1809CEF99E59@lboro.ac.uk> <23206499.39377.1499440444300.JavaMail.defaultUser@defaultHost> Message-ID: <294329413.6711.1500078764872.JavaMail.www@wwinf1d23> On Fri, 7 Jul 2017 16:14:04 +0100 (BST), William_J_G Overington via Unicode wrote: > […] > > For example, it mentioned the u diaeresis used in French, though I learned later that words that have a u diaeresis in French are rather rare. > Today, words containing 'u diaeresis' have become more frequent in French, since last fall (2016) a reformed orthography designed as soon as in 1990 [1] has become valid (though it is not mandatory [2]). Among the novelties, it specifies that words like "to disambiguate" have the diaeresis shifted from the last 'i' of «désambiguïser» to the 'u' of «désambigüiser».
Kind regards, Marcel [1] https://en.wikipedia.org/wiki/Reforms_of_French_orthography#Tr.C3.A9ma [2] https://en.wikipedia.org/wiki/Reforms_of_French_orthography#cite_note-6 From unicode at unicode.org Sat Jul 15 15:30:03 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sat, 15 Jul 2017 22:30:03 +0200 Subject: Unicode education in UK Schools In-Reply-To: <294329413.6711.1500078764872.JavaMail.www@wwinf1d23> References: <916BBF31-95CA-43EA-9875-1809CEF99E59@lboro.ac.uk> <23206499.39377.1499440444300.JavaMail.defaultUser@defaultHost> <294329413.6711.1500078764872.JavaMail.www@wwinf1d23> Message-ID: As well, the feminine form of the common adjective "ambigu" has been "regularized" to place the diaeresis ("tréma" in French) on the pronounced u rather than on the mute e added for the regular feminine "ambigüe": it also correctly forces the pronunciation of this u, which would otherwise be mute too, as a "u" after a "g" is often there only to avoid reading it as a "j" (like in "exergue", "digue" and many terms ending in "-gue(s)" where only the final /g/ is pronounced). Not writing this tréma anywhere would be wrong. The tradition placed the diaeresis on the mute e but it was not clear that it meant pronouncing the preceding "u" as a vowel. For terms like "ambigüité" it is also more natural to place it on the "u" (to break the normal "gu" digram which is consonantal only and have some vocal rendering of the "u" vowel, even if here it would be pronounced more like a short but clearly spelled half-vowel sliding to the next "i", as in "huile" or "lui", but still not like a /w/ as in "oui" /wi/: normal French never pronounces an isolated "u" as /u/ like in English, except where it occurs in the French digram "ou" /u/ which is itself never like an English diphthong; the standard French "u" is pronounced like the German /y/ written as the digram "ue" or as "ü" with its umlaut...
which is not a diaeresis phonetically; French transforms this "u" /y/ into a gliding semivowel where it immediately precedes another non-mute and non-nasal vowel; but French orthography has no specific letter for this semivowel which remains written "u", or "ü" only where it has to be detached to avoid pronouncing it as with normal digrams composed with it) a 2017-07-15 2:32 GMT+02:00 Marcel Schneider via Unicode : > On Fri, 7 Jul 2017 16:14:04 +0100 (BST), William_J_G Overington via > Unicode wrote: > > > […] > > > > For example, it mentioned the u diaeresis used in French, though I > learned later that words that have a u diaeresis in French are rather rare. > > > Today, words containing 'u diaeresis' have become more frequent in French, > since last fall (2016) a reformed orthography designed as soon > as in 1990 [1] has become valid (though it is not mandatory [2]). Among > the novelties, it specifies that words like "to disambiguate" have > the diaeresis shifted from the last 'i' of «désambiguïser» to the 'u' of > «désambigüiser». > > Kind regards, > > Marcel > > [1] https://en.wikipedia.org/wiki/Reforms_of_French_orthography#Tr.C3.A9ma > [2] https://en.wikipedia.org/wiki/Reforms_of_French_orthography# > cite_note-6 > > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From unicode at unicode.org Sat Jul 15 22:12:37 2017 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Sun, 16 Jul 2017 05:12:37 +0200 (CEST) Subject: Unicode education in UK Schools In-Reply-To: References: <916BBF31-95CA-43EA-9875-1809CEF99E59@lboro.ac.uk> <23206499.39377.1499440444300.JavaMail.defaultUser@defaultHost> <294329413.6711.1500078764872.JavaMail.www@wwinf1d23> Message-ID: <539921443.50.1500174757540.JavaMail.www@wwinf1p07> On Sat, 15 Jul 2017 22:30:03 +0200, Philippe Verdy wrote: > > As well, the feminine form of the common adjective "ambigu" has been "regularized" to place the diaeresis ("tréma" in French) on the pronounced u > rather than on the mute e added for the regular feminine "ambigüe": it also correctly forces the pronunciation of this u, which would otherwise be > mute too, as a "u" after a "g" is often there only to avoid reading it as a "j" (like in "exergue", "digue" and many terms ending in "-gue(s)" where only > the final /g/ is pronounced). Not writing this tréma anywhere would be wrong. The tradition placed the diaeresis on the mute e but it was not clear that it > meant pronouncing the preceding "u" as a vowel. > > For terms like "ambigüité" it is also more natural to place it on the "u" (to break the normal "gu" digram which is consonantal only and have some > vocal rendering of the "u" vowel, even if here it would be pronounced more like a short but clearly spelled half-vowel sliding to the next "i", as in > "huile" or "lui", but still not like a /w/ as in "oui" /wi/: normal French never pronounces an isolated "u" as /u/ like in English, except where it occurs in > the French digram "ou" /u/ which is itself never like an English diphthong; the standard French "u" is pronounced like the German /y/ written as the > digram "ue" or as "ü" with its umlaut...
which is not a diaeresis phonetically; French transforms this "u" /y/ into a gliding semivowel where it > immediately precedes another non-mute and non-nasal vowel; but French orthography has no specific letter for this semivowel which remains written > "u", or "ü" only where it has to be detached to avoid pronouncing it as with normal digrams composed with it) Indeed, following the basic grammatical meaning of the diaeresis as the “resolution of a diphthong into two syllables” (Liddell&Scott), one might wonder whether the tréma should be placed on the first vowel or on the second vowel. On 'oe' it stays the old way: "Tronoën", "Citroën". Since I’ve been kindly informed off-list that this point of the reform actually “regularizes” (as you put it) a mistake, I’ll have to make use of the optionality of applying the new rules, and reset the words in my files to the old spelling. As you know, I disagree with that way of designing standards. > 2017-07-15 2:32 GMT+02:00 Marcel Schneider via Unicode : > > > On Fri, 7 Jul 2017 16:14:04 +0100 (BST), William_J_G Overington via Unicode wrote: > > > > > […] > > > > > > For example, it mentioned the u diaeresis used in French, though I learned later that words that have a u diaeresis in French are rather rare. > > > > > Today, words containing 'u diaeresis' have become more frequent in French, since last fall (2016) a reformed orthography designed as soon > > as in 1990 [1] has become valid (though it is not mandatory [2]). Among the novelties, it specifies that words like "to disambiguate" have > > the diaeresis shifted from the last 'i' of «désambiguïser» to the 'u' of «désambigüiser».
> > > > Kind regards, > > > > Marcel > > > > [1] https://en.wikipedia.org/wiki/Reforms_of_French_orthography#Tr.C3.A9ma > > [2] https://en.wikipedia.org/wiki/Reforms_of_French_orthography#cite_note-6 > > From unicode at unicode.org Sat Jul 15 23:13:02 2017 From: unicode at unicode.org (Dov Grobgeld via Unicode) Date: Sun, 16 Jul 2017 07:13:02 +0300 Subject: Problems with BidiCharTest.txt Message-ID: Hello, I sent the following message as a report to the unicode consortium, but I thought that perhaps someone here might give me some feedback as well. While implementing UAX#9 for Unicode 6.3 (and beyond) in FriBidi, I'm trying to pass all the tests of BidiCharacterTest.txt , and I'm having problem understanding a few of the tests that to me appear to contradict the spefication. The problematic lines in BidiCharacterTest-10.0.0.txt are the tests on lines 262, 263, and 264. Let's consider test from line 262: Dir: RTL Input: a ( b c ) _ 1 Level: 2 2 2 x 4 x 1 1 2 The problem I'm having is that the first opening bracket is assigned level 2 and the closing bracket level 1. This seems to contradict the three rules N0.b, N0.c.1, and N0.c.2 in the specification that all describe overriding the type of both brackets with either the embedding or the opposite direction. The only case we can possibly get different levels (correct me if I'm wrong!) is if rule N0.d is applied and the brackets retain their neutral status until they are resolved in subsequent rules. I would very much appreciate if you would either acknowledge a bug or correct a misunderstanding on my part. Thank you in advance! Dov -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Sun Jul 16 12:09:19 2017 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Sun, 16 Jul 2017 20:09:19 +0300 Subject: Problems with BidiCharTest.txt In-Reply-To: (message from Dov Grobgeld via Unicode on Sun, 16 Jul 2017 07:13:02 +0300) References: Message-ID: <8360ese2bk.fsf@gnu.org> > Date: Sun, 16 Jul 2017 07:13:02 +0300 > From: Dov Grobgeld via Unicode > > While implementing UAX#9 for Unicode 6.3 (and beyond) in FriBidi, I'm trying to pass all the tests of > BidiCharacterTest.txt , and I'm having problem understanding a few of the tests that to me appear to > contradict the spefication. The problematic lines in BidiCharacterTest-10.0.0.txt are the tests on lines 262, > 263, and 264. > > Let's consider test from line 262: (I believe you meant line 264.) > Dir: RTL > Input: a ( b c ) _ 1 > Level: 2 2 2 x 4 x 1 1 2 > > The problem I'm having is that the first opening bracket is assigned level 2 and the closing bracket level 1. > > This seems to contradict the three rules N0.b, N0.c.1, and N0.c.2 in the specification that all describe > overriding the type of both brackets with either the embedding or the opposite direction. The only case we can > possibly get different levels (correct me if I'm wrong!) is if rule N0.d is applied and the brackets retain their > neutral status until they are resolved in subsequent rules. The example is correct, IMO. (FWIW, Emacs produces the same reordered display as expected by the test.) I think the effect you mention is produced by the RLE..PDF embedding: it causes the opening and the closing parentheses to be in 2 different isolating run sequences, see examples in BD13. Bracket pairs are processed as such only if they are in the same isolating run sequence. Try the same test without the RLE..PDF part, and you will see the result you expect. 
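
The explanation above hinges on which characters in the test input are explicit embedding controls (RLE, PDF) rather than ordinary letters or brackets. As a rough aid, and not a UAX #9 implementation, the following Python sketch lists the Bidi_Class of each character using the standard `unicodedata` module. The `sample` string is a hypothetical input made up for illustration; it is not the actual data from BidiCharacterTest.txt, and the helper name `bidi_classes` is likewise an assumption.

```python
import unicodedata

# Explicit directional formatting characters mentioned in the thread.
CONTROLS = {
    "RLE": "\u202B",  # RIGHT-TO-LEFT EMBEDDING
    "PDF": "\u202C",  # POP DIRECTIONAL FORMATTING
    "RLI": "\u2067",  # RIGHT-TO-LEFT ISOLATE
    "PDI": "\u2069",  # POP DIRECTIONAL ISOLATE
}


def bidi_classes(text):
    """Return (code point, Bidi_Class) pairs for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch)) for ch in text]


# Hypothetical sample, NOT the real test line: the opening parenthesis sits
# outside an RLE..PDF embedding and the closing parenthesis inside it.
sample = "a(" + CONTROLS["RLE"] + "b)" + CONTROLS["PDF"] + "1"

for code_point, bidi_class in bidi_classes(sample):
    print(code_point, bidi_class)
```

Seeing an RLE or PDF between the two brackets is a hint that they may fall into different isolating run sequences, in which case rule N0 does not treat them as a pair.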
From unicode at unicode.org Sun Jul 16 14:28:15 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sun, 16 Jul 2017 21:28:15 +0200 Subject: Problems with BidiCharTest.txt In-Reply-To: <8360ese2bk.fsf@gnu.org> References: <8360ese2bk.fsf@gnu.org> Message-ID: That's another argument to deprecate the use of RLE/PDF (or embedding mode) in favor of the more recent isolating mode (which causes the text just after the isolated text to not inherit the direction context of the last inner content, as it occurs here with parentheses that cannot match the same context before a RLE, but would match within the same context with the isolation mode). The isolation mode is also the one strongly recommended by default for elements in HTML, but legacy browsers still don't have support for isolation mode and were mapping element by default using the legacy embedding mode which has such caveats. So the specification is correct, it reproduces the legacy behavior as it was initially defined (and did not change) for RLE/PDF. 2017-07-16 19:09 GMT+02:00 Eli Zaretskii via Unicode : > > Date: Sun, 16 Jul 2017 07:13:02 +0300 > > From: Dov Grobgeld via Unicode > > > > While implementing UAX#9 for Unicode 6.3 (and beyond) in FriBidi, I'm > trying to pass all the tests of > > BidiCharacterTest.txt , and I'm having problem understanding a few of > the tests that to me appear to > > contradict the spefication. The problematic lines in > BidiCharacterTest-10.0.0.txt are the tests on lines 262, > > 263, and 264. > > > > Let's consider test from line 262: > > (I believe you meant line 264.) > > > Dir: RTL > > Input: a ( b c ) _ 1 > > Level: 2 2 2 x 4 x 1 1 2 > > > > The problem I'm having is that the first opening bracket is assigned > level 2 and the closing bracket level 1. > > > > This seems to contradict the three rules N0.b, N0.c.1, and N0.c.2 in the > specification that all describe > > overriding the type of both brackets with either the embedding or the > opposite direction. 
The only case we can > > possibly get different levels (correct me if I'm wrong!) is if rule N0.d > is applied and the brackets retain their > > neutral status until they are resolved in subsequent rules. > > The example is correct, IMO. (FWIW, Emacs produces the same reordered > display as expected by the test.) I think the effect you mention is > produced by the RLE..PDF embedding: it causes the opening and the > closing parentheses to be in 2 different isolating run sequences, see > examples in BD13. Bracket pairs are processed as such only if they > are in the same isolating run sequence. Try the same test without the > RLE..PDF part, and you will see the result you expect. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Jul 16 14:56:41 2017 From: unicode at unicode.org (Dov Grobgeld via Unicode) Date: Sun, 16 Jul 2017 22:56:41 +0300 Subject: Problems with BidiCharTest.txt In-Reply-To: <8360ese2bk.fsf@gnu.org> References: <8360ese2bk.fsf@gnu.org> Message-ID: Thanks Eli. That makes sense of the test. Now I just need to figure out how to implement it... Indeed, Philippe, the isolate semantics is much easier to wrap your head around. Regards, Dov On Sun, Jul 16, 2017 at 8:09 PM, Eli Zaretskii wrote: > > Date: Sun, 16 Jul 2017 07:13:02 +0300 > > From: Dov Grobgeld via Unicode > > > > While implementing UAX#9 for Unicode 6.3 (and beyond) in FriBidi, I'm > trying to pass all the tests of > > BidiCharacterTest.txt , and I'm having problem understanding a few of > the tests that to me appear to > > contradict the spefication. The problematic lines in > BidiCharacterTest-10.0.0.txt are the tests on lines 262, > > 263, and 264. > > > > Let's consider test from line 262: > > (I believe you meant line 264.) > > > Dir: RTL > > Input: a ( b c ) _ 1 > > Level: 2 2 2 x 4 x 1 1 2 > > > > The problem I'm having is that the first opening bracket is assigned > level 2 and the closing bracket level 1. 
> > > > This seems to contradict the three rules N0.b, N0.c.1, and N0.c.2 in the > specification that all describe > > overriding the type of both brackets with either the embedding or the > opposite direction. The only case we can > > possibly get different levels (correct me if I'm wrong!) is if rule N0.d > is applied and the brackets retain their > > neutral status until they are resolved in subsequent rules. > > The example is correct, IMO. (FWIW, Emacs produces the same reordered > display as expected by the test.) I think the effect you mention is > produced by the RLE..PDF embedding: it causes the opening and the > closing parentheses to be in 2 different isolating run sequences, see > examples in BD13. Bracket pairs are processed as such only if they > are in the same isolating run sequence. Try the same test without the > RLE..PDF part, and you will see the result you expect. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Jul 16 21:42:58 2017 From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode) Date: Mon, 17 Jul 2017 11:42:58 +0900 Subject: Problems with BidiCharTest.txt In-Reply-To: References: <8360ese2bk.fsf@gnu.org> Message-ID: <54e121f0-293a-d2a5-7e55-2c716d9f9e1b@it.aoyama.ac.jp> On 2017/07/17 04:28, Philippe Verdy via Unicode wrote: > The isolation mode is also the one strongly recommended by default for > elements in HTML, Well, that's for sure, because the "i" in "bdi" stands for "isolation", and the element was newly created for the isolation mode. Regards, Martin. > but legacy browsers still don't have support for > isolation mode and were mapping element by default using the legacy > embedding mode which has such caveats. So the specification is correct, it > reproduces the legacy behavior as it was initially defined (and did not > change) for RLE/PDF. 
From unicode at unicode.org Mon Jul 17 04:43:23 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Mon, 17 Jul 2017 11:43:23 +0200 Subject: Problems with BidiCharTest.txt In-Reply-To: <54e121f0-293a-d2a5-7e55-2c716d9f9e1b@it.aoyama.ac.jp> References: <8360ese2bk.fsf@gnu.org> <54e121f0-293a-d2a5-7e55-2c716d9f9e1b@it.aoyama.ac.jp> Message-ID: That's not so sure, there are legacy browsers using an embedded mode for bdi as they don't implement the isolation mode (the older version of the BiDi algorithm). 2017-07-17 4:42 GMT+02:00 Martin J. D?rst : > On 2017/07/17 04:28, Philippe Verdy via Unicode wrote: > > The isolation mode is also the one strongly recommended by default for >> elements in HTML, >> > > Well, that's for sure, because the "i" in "bdi" stands for "isolation", > and the element was newly created for the isolation mode. > > Regards, Martin. > > > but legacy browsers still don't have support for >> isolation mode and were mapping element by default using the legacy >> embedding mode which has such caveats. So the specification is correct, it >> reproduces the legacy behavior as it was initially defined (and did not >> change) for RLE/PDF. >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jul 17 07:25:41 2017 From: unicode at unicode.org (=?UTF-8?Q?Christoph_P=C3=A4per?= via Unicode) Date: Mon, 17 Jul 2017 14:25:41 +0200 (CEST) Subject: Emoji Space Message-ID: <1695207752.82686.1500294342062@ox.hosteurope.de> As you may know, the combined original Japanese emoji set included three whitespace characters: one was the full width of a (square) emoji, one was half that and the last one was a quarter blank. Their KDDI Shift-JIS codes were F7A9, F7AA and F7AB, respectively, and their internal numeric IDs were #173, #174 and #175, respectively. They were apparently not adapted as new Unicode characters and no existing space character gained the Emoji property. 
Which existing characters then should be used to align emojis (in a square grid)? Since emojis are square glyphs in all relevant implementations one would expect at least an em-wide blank would be available in emoji input systems. Is this U+3000 Ideographic Space, U+2003 Em Space or U+2001 Em Quad? I assume the other two original emoji spaces are best mapped to U+2002 En Space and U+2005 Four-per-Em Space. There is no white space character explicitly half an em wide, it seems.

Finally, should smart fonts make U+0020 exactly as wide as an em when between two emojis?

From unicode at unicode.org Mon Jul 17 08:25:57 2017
From: unicode at unicode.org (Alastair Houghton via Unicode)
Date: Mon, 17 Jul 2017 14:25:57 +0100
Subject: Emoji Space
In-Reply-To: <1695207752.82686.1500294342062@ox.hosteurope.de>
References: <1695207752.82686.1500294342062@ox.hosteurope.de>
Message-ID:

On 17 Jul 2017, at 13:25, Christoph Päper via Unicode wrote:
>
> Finally, should smart fonts make U+0020 exactly as wide as an em when between two emojis?

I'll leave it to others to answer the rest (I don't know the answers to those), but the answer to this is clearly "no". Otherwise, a user writing a series of Emoji in prose would find that any attempt to space them using the ordinary space character led to unexpected behaviour.

Also, I don't think you can rely on spaces undergoing glyph mapping; because renderers need to handle spaces specially (for justification and line breaking purposes), it's quite likely that at least some renderers special case them. It'd be interesting to see what various popular applications do if you try to make spaces change size using OpenType...

Kind regards,

Alastair.
-- http://alastairs-place.net From unicode at unicode.org Mon Jul 17 09:59:10 2017 From: unicode at unicode.org (Markus Scherer via Unicode) Date: Mon, 17 Jul 2017 07:59:10 -0700 Subject: Emoji Space In-Reply-To: <1695207752.82686.1500294342062@ox.hosteurope.de> References: <1695207752.82686.1500294342062@ox.hosteurope.de> Message-ID: On Mon, Jul 17, 2017 at 5:25 AM, Christoph P?per via Unicode < unicode at unicode.org> wrote: > As you may know, the combined original Japanese emoji set included three > whitespace characters: one was the full width of a (square) emoji, one was > half that and the last one was a quarter blank. Their KDDI Shift-JIS codes > were F7A9, F7AA and F7AB, respectively, and their internal numeric IDs were > #173, #174 and #175, respectively. They were apparently not adapted as new > Unicode characters and no existing space character gained the Emoji > property. > They were among the 115 or so emoji unified with Unicode 5.2-and-earlier characters. http://www.unicode.org/Public/UCD/latest/ucd/EmojiSources.txt 2002;;F7AA; 2003;;F7A9; 2005;;F7AB; markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Jul 18 06:39:34 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 18 Jul 2017 13:39:34 +0200 Subject: Emoji Space In-Reply-To: <1695207752.82686.1500294342062@ox.hosteurope.de> References: <1695207752.82686.1500294342062@ox.hosteurope.de> Message-ID: 2017-07-17 14:25 GMT+02:00 Christoph P?per via Unicode : > > Finally, should smart fonts make U+0020 exactly as wide as an em when > between two emojis? > Really I don't think so, Emojis are not specific to East-Asian use even if a significant part of them come from there. 
These basic spaces are separate clusters separating separate emoji sequences; they are not themselves part of the sequences. Why would this be specific to spaces, and why wouldn't you apply the same logic to all other Basic Latin characters, rendering their fullwidth versions between emojis, or even before and after them? It would simply break things everywhere.

In reality, emojis are always composed separately: they are isolated symbols in a stream of normal text in any script (with a weak direction coming from their context of occurrence, but not mirrorable in BiDi contexts). People will then want to separate them by spaces... or not, if they are used in South-East Asian scripts; or they will surround them with standard punctuation, or join them with prefixes/suffixes from words, as in "I ?ed it !".

If one wants specific metrics for the spaces that separate emojis, Unicode already has plenty of spaces usable from normal scripts. U+0020 is not the only choice, but it should still use the standard font metrics of the scripts the font was designed for. Emoji fonts are usually very specific (to support colors), except in symbol scripts.

Remember also that many emojis have two standardized presentations: one for use as normal symbols with simple monochromatic glyphs (where colors may be replaced by patterns simulated with strokes, and the internal metrics slightly modified to keep them recognizable), and another which is more colorful and elaborate and could be more compact.

Emojis also don't necessarily have to be drawn in an em-square; they can be variable-width, as in normal scripts. Monospaced rendering is just a font design style: if you have such a font (e.g. for rendering on a text console), your basic spaces and other Basic Latin characters will use half-em width everywhere, or fullwidth everywhere. You don't need any "smart font" feature using contextual rendering rules, and in fact such rules will be undesirable in most uses of monospaced fonts.
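
The three EmojiSources.txt entries quoted earlier in this thread (`2002;;F7AA;`, `2003;;F7A9;`, `2005;;F7AB;`) already answer the original mapping question. The Python sketch below parses just those three lines and prints the Unicode character each KDDI code was unified with. It assumes the field order documented in the file header (Unicode code point or sequence; DoCoMo; KDDI; SoftBank Shift-JIS codes); the function name `parse_emoji_sources` is made up for this illustration.

```python
import unicodedata

# The three entries quoted in the thread, verbatim.
QUOTED_ENTRIES = """\
2002;;F7AA;
2003;;F7A9;
2005;;F7AB;
"""


def parse_emoji_sources(text):
    """Map KDDI Shift-JIS codes to the Unicode string they were unified with.

    Assumed field order: Unicode; DoCoMo; KDDI; SoftBank.
    """
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split(";")
        unicode_field, kddi = fields[0], fields[2]
        if kddi:
            mapping[kddi] = "".join(
                chr(int(cp, 16)) for cp in unicode_field.split()
            )
    return mapping


for kddi, chars in sorted(parse_emoji_sources(QUOTED_ENTRIES).items()):
    print(kddi, f"U+{ord(chars):04X}", unicodedata.name(chars))
```

With these entries, the full-width blank F7A9 corresponds to U+2003 EM SPACE, the half-width F7AA to U+2002 EN SPACE, and the quarter-width F7AB to U+2005 FOUR-PER-EM SPACE.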
-------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jul 19 08:32:20 2017 From: unicode at unicode.org (Leonardo Boiko via Unicode) Date: Wed, 19 Jul 2017 15:32:20 +0200 Subject: =?UTF-8?Q?First_bonafide_use_=28=E2=89=A0_mention=29_of_emoji_by_an_acad?= =?UTF-8?Q?emic_publisher=3F?= Message-ID: Perhaps not the first, but that I know of at least. I don't know since when, but *Writing Systems Research*, published by Taylor & Francis, is using cute emojis as markers for references/hyperlinks, in the web edition only (not in the PDF release): http://www.tandfonline.com/doi/full/10.1080/17586801.2017.1335634 ? Presumably they might do this for all their online journals; I can't find out because this publisher's a paywalling capitalist pig. To my boundless, heartbreaking disappointment, these emojis are not U+1F4D8 BLUE BOOKs ?? from a custom @css font, but rather private-use U+F02Ds, which index a book glyph in some icon pack called Font Awesome . At least they're inserted via CSS :before-selectors, which means they'll be automatically treated as decorations and seamlessly excluded from copy-paste operations. I rate this a typographic blunder, as they inelegantly crowd the page overwhelming the text, and are neither pretty nor functional (the somber blue is a good highlight for the hyperlinks; the big dark blobs highlight them too much?I'm not *that *interested in the reference links that I'd want them metaphorically blown all over my face; I'd rather have them unobtrusively in margin notes, along with their metadata and asides). Still, I like the sheer iconoclastic bravado of using emoji in such a context. I kind of hope that someone comes up with *good *uses of emoji in otherwise serious media. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: tand_emoji.png Type: image/png Size: 248731 bytes Desc: not available URL: From unicode at unicode.org Sun Jul 23 18:45:03 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Sun, 23 Jul 2017 17:45:03 -0600 Subject: =?utf-8?Q?Re:_First_bonafide_use_=28=E2=89=A0_mentio?= =?utf-8?Q?n=29_of_emoji_by_an_academic_publi?= =?utf-8?Q?sher=3F?= Message-ID: <1146C85B5D3D4B1387969FF4C66E2184@DougEwell> Leonardo Boiko wrote: > To my boundless, heartbreaking disappointment, these emojis are not > U+1F4D8 BLUE BOOKs ?? from a custom @css font, but rather private-use > U+F02Ds, which index a book glyph in some icon pack called Font > Awesome . At least they're > inserted via CSS :before-selectors, which means they'll be > automatically treated as decorations and seamlessly excluded from > copy-paste operations. We use Font Awesome for my project at work, for symbols embedded in text which have no reason and no need to be interchanged, converted to other character sets, or indexed in search engines. Font Awesome also includes some symbols that, we think, won't ever be Unicode emoji, such as the Android, Apple, Bluetooth, and Windows logos. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Mon Jul 24 05:32:40 2017 From: unicode at unicode.org (Leonardo Boiko via Unicode) Date: Mon, 24 Jul 2017 12:32:40 +0200 Subject: =?UTF-8?Q?Re=3A_First_bonafide_use_=28=E2=89=A0_mention=29_of_emoji_by_an_?= =?UTF-8?Q?academic_publisher=3F?= In-Reply-To: <1146C85B5D3D4B1387969FF4C66E2184@DougEwell> References: <1146C85B5D3D4B1387969FF4C66E2184@DougEwell> Message-ID: I don't have anything against that, in principle. It would just be more satisfying for me if the blue books were encoded in the font as U+1F4D8s, rather than U+F02Ds. Or, if the colors are done at a CSS level, as ?? U+1F4D5 CLOSED BOOKs or the like. Same goes for the other icons in FA which *do *have an emoji counterpart (which would be, I suspect, the majority). 
The reasons I'd prefer such an encoding are, to be honest, purely ?sthetic; but they could also be argued on functional terms. Consider Instagram's fascinating results when applying word-vector models to emoji, for example ( https://engineering.instagram.com/emojineering-part-1-machine-learning-for-emoji-trendsmachine-learning-for-emoji-trends-7f5f9cb979ad ). One never knows just *when *someone will want to interchange, convert, or index characters; even emoji symbols can find valid, unexpected applications. Suppose a researcher in the future wants to investigate early usage of academic emoji in the 21st century. Or suppose something as simple as trying to find out which emoji are used most frequently in a field, country, or time period. Having the icon encoded as U+1F4D5 rather than U+F02D would help this sort of interoperability, while causing no problems for anyone (it's, after all, just a matter of choosing which numbers you give to which icons; calling it #128213 is as easy as calling it #61485). 2017-07-24 1:45 GMT+02:00 Doug Ewell via Unicode : > Leonardo Boiko wrote: > > To my boundless, heartbreaking disappointment, these emojis are not >> U+1F4D8 BLUE BOOKs ?? from a custom @css font, but rather private-use >> U+F02Ds, which index a book glyph in some icon pack called Font >> Awesome . At least they're >> inserted via CSS :before-selectors, which means they'll be >> automatically treated as decorations and seamlessly excluded from >> copy-paste operations. >> > > We use Font Awesome for my project at work, for symbols embedded in text > which have no reason and no need to be interchanged, converted to other > character sets, or indexed in search engines. > > Font Awesome also includes some symbols that, we think, won't ever be > Unicode emoji, such as the Android, Apple, Bluetooth, and Windows logos. > > -- > Doug Ewell | Thornton, CO, US | ewellic.org > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Mon Jul 24 09:24:50 2017 From: unicode at unicode.org (=?UTF-8?Q?Christoph_P=C3=A4per?= via Unicode) Date: Mon, 24 Jul 2017 16:24:50 +0200 (CEST) Subject: =?UTF-8?Q?Re:_First_bonafide_use_(=E2=89=A0_mention)?= =?UTF-8?Q?_of_emoji_by_an_academic_publisher=3F?= In-Reply-To: References: <1146C85B5D3D4B1387969FF4C66E2184@DougEwell> Message-ID: <1341523565.30759.1500906290152@ox.hosteurope.de> Leonardo Boiko: > > It would just be more > satisfying for me if the blue books were encoded in the font as U+1F4D8s, > rather than U+F02Ds. Or, if the colors are done at a CSS level, as ?? > U+1F4D5 CLOSED BOOKs or the like. Same goes for the other icons in FA > which *do *have an emoji counterpart (which would be, I suspect, the > majority). This issue has been raised long ago with the developers of such symbol fonts: The reason why this is not being done is the special treatment of emoji characters by vendors who always replace them by their custom images. Since such fonts are mostly used on the Web platform, the solution would be a CSS property to force `text` rendering of emojis: From unicode at unicode.org Mon Jul 24 09:34:31 2017 From: unicode at unicode.org (Leonardo Boiko via Unicode) Date: Mon, 24 Jul 2017 16:34:31 +0200 Subject: =?UTF-8?Q?Text_rendering_of_emojis_=28was=3A_Re=3A_First_bonafide_us?= =?UTF-8?Q?e_=28=E2=89=A0_mention=29_of_emoji_by_an_academic_publisher=3F=29?= Message-ID: Speaking of which?sorry if this is going off-topic, but I don't know where else could I ask?I don't think there's a way to configure Linux or Android systems to always prefer text rendering for emojis, is there? (I love text emojis.) 2017-07-24 16:24 GMT+02:00 Christoph P?per via Unicode : > Leonardo Boiko: > > > > It would just be more > > satisfying for me if the blue books were encoded in the font as U+1F4D8s, > > rather than U+F02Ds. Or, if the colors are done at a CSS level, as ?? > > U+1F4D5 CLOSED BOOKs or the like. 
Same goes for the other icons in FA > > which *do *have an emoji counterpart (which would be, I suspect, the > > majority). > > This issue has been raised long ago with the developers of such symbol > fonts: > > > > The reason why this is not being done is the special treatment of emoji > characters by vendors who always replace them by their custom images. Since > such fonts are mostly used on the Web platform, the solution would be a CSS > property to force `text` rendering of emojis: > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jul 24 09:39:40 2017 From: unicode at unicode.org (Costello, Roger L. via Unicode) Date: Mon, 24 Jul 2017 14:39:40 +0000 Subject: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored? Message-ID: Hello Unicode Experts! Suppose an application splits a UTF-8 multi-octet sequence. The application then sends the split sequence to a client. The client must restore the original sequence. Question: is it possible to split a UTF-8 multi-octet sequence in such a way that the client cannot unambiguously restore the original sequence? Here is the source of my question: The iCalendar specification [RFC 5545] says that long lines must be folded: Long content lines SHOULD be split into a multiple line representations using a line "folding" technique. That is, a long line can be split between any two characters by inserting a CRLF immediately followed by a single linear white-space character (i.e., SPACE or HTAB). The RFC says that, when parsing a content line, folded lines must first be unfolded using this technique: Unfolding is accomplished by removing the CRLF and the linear white-space character that immediately follows. The RFC acknowledges that simple implementations might generate improperly folded lines: Note: It is possible for very simple implementations to generate improperly folded lines in the middle of a UTF-8 multi-octet sequence. 
For this reason, implementations need to unfold lines in such a way to properly restore the original sequence. Can you provide an example of folding a UTF-8 multi-octet sequence such that there is no unambiguous way to restore the original sequence? /Roger From unicode at unicode.org Mon Jul 24 10:01:42 2017 From: unicode at unicode.org (Steffen Nurpmeso via Unicode) Date: Mon, 24 Jul 2017 17:01:42 +0200 Subject: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored? References: Message-ID: <20170724150142.wJ3mc%steffen@sdaoden.eu> "Costello, Roger L. via Unicode" wrote: |Suppose an application splits a UTF-8 multi-octet sequence. The application \ |then sends the split sequence to a client. The client must restore \ |the original sequence. | |Question: is it possible to split a UTF-8 multi-octet sequence in such \ |a way that the client cannot unambiguously restore the original sequence? | |Here is the source of my question: | |The iCalendar specification [RFC 5545] says that long lines must be folded: | | Long content lines SHOULD be split | into a multiple line representations | using a line "folding" technique. | That is, a long line can be split between | any two characters by inserting a CRLF | immediately followed by a single linear | white-space character (i.e., SPACE or HTAB). | |The RFC says that, when parsing a content line, folded lines must first \ |be unfolded using this technique: | | Unfolding is accomplished by removing | the CRLF and the linear white-space | character that immediately follows. | |The RFC acknowledges that simple implementations might generate improperly \ |folded lines: | | Note: It is possible for very simple | implementations to generate improperly | folded lines in the middle of a UTF-8 | multi-octet sequence. For this reason, | implementations need to unfold lines | in such a way to properly restore the | original sequence. That is not what the RFC says. 
It says that simple implementations simply split lines when the limit is reached, which might be in the middle of a UTF-8 sequence. The RFC is thus improved compared to other RFCs in the email standard section, which do not give any hints on how to do that. Even RFC 2231, which avoids many of the ambiguities and problems of RFC 2047 (for a different purpose, but still), does not say it so exactly for the reversing character set conversion (which I for one perform _once_ after joining together the chunks, but is not a written word and, thus, ...).

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)

From unicode at unicode.org Mon Jul 24 10:27:09 2017
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Mon, 24 Jul 2017 17:27:09 +0200
Subject: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?
In-Reply-To: <20170724150142.wJ3mc%steffen@sdaoden.eu>
References: <20170724150142.wJ3mc%steffen@sdaoden.eu>
Message-ID:

But at the same time that RFC refers directly to UTF-8 as the default charset, so an implementation of the RFC cannot be agnostic about UTF-8 and will not break in the middle of a conforming UTF-8 sequence.

When the limit is reached, the implementation knows that it cannot cut at the position of a UTF-8 trailing byte, and knows that it can safely roll back at most 3 bytes to locate a conforming UTF-8 leading byte and split the line **before** it (or any 7-bit ASCII byte, to split the line just **after** it). This requires very small buffering and is a fundamental property of UTF-8.

Other character sets -- including /UTF-(16|32)([LB]E)?/ !!!
--- are not directly supported, except by external decoders which would convert their input stream to UTF-8 (with all the same issues that may occur for such conversion when it is not roundtrip compatible or the input does not conform the specificvation of the input charset, but this is not the problem of this RFC: these decoders may also rollback internally or attempt to guess another charset or will use substitution, but they are supposed to generate conforming UTF-8 on output). 2017-07-24 17:01 GMT+02:00 Steffen Nurpmeso via Unicode : > "Costello, Roger L. via Unicode" wrote: > |Suppose an application splits a UTF-8 multi-octet sequence. The > application \ > |then sends the split sequence to a client. The client must restore \ > |the original sequence. > | > |Question: is it possible to split a UTF-8 multi-octet sequence in such \ > |a way that the client cannot unambiguously restore the original sequence? > | > |Here is the source of my question: > | > |The iCalendar specification [RFC 5545] says that long lines must be > folded: > | > | Long content lines SHOULD be split > | into a multiple line representations > | using a line "folding" technique. > | That is, a long line can be split between > | any two characters by inserting a CRLF > | immediately followed by a single linear > | white-space character (i.e., SPACE or HTAB). > | > |The RFC says that, when parsing a content line, folded lines must first \ > |be unfolded using this technique: > | > | Unfolding is accomplished by removing > | the CRLF and the linear white-space > | character that immediately follows. > | > |The RFC acknowledges that simple implementations might generate > improperly \ > |folded lines: > | > | Note: It is possible for very simple > | implementations to generate improperly > | folded lines in the middle of a UTF-8 > | multi-octet sequence. For this reason, > | implementations need to unfold lines > | in such a way to properly restore the > | original sequence. 
> > That is not what the RFC says. It says that simple > implementations simply split lines when the limit is reached, > which might be in the middle of an UTF-8 sequence. The RFC is > thus improved compared to other RFCs in the email standard > section, which do not give any hints on how to do that. Even > RFC 2231, which avoids many of the ambiguities and problems of RFC > 2047 (for a different purpose, but still), does not say it so > exactly for the reversing character set conversion (which i for > one perform _once_ after joining together the chunks, but is not > a written word and, thus, ...). > > --steffen > | > |Der Kragenbaer, The moon bear, > |der holt sich munter he cheerfully and one by one > |einen nach dem anderen runter wa.ks himself off > |(By Robert Gernhardt) > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jul 24 10:39:46 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Mon, 24 Jul 2017 17:39:46 +0200 Subject: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored? In-Reply-To: References: <20170724150142.wJ3mc%steffen@sdaoden.eu> Message-ID: Also note that the maximum line-length in that RFC is a SHOULD and not a MUST. This is intended to give a reasonable hint for the limit used in implementations that process data in the given format: The RFC suggests a maximum line length of 75 "characters", excluding the CRLF+SPACE continuation sequence (not clear here what it means given that it refers to UTF-8: should it be "code units", i.e. bytes?) Due to this ambiguity, all implementations will need to interpret it as id they are actually 75 Unicode characters, which could all be up to 4 bytes in UTF-8, i.e. 300 bytes. 
Most implementations will use input buffers for lines up to 512 bytes (including the CRLF+SPACE continuation), so it will be simpler to handle the case of continuation just AFTER the line length limit has been reached, without ever rolling back. And in all cases, there should never be any continuation sequence CRLF+SPACE in the middle of any UTF-8 sequence without breaking the initial UTF-8 condition which is assumed by this RFC, i.e. without breaking conformance to that RFC.

If an implementation thinks that 75 is a number of bytes, it is wrong, but anyway given the UTF-8 reference, it could still use it but should not break in the middle of a UTF-8 sequence; it will still be safe for it to break just after the sequence, even if the line (excluding the CRLF+SPACE continuation sequence) will be up to 78 bytes long. Decoders will still be able to parse it without breaking if they have the most common 512-byte input buffer.

From unicode at unicode.org Mon Jul 24 10:50:24 2017
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Mon, 24 Jul 2017 08:50:24 -0700
Subject: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?
Message-ID: <20170724085024.665a7a7059d7ee80bb4d670165c8327d.eeac8ac2ac.wbe@email03.godaddy.com>

Costello, Roger L. wrote:

> Suppose an application splits a UTF-8 multi-octet sequence. The
> application then sends the split sequence to a client. The client must
> restore the original sequence.
>
> Question: is it possible to split a UTF-8 multi-octet sequence in such
> a way that the client cannot unambiguously restore the original
> sequence?

1. (Bug) The folding process inserts CRLF plus white space characters, and the unfolding process doesn't properly delete all of them.

2.
(Non-conformant behavior) Some process, after folding and before unfolding, attempts to interpret the partial UTF-8 sequences and converts them into replacement characters or worse.

In a minimally decent implementation, splitting and reassembling a UTF-8 sequence should always yield the correct result; there should be no ambiguity.

A good implementation, of course, would know the character encoding of the data, and would not split multi-byte sequences in that encoding to begin with.

--
Doug Ewell | Thornton, CO, US | ewellic.org

From unicode at unicode.org Mon Jul 24 12:57:43 2017
From: unicode at unicode.org (Costello, Roger L. via Unicode)
Date: Mon, 24 Jul 2017 17:57:43 +0000
Subject: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?
In-Reply-To:
References:
Message-ID:

Hi Folks,

Thank you very much for your fantastic comments!

Below I summarized the issue and your comments. At the bottom is a set of proposed requirements (for my clients) on applications that receive iCalendar files.

Some questions:

- Have I captured all your comments? Any more comments?
- Are the proposed requirements sensible? Any more requirements?

/Roger

Issue: Folding and unfolding content lines in iCalendar files

The iCalendar specification [RFC 5545] says that a content line should not be longer than 75 octets:

    Lines of text SHOULD NOT be longer than 75 octets,
    excluding the line break.

The RFC says that long lines should be folded:

    Long content lines SHOULD be split
    into a multiple line representations
    using a line "folding" technique.
    That is, a long line can be split between
    any two characters by inserting a CRLF
    immediately followed by a single linear
    white-space character (i.e., SPACE or HTAB).

The RFC says that, when parsing a content line, folded lines must first be unfolded:

    When parsing a content line, folded lines
    MUST first be unfolded.

using this technique:

    Unfolding is accomplished by removing
    the CRLF and the linear white-space
    character that immediately follows.

The RFC acknowledges that some implementations might do folding in the middle of a multi-octet sequence:

    Note: It is possible for very simple
    implementations to generate improperly
    folded lines in the middle of a UTF-8
    multi-octet sequence. For this reason,
    implementations need to unfold lines
    in such a way to properly restore the
    original sequence.

Here is an example of folding in the middle of a UTF-8 multi-octet sequence:

The iCalendar file contains the Yen sign (U+00A5), which is represented by the byte sequence 0xC2 0xA5 in UTF-8. The content line containing the Yen sign is folded in the middle of the two bytes. The result is 0xC2 0x0D 0x0A 0x20 0xA5, which isn't valid UTF-8 any longer.

Proposed requirements on the behavior of applications that receive iCalendar files:

1. (Bug) The receiving application does not recognize that it has received an iCalendar file.

2. (Bug) The sending application performs the folding process - inserts CRLF plus white space characters - and the receiving application does the unfolding process but doesn't properly delete all of them.

3. (Non-conformant behavior) The receiving application, after folding and before unfolding, attempts to interpret the partial UTF-8 sequences and convert them into replacement characters or worse.

From unicode at unicode.org Mon Jul 24 14:12:06 2017
From: unicode at unicode.org (J Decker via Unicode)
Date: Mon, 24 Jul 2017 12:12:06 -0700
Subject: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?
In-Reply-To:
References:
Message-ID:

On Mon, Jul 24, 2017 at 10:57 AM, Costello, Roger L. via Unicode <unicode at unicode.org> wrote:

> Hi Folks,
>
> 2.
(Bug) The sending application performs the folding process - inserts
> CRLF plus white space characters - and the receiving application does the
> unfolding process but doesn't properly delete all of them.

The RFC doesn't say 'characters' but either a space or a tab character (singular).

Back scanning is simple enough:

    while( ( from[0] & 0xC0 ) == 0x80 )
        from--;

It should probably also check that from > (start+1), but since it should be applied at 75-ish characters, that would be implicitly true.

From unicode at unicode.org Mon Jul 24 15:50:05 2017
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Mon, 24 Jul 2017 22:50:05 +0200
Subject: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?
In-Reply-To:
References:
Message-ID:

2017-07-24 21:12 GMT+02:00 J Decker via Unicode :

> On Mon, Jul 24, 2017 at 10:57 AM, Costello, Roger L. via Unicode <unicode at unicode.org> wrote:
>
>> 2. (Bug) The sending application performs the folding process - inserts
>> CRLF plus white space characters - and the receiving application does the
>> unfolding process but doesn't properly delete all of them.
>
> The RFC doesn't say 'characters' but either a space or a tab character
> (singular)
>
> back scanning is simple enough
>
>     while( ( from[0] & 0xC0 ) == 0x80 )
>         from--;

Certainly not like this! Backscanning should only directly use a single assignment to the last known start position, no loop at all! UTF-8 security is based on the fact that its sequences are strictly limited in length so that you will never have more than 3 trailing bytes.

If you don't have that last position in a variable, just use 3 tests but NO loop at all: if all 3 tests are failing, you know the input was not valid at all, and the way to handle this error will not be solved simply by using a very insecure unbounded loop like above but by exiting and returning an error immediately, or throwing an exception.

The code should better be:

    if ((from[0] & 0xC0) == 0x80) from--;
    else if ((from[-1] & 0xC0) == 0x80) from -= 2;
    else if ((from[-2] & 0xC0) == 0x80) from -= 3;
    if ((from[0] & 0xC0) == 0x80) throw (some exception);
    // continue here with character encoded as UTF-8 starting at "from"
    // (an ASCII byte or a UTF-8 leading byte)

And it should be secured using a guard byte at the start of the buffer into which the "from" pointer points, so that it will never read something else and can generate an error.

From unicode at unicode.org Mon Jul 24 16:03:50 2017
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Mon, 24 Jul 2017 23:03:50 +0200
Subject: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?
In-Reply-To:
References:
Message-ID:

2017-07-24 22:50 GMT+02:00 Philippe Verdy :

> The code should better be:
>
>     if ((from[0] & 0xC0) == 0x80) from--;
>     else if ((from[-1] & 0xC0) == 0x80) from -= 2;
>     else if ((from[-2] & 0xC0) == 0x80) from -= 3;
>     if ((from[0] & 0xC0) == 0x80) throw (some exception);
>     // continue here with character encoded as UTF-8 starting at "from"
>     // (an ASCII byte or a UTF-8 leading byte)

Sorry, sent too fast, I should not have copy-pasted lines trying to adapt your loop; the correct code uses no "else" at all:

    if ((from[0] & 0xC0) == 0x80) from--;
    if ((from[0] & 0xC0) == 0x80) from--;
    if ((from[0] & 0xC0) == 0x80) from--;
    if ((from[0] & 0xC0) == 0x80) throw (some exception);
    // continue here with character encoded as UTF-8 starting at "from"
    // (an ASCII byte or a UTF-8 leading byte)

From unicode at unicode.org Mon Jul 24 16:23:12 2017
From: unicode at unicode.org (J Decker via Unicode)
Date: Mon, 24 Jul 2017 14:23:12 -0700
Subject: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?
In-Reply-To:
References:
Message-ID:

On Mon, Jul 24, 2017 at 1:50 PM, Philippe Verdy wrote:

> If you don't have that last position in a variable, just use 3 tests but
> NO loop at all: if all 3 tests are failing, you know the input was not
> valid at all, and the way to handle this error will not be solved simply by
> using a very insecure unbounded loop like above but by exiting and returning
> an error immediately, or throwing an exception.
>
> The code should better be:
>
>     if ((from[0] & 0xC0) == 0x80) from--;
>     else if ((from[-1] & 0xC0) == 0x80) from -= 2;
>     else if ((from[-2] & 0xC0) == 0x80) from -= 3;
>     if ((from[0] & 0xC0) == 0x80) throw (some exception);
>     // continue here with character encoded as UTF-8 starting at "from"
>     // (an ASCII byte or a UTF-8 leading byte)

I generally accepted any utf-8 encoding up to 31 bits though (since I was going from the original spec, and not what was the effective limit based on the unicode codepoint space), and the while loop is more terse; but it is less optimal because of code pipeline flushing from the backward jump; so yes, the if series is much better :) (the original code also has the start of the string, and strings are effectively prefixed with a 0 byte anyway because of a long little endian size) and you'd probably be tracking an output offset also, so it becomes a little longer than the above.

> And it should be secured using a guard byte at start of your buffer in
> which the "from" pointer was pointing, so that it will never read something
> else and can generate an error.
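The bounded back-scan and the fold/unfold round trip discussed in the messages above can be sketched in C roughly as follows. This is only a minimal sketch under stated assumptions, not code from any poster or from RFC 5545: the names utf8_safe_cut, ical_fold and ical_unfold are hypothetical, the leading SPACE of a continuation line is assumed to count toward the octet limit, and the limit is assumed to be at least 5 so that a 4-byte sequence always fits. The back-scan uses a loop capped at three steps, which is equivalent to the three unrolled tests.

```c
#include <stddef.h>
#include <string.h>

/* 1 if b is a UTF-8 trailing (continuation) byte, i.e. 10xxxxxx. */
static int is_trail(unsigned char b) { return (b & 0xC0) == 0x80; }

/*
 * Hypothetical helper: move a proposed cut offset back to the nearest
 * sequence boundary. buf[cut] would become the first octet of the next
 * line, so it must not be a trailing byte. At most 3 steps back are
 * tried, since a sequence has at most 3 trailing bytes.
 * Returns the adjusted offset, or (size_t)-1 on malformed input.
 */
size_t utf8_safe_cut(const unsigned char *buf, size_t len, size_t cut)
{
    int steps;
    if (cut >= len)
        return len;
    for (steps = 0; steps < 3 && cut > 0 && is_trail(buf[cut]); steps++)
        cut--;
    return is_trail(buf[cut]) ? (size_t)-1 : cut;
}

/*
 * Fold `in` (one unfolded content line, no line break) into `out` so
 * that no physical line exceeds `limit` octets, inserting CRLF SPACE
 * only between complete UTF-8 sequences.
 * Returns the number of octets written, or (size_t)-1 on malformed
 * input, a too-small limit, or insufficient output space.
 */
size_t ical_fold(const unsigned char *in, size_t in_len,
                 unsigned char *out, size_t out_cap, size_t limit)
{
    size_t pos = 0, w = 0;
    size_t room = limit;            /* the first line may use the full limit */

    while (in_len - pos > room) {
        size_t cut = utf8_safe_cut(in, in_len, pos + room);
        if (cut == (size_t)-1 || cut <= pos)
            return (size_t)-1;
        if (w + (cut - pos) + 3 > out_cap)
            return (size_t)-1;
        memcpy(out + w, in + pos, cut - pos);
        w += cut - pos;
        memcpy(out + w, "\r\n ", 3);
        w += 3;
        pos = cut;
        room = limit - 1;           /* assumption: the SPACE counts toward the limit */
    }
    if (w + (in_len - pos) > out_cap)
        return (size_t)-1;
    memcpy(out + w, in + pos, in_len - pos);
    return w + (in_len - pos);
}

/*
 * Unfold in place at the octet level, before any UTF-8 decoding:
 * every CRLF followed by SPACE or HTAB is removed. Because no decoding
 * happens first, a sequence that a simple folder split in the middle
 * is rejoined intact. Returns the new length.
 */
size_t ical_unfold(unsigned char *buf, size_t len)
{
    size_t r = 0, w = 0;
    while (r < len) {
        if (r + 2 < len && buf[r] == '\r' && buf[r + 1] == '\n' &&
            (buf[r + 2] == ' ' || buf[r + 2] == '\t'))
            r += 3;
        else
            buf[w++] = buf[r++];
    }
    return w;
}
```

Unfolding on raw octets is the design choice that matters for receivers: the Yen sign example 0xC2 0x0D 0x0A 0x20 0xA5 comes back as 0xC2 0xA5 as long as nothing tries to decode the text before the CRLF and white-space octet are removed.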
From unicode at unicode.org Mon Jul 24 17:35:43 2017
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Mon, 24 Jul 2017 15:35:43 -0700
Subject: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?
Message-ID: <20170724153543.665a7a7059d7ee80bb4d670165c8327d.3799fdf420.wbe@email03.godaddy.com>

J Decker wrote:

> I generally accepted any utf-8 encoding up to 31 bits though ( since
> I was going from the original spec, and not what was effective limit
> based on unicode codepoint space)

Hey, everybody: Don't do that.

UTF-8 has been constrained to the Unicode code space (maximum U+10FFFF, four bytes) for almost fourteen years now.

--
Doug Ewell | Thornton, CO, US | ewellic.org

From unicode at unicode.org Mon Jul 24 18:52:09 2017
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Tue, 25 Jul 2017 01:52:09 +0200
Subject: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?
In-Reply-To: <20170724153543.665a7a7059d7ee80bb4d670165c8327d.3799fdf420.wbe@email03.godaddy.com>
References: <20170724153543.665a7a7059d7ee80bb4d670165c8327d.3799fdf420.wbe@email03.godaddy.com>
Message-ID:

2017-07-25 0:35 GMT+02:00 Doug Ewell via Unicode :

> J Decker wrote:
>
> > I generally accepted any utf-8 encoding up to 31 bits though ( since
> > I was going from the original spec, and not what was effective limit
> > based on unicode codepoint space)
>
> Hey, everybody: Don't do that.
>
> UTF-8 has been constrained to the Unicode code space (maximum U+10FFFF,
> four bytes) for almost fourteen years now.

I fully agree. This is now an essential part of UTF-8 that has helped secure it (including against the dangerous unbounded loops scanning through buffers in memory), and also helped improve performance (when unrolling loops that you no longer need to count separately, the code expansion is small enough that branch prediction still works well and the code can benefit from caching).
Due to the way the UCS code space is allocated and how it is used, the branches in your code have very distinctive patterns that are easy to enumerate; test coverage for those branches is possible without exploding combinatorially: this eliminates the need for heuristics.

And about the RFC we were discussing, it is rather recent compared to the approved stabilization of UTF-8 and finally its endorsement by the industry. UTF-8 is strictly bound to 4 bytes and nothing more. This allows other things to be developed on top of this fact and used now as a checked assumption that cannot be broken except by software bugs that will soon create security problems when checked assumptions are no longer checked throughout a processing chain.

The old RFC was not "UTF-8" (even if that name was proposed, it was not really assigned) but an early proposal in discussion that did not reach the level of standard or best practice; it was experimental, and at that time there were several other candidates (including also UTF-7 which is now almost abandoned, and BOCU-8 which is now marginal but was also bound to the 17 planes limit). The encoding of the old RFC should just be given another name, but it is not used for encoding only text; it was describing in fact a binary format (but for a generic variable-length binary encoding of numbers there are now better candidates, which are also not limited to just 31 bits or even just to unsigned integers, and are also faster to process and more compact, and have more interesting properties for code analysis and resistance to encoding and transmission/storage errors).

In the IANA database for charsets, the old RFC encoding has a separate identifier, but "UTF-8" refers to RFC 3629 (IETF standard 63); the former proposals in RFC 2279 or RFC 2044 have never been approved standards, but just drafts mapped in IANA as the obsolete "UNICODE-1-1-UTF-8" (retired later as it was never approved by Unicode).
The only remaining "charset" in the IANA database that refers to 31-bit code points is "ISO-10646-UCS-4", but it does not use variable-length encoding and does not specify any byte order; it is just a basic subtype for a range of positive integers, without any restriction of use, and not necessarily representing text. It is a very inefficient way to encode them, only meant as an internal temporary transform in transient memory or CPU registers (at least for 32-bit CPUs or higher: that is now almost always the case today even in embedded systems, as 4-, 8- or 16-bit CPUs are almost dead or will not be used for international text processing; even the simplest keyboard controllers that manage ~100-150 keys and a few LEDs, and report at 1 kHz for the fastest ones, are now internally using 32-bit CPUs).

From unicode at unicode.org Fri Jul 28 07:22:22 2017
From: unicode at unicode.org (William_J_G Overington via Unicode)
Date: Fri, 28 Jul 2017 13:22:22 +0100 (BST)
Subject: Turtle Graphics Emoji
Message-ID: <4394434.24432.1501244542640.JavaMail.defaultUser@defaultHost>

I have been thinking about having Turtle Graphics Emoji as an educational and fun idea.

Turtle Graphics Emoji would each be for one turtle graphics command, such as forward, right and left and then there could be digits in a text message after the emoji character to act as the parameter to the turtle graphics command. There could also be a few associated emoji for start, pause and stop and for expressing loops.

I am thinking that Turtle Graphics Emoji would be both educational and fun.

William Overington

Friday 28 July 2017

From unicode at unicode.org Fri Jul 28 10:20:02 2017
From: unicode at unicode.org (Mark Davis ☕️ via Unicode)
Date: Fri, 28 Jul 2017 17:20:02 +0200
Subject: Turtle Graphics Emoji
In-Reply-To: <4394434.24432.1501244542640.JavaMail.defaultUser@defaultHost>
References: <4394434.24432.1501244542640.JavaMail.defaultUser@defaultHost>
Message-ID:

Producing emoji sticker sets and apps requires no involvement of Unicode or any other organization. So you can find out on your own whether there is an audience for your "Turtle Graphics Emoji".

Mark (https://twitter.com/mark_e_davis)

From unicode at unicode.org Fri Jul 28 19:26:25 2017
From: unicode at unicode.org (Shriramana Sharma via Unicode)
Date: Sat, 29 Jul 2017 05:56:25 +0530
Subject: Turtle Graphics Emoji
In-Reply-To:
References: <4394434.24432.1501244542640.JavaMail.defaultUser@defaultHost>
Message-ID:

for animal in animalKingdom:
    createEmojiProposal(animal)

?

Emoji are a veritable Pandora box.

From unicode at unicode.org Sat Jul 29 16:29:40 2017
From: unicode at unicode.org (Christoph Päper via Unicode)
Date: Sat, 29 Jul 2017 23:29:40 +0200 (CEST)
Subject: Turtle Graphics Emoji
In-Reply-To:
References: <4394434.24432.1501244542640.JavaMail.defaultUser@defaultHost>
Message-ID: <1753979636.12822.1501363780421@ox.hosteurope.de>

Shriramana Sharma via Unicode:

> for animal in animalKingdom:
>     createEmojiProposal(animal)
>
> Emoji are a veritable Pandora box.

It makes sense to look at where (animal) emojis came from and then decide how this might need to be expanded to avoid cultural bias etc.

A large bunch of the original animal emojis was meant for the Eastern (Chinese/Asian) and Western (European/Mediterranean) zodiac signs. There are several other astrological traditions that involve animals which have not yet been covered. My very superficial first research (which largely ignores the three Southern-hemispheric continents of South America, Africa and Australia/Oceania) indicates that about a dozen animal emojis should be added for this reason alone:

- Crane
- Goose
- Hawk or Falcon (or generic Bird of Prey)
- Raven or Crow
- Seagull
- Swan
- Badger
- Beaver
- Otter
- Seal
- Seahorse

There are other culturally established sets of pictographic animals that might be used for systematic additions to Unicode, e.g. street signs. The existing Australian road signs, for instance, would suggest 3 additional animal emojis:

- Kangaroo
- Ratite (Emu)
- Wombat

If we were to take a look at existing pictographic scripts, some of which have already been added to Unicode, we will see some other local favorites, e.g. in Egyptian hieroglyphs (U+130D2..131AC), also cf. L2/15-208:

- Jackal
- Donkey
- Hippopotamus
- Panther
- Oryx, Gazelle, Ibex
- Vulture, Buzzard, Falcon
- Ibis, Flamingo, Stork, Heron, Cormorant
- Ostrich
- Swallow, Sparrow
- Goose, Pintail
- Catfish
- Dung Beetle
- ...

Alas, Mark Davis tells us in L2/17-206 that all that matters is expected popularity and an interest (and no veto) by a handful of US-based companies (or their representatives). Please note that popularity, of course, does not equal (neither global nor local) usage frequency, and the methods to assess the predictions are really shaky.

From unicode at unicode.org Sat Jul 29 14:04:41 2017
From: unicode at unicode.org (William_J_G Overington via Unicode)
Date: Sat, 29 Jul 2017 20:04:41 +0100 (BST)
Subject: Turtle Graphics Emoji
In-Reply-To:
References: <4394434.24432.1501244542640.JavaMail.defaultUser@defaultHost>
Message-ID: <28297318.34209.1501355081853.JavaMail.defaultUser@defaultHost>

> for animal in animalKingdom:
>     createEmojiProposal(animal)

Did you miss a semicolon off the end of that!? ?

> Emoji are a veritable Pandora box.

Is there an emoji for that!?

The name Pandora reminded me that an electric locomotive was named Pandora. So I searched and found that a more recent electro-diesel locomotive has also been named Pandora.

https://en.wikipedia.org/wiki/British_Rail_Class_88

I then wondered if the name Pandora had been used for a steam locomotive and it seemed to me that a Great Western Railway broad gauge locomotive from the nineteenth century might have had that sort of classical name, so I searched and found that one did.

https://en.wikipedia.org/wiki/List_of_GWR_broad_gauge_locomotives

However, although I have rambled off-topic I wonder if you might like the following about how Turtle Graphics are being used for education in the United Kingdom.

https://www.turtle.ox.ac.uk/

It just seems to me that as emoji are popular that to have some turtle graphics emoji would be both educational and fun.

I am hoping that people on this mailing list will like to have a lively discussion on this topic, like there used to be lively discussions in this mailing list years ago.

William Overington

Saturday 29 July 2017

From unicode at unicode.org Sat Jul 29 17:58:06 2017
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sat, 29 Jul 2017 23:58:06 +0100
Subject: Turtle Graphics Emoji
In-Reply-To: <4394434.24432.1501244542640.JavaMail.defaultUser@defaultHost>
References: <4394434.24432.1501244542640.JavaMail.defaultUser@defaultHost>
Message-ID: <20170729235806.47b0005f@JRWUBU2>

On Fri, 28 Jul 2017 13:22:22 +0100 (BST) William_J_G Overington via Unicode wrote:

> I have been thinking about having Turtle Graphics Emoji as an
> educational and fun idea.

I trust you are aware of the widespread feeling that there is already an excessive number of turtle characters in Unicode!

Richard.