From A.Schappo at lboro.ac.uk Fri Jan 1 08:00:01 2016 From: A.Schappo at lboro.ac.uk (Andre Schappo) Date: Fri, 1 Jan 2016 14:00:01 +0000 Subject: Unicode in the Curriculum? In-Reply-To: References: <567331D2.1000007@gmail.com> <59AD8E6F-F67A-4FFD-B364-6AC9E555D23D@lboro.ac.uk> <6370df3a44124747bb580b2760e9d144@EX13D08UWB002.ant.amazon.com> Message-ID: <16BCDCD6-502F-4382-BA62-109160E18752@lboro.ac.uk> Julian, We have very different POVs on this topic. You raise a number of issues which would take me many, many thousands of words to discuss properly. I will attempt a summary discussion of some of the issues. • IT i18n is a huge subject area. Unicode is only one component. My module included: s/w i18n & L10n, character sets, Unicode & Unicode encodings, fonts, keyboard mappings, input methods, language tags, IDNs, website i18n, adaptive i18n websites, and characteristics of human language scripts. One of my problems when putting the module together was deciding what to leave out. Actually, every year I have tended to add a bit more teaching material to the module. It was always along the lines of "Oh! I cannot leave that out ..." • Much of IT i18n has a WOW factor. I have many times seen the "WOW! I didn't know that! Really!" reactions from students when teaching them about IT i18n. • There is also the cultural aspect, which adds an extra richness, depth and interest to IT i18n. • IT i18n has many layers of detail. Each layer has concepts and realisations. Using your terminology, each layer has intellectual content. • Technical skills encompass and embody concepts and realisations. A technical skill is not adaptable/flexible unless one understands the concepts and has had the realisations. Computer Science needs to give students such technical skills so that they will be able to function and contribute in the non-academic world. • In my experience, just telling students to read the manuals does not work. Most need to be guided to the realisations and concepts. Many 1st year students have been programming from an early age, like 9/10 years old, and are gifted programmers. They have encountered Unicode and hacked solutions to immediate problems. But they do not have an understanding of Unicode, and there are many layers to understand in Unicode. • My primary aim/goal/passion is to teach/encourage students to code for the World and not just Britain. • The current situation is that the majority of students (and, actually, academic staff) do not even think about i18n of their Apps/Systems/Websites. e.g. I will say to a Final Year Project student: "Have you thought about internationalising your s/w?" Time and time again the response is: No. • There is a lot of ongoing development of i18n features in CSS, HTML, programming languages and social media. All these developments need to be studied and taught. • Surely one of the purposes of lecturing is to make the complicated simpler. When I first started (self-)studying Unicode I was completely baffled. I was overwhelmed with a mass of data, concepts, techniques, reports and standards. I just kept reading and thinking and experimenting. I read about Unicode from many different points of view. I wrote code to process Unicode text. That took a lot of effort and time. Now I consider myself knowledgeable about Unicode and am in a position to make Unicode simpler for students. A 1-hour lecture from me on Unicode will save a student days of self study. Students have a very heavy workload and do not have time for unguided and unstructured self study. All for now ... André 
Schappo On 31 Dec 2015, at 18:58, Julian Bradfield wrote: > On 2015-12-31, Andre Schappo wrote: > >> I have been hitting my head against the Academic Brick Wall for >> years WRT getting IT i18n and Unicode on the curriculum and I am >> losing. I did teach a final year elective module on IT i18n but a >> few months ago my University dropped it. I am continually puzzled by >> the lack of interest University Computer Science departments have in >> i18n. I appear to be a solitary UK University Computer Science voice >> when it comes to i18n. > > Well, I'd say that it's not the business of Computer Science degrees > to teach specific technical skills. It's our business to help people > learn about the fundamentals of the subject, so that they can acquire > any specific skill on demand, and use that skill competently. In those > areas where we do teach specific skills (e.g. machine learning > techniques) we teach those that have some intellectual content to > them. (This is why we don't teach programming languages as such - we > teach a programming language as a means of learning a programming > paradigm.) > > In my experience so far, using Unicode and doing i18n is not very > interesting (killingly boring, actually) from a purely CS technical > point of view, unless you happen to be one of the small minority who > enjoys script and font layout issues - the interesting bits of doing > i18n are in producing linguistically and culturally appropriate > messages, and that's where one should bring in experts, not expect > typical software developers to be able to do it. > > If you still have the materials for your course, it would be > interesting to see how you managed to get an interesting (and > examinable!) course out of i18n. > > I do in fact mention Unicode and i18n in my introductory programming > course (which is not for CS students), but all I say is "you should > know it's there, and if you become a competent programmer, then you > can read the manuals and tutorials to learn what you need". > > -- > The University of Edinburgh is a charitable body, registered in > Scotland, with registration number SC005336. > From asmus-inc at ix.netcom.com Fri Jan 1 14:09:13 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Fri, 1 Jan 2016 12:09:13 -0800 Subject: Unicode in the Curriculum? In-Reply-To: <16BCDCD6-502F-4382-BA62-109160E18752@lboro.ac.uk> References: <567331D2.1000007@gmail.com> <59AD8E6F-F67A-4FFD-B364-6AC9E555D23D@lboro.ac.uk> <6370df3a44124747bb580b2760e9d144@EX13D08UWB002.ant.amazon.com> <16BCDCD6-502F-4382-BA62-109160E18752@lboro.ac.uk> Message-ID: <5686DCE9.1060401@ix.netcom.com> An HTML attachment was scrubbed... URL: From scarboroughben at gmail.com Sun Jan 3 22:54:50 2016 From: scarboroughben at gmail.com (Ben Scarborough) Date: Sun, 3 Jan 2016 22:54:50 -0600 Subject: Errors in CJK F chart in L2/15-339 Message-ID: I've found at least two errors in the CJK F charts in L2/15-339 (and thus in WG2 N4705 and IRG N2130 as well). Who should I contact about this? ?Ben Scarborough -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mpsuzuki at hiroshima-u.ac.jp Mon Jan 4 00:05:00 2016 From: mpsuzuki at hiroshima-u.ac.jp (suzuki toshiya) Date: Mon, 4 Jan 2016 15:05:00 +0900 Subject: [Unicode] Errors in CJK F chart in L2/15-339 In-Reply-To: <2ac9cc08d6334ea0a78759d3ee04a6f9@PS1PR04MB0953.apcprd04.prod.outlook.com> References: <2ac9cc08d6334ea0a78759d3ee04a6f9@PS1PR04MB0953.apcprd04.prod.outlook.com> Message-ID: <568A0B8C.9060205@hiroshima-u.ac.jp> Hi, I'm interested in the errors you found. The stabilization of CJK F is a very important work of this year. Usually IRG expects the submissions from the members (e.g. UTC), but in this case, you can submit your individual contribution to IRG, I guess. Please contact the chair of IRG, Dr. Lu Qin, at csluqin at comp.polyu.edu.hk Regards, suzuki toshiya, Hiroshima University, Japan Ben Scarborough wrote: > I've found at least two errors in the CJK F charts in L2/15-339 (and thus in WG2 N4705 and IRG N2130 as well). Who should I contact about this? > > - Ben Scarborough > From jknappen at web.de Mon Jan 4 02:06:03 2016 From: jknappen at web.de (=?UTF-8?Q?=22J=C3=B6rg_Knappen=22?=) Date: Mon, 4 Jan 2016 09:06:03 +0100 Subject: Aw: Symbol for an upside down capital L, pointing to the right? In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From jknappen at web.de Mon Jan 4 02:15:09 2016 From: jknappen at web.de (=?UTF-8?Q?=22J=C3=B6rg_Knappen=22?=) Date: Mon, 4 Jan 2016 09:15:09 +0100 Subject: Turned Capital letter L (pointing to the left, with serifs) Message-ID: An HTML attachment was scrubbed... URL: From asmus-inc at ix.netcom.com Mon Jan 4 04:16:59 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Mon, 4 Jan 2016 02:16:59 -0800 Subject: Turned Capital letter L (pointing to the left, with serifs) In-Reply-To: References: Message-ID: <568A469B.3060401@ix.netcom.com> An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Mon Jan 4 05:31:23 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 4 Jan 2016 12:31:23 +0100 Subject: Symbol for an upside down capital L, pointing to the right? In-Reply-To: References: Message-ID: Maybe because in this document it is a measurement of time in seconds, and related to the letter T/Tau rather than G/Gamma. So the idea of the Tironian Et is not so stupid, even if the glyph used by the printer is higher than expected (and most probably borrowed from another font, possibly using it for the digit 7). 2016-01-04 9:06 GMT+01:00 "Jörg Knappen" : > Err... in what respect would this symbol be different from a CAPITAL GREEK > LETTER GAMMA? > > --Jörg Knappen > > *Sent:* Friday, 25 December 2015 at 14:43 > *From:* "Costello, Roger L." > *To:* "unicode at unicode.org" > *Subject:* Symbol for an upside down capital L, pointing to the right? > Hi Folks, > > Here is the upside down capital L, pointing to the left: > > ⅂ - TURNED SANS-SERIF CAPITAL L (U+2142) > > Is there a symbol for an upside down capital L, pointing to the right? > > /Roger > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From everson at evertype.com Mon Jan 4 07:20:14 2016 From: everson at evertype.com (Michael Everson) Date: Mon, 4 Jan 2016 13:20:14 +0000 Subject: Symbol for an upside down capital L, pointing to the right? In-Reply-To: References: Message-ID: <051AA876-8469-4A75-AE03-98031D28F09E@evertype.com> On 4 Jan 2016, at 08:06, Jörg Knappen wrote: > Err... in what respect would this symbol be different from a CAPITAL GREEK LETTER GAMMA? 
Perhaps that is not the right question. Gamma is only one of many right-angle letter characters in the standard. The question remains: what would an INVERTED SANS-SERIF CAPITAL L be used for? Michael Everson * http://www.evertype.com/ From everson at evertype.com Mon Jan 4 07:25:24 2016 From: everson at evertype.com (Michael Everson) Date: Mon, 4 Jan 2016 13:25:24 +0000 Subject: Turned Capital letter L (pointing to the left, with serifs) In-Reply-To: References: Message-ID: <5BFE25B1-6487-43AF-B01F-68C3F5C5A1BD@evertype.com> On 4 Jan 2016, at 08:15, Jörg Knappen wrote: > > Here is a report of a rather strange beast occurring in historical math printing (work of C. F. Gauß) in the 19th century: > > http://tex.stackexchange.com/questions/284483/how-do-i-typeset-this-symbol-possibly-astronomical The image there is clearly a digit 7. > images are here: > > http://www.archive.org/stream/abhandlungenmet00gausrich#page/n129/mode/2up This will not load for me. > http://i.stack.imgur.com/57fN3.png Again, this is a digit 7. From a different font than the other 7's set there. > It looks like a big digit "7" or like a turned letter "L". In the accepted answer it was identified with the Tironian note et; an identification > I'd dispute because the Tironian note Et is usually smaller in size than a capital Latin letter. It is not a Tironian et. The Tironian Et typically has a descender and goes to x-height. Also the horizontal stroke would never be written like that 7, and indeed the angle (if less than 90°) of the descender wouldn't be so small. Michael Everson * http://www.evertype.com/ From frederic.grosshans at gmail.com Mon Jan 4 08:06:24 2016 From: frederic.grosshans at gmail.com (=?UTF-8?B?RnLDqWTDqXJpYyBHcm9zc2hhbnM=?=) Date: Mon, 04 Jan 2016 14:06:24 +0000 Subject: Turned Capital letter L (pointing to the left, with serifs) In-Reply-To: References: Message-ID: On Mon 4 Jan 2016 at 09:18, "Jörg Knappen" wrote: > Here is a report of a rather strange beast occurring in historical math > printing (work of C. F. Gauß) in the 19th century: > > > http://tex.stackexchange.com/questions/284483/how-do-i-typeset-this-symbol-possibly-astronomical > > images are here: > > http://www.archive.org/stream/abhandlungenmet00gausrich#page/n129/mode/2up > http://i.stack.imgur.com/57fN3.png > > It looks like a big digit "7" or like a turned letter "L". In the accepted > answer it was identified with the Tironian note et; an identification > I'd dispute because the Tironian note Et is usually smaller in size than a > capital Latin letter. > I don't know what the glyph is, but I doubt that a digit or Tironian et makes sense semantically. Since it corresponds to an angular measure (the daily angular displacement of a celestial body), the Unicode character corresponding to it is likely ⦢ U+29A2 TURNED ANGLE Frédéric -------------- next part -------------- An HTML attachment was scrubbed... URL: From ejp10 at psu.edu Mon Jan 4 08:44:38 2016 From: ejp10 at psu.edu (Elizabeth J. Pyatt) Date: Mon, 4 Jan 2016 09:44:38 -0500 Subject: Unicode in the Curriculum? (Julian Bradfield) In-Reply-To: References: Message-ID: Like some others on the list, I believe Unicode should be mentioned at different points in a programming curriculum, particularly at the time when ASCII would be taught. Font design and typography is perhaps a different topic, but if it's mentioned, why not mention CSS/font options for different scripts? Any cloud-based tool with ambitions to be a force in the global market MUST use Unicode. 
Tools such as Twitter, WordPress, Wikipedia, Facebook, Google Docs, Apple Mail/Outlook/Thunderbird and others work with multiple languages because of Unicode. And yes, we all want to use our emojis! Optimal Unicode support means getting it right the first time to support thousands of languages, instead of adding language support one by one. Even "English only" pages, particularly educational pages, can include characters outside of Latin-1 such as math and technical symbols, smart curly quotes, long dashes, and yes, non-English words. I long for the day when I will no longer see phrases with mangled punctuation like: "They%!re half their size%#Weight Loss Winners". Thanks to Unicode-savvy Web designers, it's a sight seen much less than 10 years ago. Elizabeth =-=-=-=-=-=-=-=-=-=-=-=-= Elizabeth J. Pyatt, Ph.D. Instructional Designer Teaching and Learning with Technology Penn State University ejp10 at psu.edu, (814) 865-0805 or (814) 865-2030 (Main Office) 210 Rider Building (formerly Rider II) 227 W. Beaver Avenue State College, PA 16801-4819 http://www.personal.psu.edu/ejp10/psu http://tlt.psu.edu From raymond at almanach.co.uk Mon Jan 4 09:38:02 2016 From: raymond at almanach.co.uk (Raymond Mercier) Date: Mon, 4 Jan 2016 15:38:02 -0000 Subject: Turned Capital letter L (pointing to the left, with serifs) In-Reply-To: References: Message-ID: <36275E73002B46CFAE7F2D3A81917380@UserPC> The sign described as like 7 is surely a cursive form of π. The form used by Gauss (Disquisitio de elementis ellipticis Palladis) is much the same as that shown in manuals of Greek Palaeography as a cursive π. This is given by E.P. Thompson in two works, An Introduction to Greek and Latin Palaeography, Oxford, 1912, p.83, and A Handbook of Greek and Latin Palaeography, Chicago, 1975, p. 95. Raymond Mercier -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Pi_Abbrev.jpg Type: image/jpeg Size: 14412 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: GaussPallas_21.jpg Type: image/jpeg Size: 80689 bytes Desc: not available URL: From everson at evertype.com Mon Jan 4 09:49:15 2016 From: everson at evertype.com (Michael Everson) Date: Mon, 4 Jan 2016 15:49:15 +0000 Subject: Turned Capital letter L (pointing to the left, with serifs) In-Reply-To: <36275E73002B46CFAE7F2D3A81917380@UserPC> References: <36275E73002B46CFAE7F2D3A81917380@UserPC> Message-ID: <041CC0E9-0FE5-4A57-A61F-529C9D96B58E@evertype.com> Excellent! Looks like a candidate character for encoding. I'm sure I have some examples of good font designs for the old character in one of my books. > On 4 Jan 2016, at 15:38, Raymond Mercier wrote: > > The sign described as like 7 is surely a cursive form of π. The form used by Gauss (Disquisitio de elementis ellipticis Palladis) is much the same as that shown in manuals of Greek Palaeography as a cursive π. This is given by E.P. Thompson in two works, An Introduction to Greek and Latin Palaeography, Oxford, 1912, p.83, and A Handbook of Greek and Latin Palaeography, Chicago, 1975, p. 95. 
> Raymond Mercier > Michael Everson * http://www.evertype.com/ From asmus-inc at ix.netcom.com Mon Jan 4 10:54:18 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Mon, 4 Jan 2016 08:54:18 -0800 Subject: Turned Capital letter L (pointing to the left, with serifs) In-Reply-To: <041CC0E9-0FE5-4A57-A61F-529C9D96B58E@evertype.com> References: <36275E73002B46CFAE7F2D3A81917380@UserPC> <041CC0E9-0FE5-4A57-A61F-529C9D96B58E@evertype.com> Message-ID: <568AA3BA.1030201@ix.netcom.com> An HTML attachment was scrubbed... URL: From asmus-inc at ix.netcom.com Mon Jan 4 10:59:23 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Mon, 4 Jan 2016 08:59:23 -0800 Subject: Unicode in the Curriculum? (Julian Bradfield) In-Reply-To: References: Message-ID: <568AA4EB.5070606@ix.netcom.com> An HTML attachment was scrubbed... URL: From asmus-inc at ix.netcom.com Mon Jan 4 10:59:57 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Mon, 4 Jan 2016 08:59:57 -0800 Subject: Aw: Symbol for an upside down capital L, pointing to the right? In-Reply-To: References: Message-ID: <568AA50D.9020007@ix.netcom.com> An HTML attachment was scrubbed... URL: From dwanders at sonic.net Mon Jan 4 11:09:55 2016 From: dwanders at sonic.net (Deborah W. Anderson) Date: Mon, 4 Jan 2016 09:09:55 -0800 Subject: Errors in CJK F chart in L2/15-339 In-Reply-To: References: Message-ID: <000e01d14712$bc3972c0$34ac5840$@sonic.net> Dear Ben, Please send any comments on the proposed additional repertoire for the 5th edition (CD2, L2/15-339) via the contact form http://www.unicode.org/reporting.html. (Shortly I expect there will be PRI for feedback on L2/15-339, but it isn?t up yet.) Forwarding the feedback to Dr. Lu is also a very good idea, as recommended by Suzuki-san. Thanks, Debbie Anderson From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Ben Scarborough Sent: Sunday, January 03, 2016 8:55 PM To: unicode at unicode.org Subject: Errors in CJK F chart in L2/15-339 I've found at least two errors in the CJK F charts in L2/15-339 (and thus in WG2 N4705 and IRG N2130 as well). Who should I contact about this? ?Ben Scarborough -------------- next part -------------- An HTML attachment was scrubbed... URL: From everson at evertype.com Mon Jan 4 12:41:53 2016 From: everson at evertype.com (Michael Everson) Date: Mon, 4 Jan 2016 18:41:53 +0000 Subject: Turned Capital letter L (pointing to the left, with serifs) In-Reply-To: <568AA3BA.1030201@ix.netcom.com> References: <36275E73002B46CFAE7F2D3A81917380@UserPC> <041CC0E9-0FE5-4A57-A61F-529C9D96B58E@evertype.com> <568AA3BA.1030201@ix.netcom.com> Message-ID: On 4 Jan 2016, at 16:54, Asmus Freytag (t) wrote: > > On 1/4/2016 7:49 AM, Michael Everson wrote: >> Excellent! >> Looks like a candidate character for encoding. I?m sure I have some examples of good font designs for the old character in one of my books. > > Admitting that a Greek letter inherently makes more sense than an "et" as a variable name, I would still need to understand why "pi" would make a sensible mnemonic choice for the variable in Gauss' treatise, before being confident that we've made the correct identification. The more so, as the use of non-cursive pi for "perihelion" in the same work is clearly mnemonic. Certainly it does look more like a very common variant of ?tau? than ?pi? 
Michael Everson * http://www.evertype.com/ From asmus-inc at ix.netcom.com Mon Jan 4 13:58:12 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Mon, 4 Jan 2016 11:58:12 -0800 Subject: Turned Capital letter L (pointing to the left, with serifs) In-Reply-To: References: <36275E73002B46CFAE7F2D3A81917380@UserPC> <041CC0E9-0FE5-4A57-A61F-529C9D96B58E@evertype.com> <568AA3BA.1030201@ix.netcom.com> Message-ID: <568ACED4.3010505@ix.netcom.com> An HTML attachment was scrubbed... URL: From raymond at almanach.co.uk Mon Jan 4 14:27:44 2016 From: raymond at almanach.co.uk (Raymond Mercier) Date: Mon, 4 Jan 2016 20:27:44 -0000 Subject: Turned Capital letter L (pointing to the left, with serifs) In-Reply-To: <568ACED4.3010505@ix.netcom.com> References: <36275E73002B46CFAE7F2D3A81917380@UserPC> <041CC0E9-0FE5-4A57-A61F-529C9D96B58E@evertype.com> <568AA3BA.1030201@ix.netcom.com> <568ACED4.3010505@ix.netcom.com> Message-ID: On further reflection I can well agree that it is tau. The attached images from R. Barbour, Greek Literary Hands show clearly (scan 3) the large upper case tau in several lines, and in scan 4 in the first and other lines a hooked version of tau. So I withdraw my suggestion of pi. Raymond From: Asmus Freytag (t) Sent: Monday, January 04, 2016 7:58 PM To: unicode at unicode.org Subject: Re: Turned Capital letter L (pointing to the left, with serifs) On 1/4/2016 10:41 AM, Michael Everson wrote: Certainly it does look more like a very common variant of ?tau? than ?pi? Variant of uppercase tau? A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: scan0003.jpg Type: image/jpeg Size: 156740 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: scan0004.jpg Type: image/jpeg Size: 182112 bytes Desc: not available URL: From raymond at almanach.co.uk Mon Jan 4 14:33:27 2016 From: raymond at almanach.co.uk (Raymond Mercier) Date: Mon, 4 Jan 2016 20:33:27 -0000 Subject: Turned Capital letter L (pointing to the left, with serifs) In-Reply-To: <568ACED4.3010505@ix.netcom.com> References: <36275E73002B46CFAE7F2D3A81917380@UserPC> <041CC0E9-0FE5-4A57-A61F-529C9D96B58E@evertype.com> <568AA3BA.1030201@ix.netcom.com> <568ACED4.3010505@ix.netcom.com> Message-ID: <7F2DF0906BEC42219F6825A4D225926A@UserPC> On further reflection I can well agree that it is tau. The attached images from R. Barbour, Greek Literary Hands, show clearly (scan 3) the large upper case tau in several lines, and in scan 4 in the first and other lines a hooked version of tau. So I withdraw my suggestion of pi. Raymond From: Asmus Freytag (t) Sent: Monday, January 04, 2016 7:58 PM To: unicode at unicode.org Subject: Re: Turned Capital letter L (pointing to the left, with serifs) On 1/4/2016 10:41 AM, Michael Everson wrote: Certainly it does look more like a very common variant of ?tau? than ?pi? Variant of uppercase tau? A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: scan0003_2.jpg Type: image/jpeg Size: 19199 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: scan0004_1.jpg Type: image/jpeg Size: 20048 bytes Desc: not available URL: From asmus-inc at ix.netcom.com Mon Jan 4 13:58:12 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Mon, 4 Jan 2016 11:58:12 -0800 Subject: Turned Capital letter L (pointing to the left, with serifs) In-Reply-To: References: <36275E73002B46CFAE7F2D3A81917380@UserPC> <041CC0E9-0FE5-4A57-A61F-529C9D96B58E@evertype.com> <568AA3BA.1030201@ix.netcom.com> Message-ID: <568ACED4.3010505@ix.netcom.com> On 1/4/2016 10:41 AM, Michael Everson wrote: >> Certainly it does look more like a very common variant of 'tau' than 'pi' > > Variant of uppercase tau? No, of lowercase. Michael From frederic.grosshans at gmail.com Mon Jan 4 15:33:45 2016 From: frederic.grosshans at gmail.com (=?UTF-8?B?RnLDqWTDqXJpYyBHcm9zc2hhbnM=?=) Date: Mon, 04 Jan 2016 21:33:45 +0000 Subject: Turned Capital letter L (pointing to the left, with serifs) In-Reply-To: <7F2DF0906BEC42219F6825A4D225926A@UserPC> References: <36275E73002B46CFAE7F2D3A81917380@UserPC> <041CC0E9-0FE5-4A57-A61F-529C9D96B58E@evertype.com> <568AA3BA.1030201@ix.netcom.com> <568ACED4.3010505@ix.netcom.com> <7F2DF0906BEC42219F6825A4D225926A@UserPC> Message-ID: I looked at all the pages of the 1809 edition of _Theoria motus corporum coelestium in sectionibus conicis solem ambientium_ https://archive.org/stream/bub_gb_ORUOAAAAQAAJ where Gauss used this notation in pages 80-81. Almost all notations are standard enough to be familiar to any modern (2015) mathematician or physicist, with two exceptions: this "7" symbol and ☊ U+260A ASCENDING NODE (which is still standard in astronomy). The Greek letters in particular have a pretty standard shape, and I don't see why this symbol would be the only Greek letter using a fancy cursive shape. Even the Latin letters used standard shapes (italic, roman, a few capital fraktur). That said, I did not spot a tau in the text, while most of the Greek alphabet was used. Could "7" be a standard shape for tau in 1809 Hamburg? However, I still think it is a ⦢ U+29A2 TURNED ANGLE Frédéric On Mon 4 Jan 2016 at 21:38, Raymond Mercier wrote: > On further reflection I can well agree that it is tau. The attached images > from R. Barbour, Greek Literary Hands, show clearly (scan 3) the large > upper case tau in several lines, and in scan 4 in the first and other lines > a hooked version of tau. So I withdraw my suggestion of pi. > Raymond > > *From:* Asmus Freytag (t) > *Sent:* Monday, January 04, 2016 7:58 PM > *To:* unicode at unicode.org > *Subject:* Re: Turned Capital letter L (pointing to the left, with serifs) > > On 1/4/2016 10:41 AM, Michael Everson wrote: > > Certainly it does look more like a very common variant of 'tau' than 'pi' > > > Variant of uppercase tau? > > A./ > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From everson at evertype.com Mon Jan 4 16:14:14 2016 From: everson at evertype.com (Michael Everson) Date: Mon, 4 Jan 2016 22:14:14 +0000 Subject: Turned Capital letter L (pointing to the left, with serifs) In-Reply-To: References: <36275E73002B46CFAE7F2D3A81917380@UserPC> <041CC0E9-0FE5-4A57-A61F-529C9D96B58E@evertype.com> <568AA3BA.1030201@ix.netcom.com> <568ACED4.3010505@ix.netcom.com> <7F2DF0906BEC42219F6825A4D225926A@UserPC> Message-ID: <08A1BCBE-382D-4547-ADFE-83863C089DBC@evertype.com> On 4 Jan 2016, at 21:33, Fr?d?ric Grosshans wrote: > > The Greek letters in particular have a pretty standard shape, and I don't see why this symbol would be the only geek letter using a fancy cursive shape. Even the Latin letters used standard shapes ( italic, roman, a few capital fraktur). If he uses a regular tau for anything else that would be the reason. > That said, I did not spot a tau in the text, while most of the Greek alphabet was used. Could "7" be a standard shape for tau in 1809 Hamburg? It?s a standard variant in many older Greek typefaces with ligatures. Michael Everson * http://www.evertype.com/ From asmus-inc at ix.netcom.com Mon Jan 4 23:08:03 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Mon, 4 Jan 2016 21:08:03 -0800 Subject: Turned Capital letter L (pointing to the left, with serifs) In-Reply-To: References: <36275E73002B46CFAE7F2D3A81917380@UserPC> <041CC0E9-0FE5-4A57-A61F-529C9D96B58E@evertype.com> <568AA3BA.1030201@ix.netcom.com> <568ACED4.3010505@ix.netcom.com> <7F2DF0906BEC42219F6825A4D225926A@UserPC> Message-ID: <568B4FB3.2080708@ix.netcom.com> An HTML attachment was scrubbed... URL: From lists+unicode at seantek.com Mon Jan 4 23:30:32 2016 From: lists+unicode at seantek.com (Sean Leonard) Date: Mon, 4 Jan 2016 21:30:32 -0800 Subject: Unicode password mapping for crypto standard Message-ID: <568B54F8.5000802@seantek.com> Hi Unicode list, I am looking for feedback on this proposal, specifically a standard specification to map between (presumably) Unicode text strings and octet strings. A "password" is defined as an arbitrary octet string in a number of protocols and formats. This has worked for basic cases where the "password" is just ASCII, but there are interoperability issues when characters beyond ASCII get involved. My observation is that a lot of security folks get hand-wavy about the Unicode stuff, which is why there is little standardization in this area. Recently in the IETF, application/pkcs8-encrypted is proposed for the PKCS #8 EncryptedPrivateKeyInfo type. For purposes of our discussion, the format takes as input an opaque octet string (any octet in the range 00h-FFh, of any length), and executes various specified algorithms; the result is a decrypted private key. The most common algorithm is PBKDF2, but any algorithm can be used (including, for example, a raw symmetric encryption algorithm such as AES-256). PKCS #8 punts on the issue of character encoding. It says that ASCII or UTF-8 could be used, but doesn?t enforce anything in particular. PKCS #12 specifies UTF-16LE with a terminating NULL character (00h 00h). In the application/pkcs8-encrypted registration, I thought it might be wise to allow senders and receivers to specify how input (whether user input or otherwise) gets mapped to the octet string, since it's not part of the format. Originally my concern at that time was to reflect IANA character sets, rather than profiles of Unicode. These days, however, most user agents are Unicode-enabled and will accept user input in Unicode. 
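To make the difference between those conventions concrete, here is a minimal sketch in Python (standard library only; the example password, salt and iteration count are invented for illustration and are not taken from any of the specifications discussed): the same typed password yields different octet strings, and therefore unrelated PBKDF2 keys, under the PKCS #12 convention versus plain UTF-8.

import hashlib

password = "café"            # example only; contains a non-ASCII character
salt = b"example-salt"       # example value
iterations = 10000           # example value

# PKCS #12 convention: UTF-16LE plus a two-byte NUL terminator.
pkcs12_octets = password.encode("utf-16-le") + b"\x00\x00"
# Common PKCS #5/#8 practice: plain UTF-8, no terminator.
utf8_octets = password.encode("utf-8")

key_a = hashlib.pbkdf2_hmac("sha256", pkcs12_octets, salt, iterations)
key_b = hashlib.pbkdf2_hmac("sha256", utf8_octets, salt, iterations)
assert key_a != key_b        # same password, two unrelated keys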
Therefore, issue is less about legacy character sets, and more about how to take the Unicode input and get a consistent and reasonable stream of bits out on both ends. For example: should the password be case folded, converted to NFKC, encoded in UTF-8 vs. UTF-16BE, etc.? Constraining or transforming the input would be helpful for disparate systems to agree on these things. Thank you, Sean PS I read the "Unicode in passwords" thread. It's relevant. An alternative or addition to proposing a mapping to/from Unicode, might be to have a "keyboard-mapping" or "keyboard-layout" parameter, that specifies the suggested layout of the keyboard (or input device) used for password input, preferably by deferring to some international standard on the topic. Such a parameter could influence the initial user input method, but it doesn't answer the question of how to turn the key presses into specific bits (Unicode-based or otherwise). ********** The relevant part of the template (most recent proposal, today) is: *** Optional parameters: password-mapping: When the private key encryption algorithm incorporates a "password" that is an octet string, a mapping between user input and the octet string is desirable. PKCS #5 [RFC2898] Section 3 recommends "that applications follow some common text encoding rules"; it then suggests, but does not recommend, ASCII and UTF-8. This parameter specifies the charset that a recipient SHOULD attempt first when mapping user input to the octet string. It has similar semantics as the charset parameter from text/plain, except that it only applies to the user?s input of the password. There is no default value. The following special values are defined: *pkcs12 = UTF-16LE with U+0000 NULL terminator (PKCS #12-style) *precis = PRECIS password profile, i.e., OpaqueString from Section 4 of RFC 7613 (always UTF-8) *precis-XXX = PRECIS profile as named XXX in the IANA PRECIS Profiles Registry *hex = hexadecimal input: the input is mapped to 0-9, A-F, and then converted directly to octets. If there are an odd number of hex digits, the final digit 0 is appended, or an error condition may be raised. Compare with Annex M.4 of IEEE 802.11-2012. *dtmf = The characters "0"-"9", "A"-"D", "*", and "#", which map to their corresponding ASCII codes. (This is to support restricted-input devices, i.e., telephones and telephone-like equipment.) Otherwise, the value of this parameter is a charset, from the Character Sets Registry . *** The relevant part of the original template (proposed 2015-11-04) is: *** Optional parameters: charset: When the private key encryption algorithm incorporates a ?password" that is an octet string, a mapping between user input and the octet string is desirable. PKCS #5 [RFC2898] Section 3 recommends "that applications follow some common text encoding rules"; it then suggests, but does not recommend, ASCII and UTF-8. This parameter specifies the charset that a recipient SHOULD attempt first when mapping user input to the octet string. It has the same semantics as the charset parameter from text/plain, except that it only applies to the user?s input of the password. There is no default value. ualg: When the charset is a Unicode-based encoding, this parameter is a space-delimited list of Unicode algorithms that a recipient SHOULD first attempt to apply to the Unicode user input in succession, in order to derive the octet string. The list of algorithm keywords is defined by [UNICODE]. ?Tailored operations? 
are operations that are sensitive to language, which must be provided as an input parameter. If a tailored operation is called for, the exclamation mark followed by the [BCP47] language tag specifies the language. For example, "toNFD toNFKC_Casefold!tr" first applies Normalization Form D, followed by Normalization Form KC with Case Folding in the Turkish language, according to [UNICODE] and [UAX31]. The default value of this parameter is empty, and leaves the matter of whether to normalize, case fold, or apply other transformations unspecified. The latest template is here: http://mailarchive.ietf.org/arch/msg/precis/Qil9mc5AtqxXp8OXllp0lAwYts4 From c933103 at gmail.com Tue Jan 5 01:19:25 2016 From: c933103 at gmail.com (gfb hjjhjh) Date: Tue, 5 Jan 2016 15:19:25 +0800 Subject: Unicode password mapping for crypto standard In-Reply-To: <568B54F8.5000802@seantek.com> References: <568B54F8.5000802@seantek.com> Message-ID: Hello, I don't have much knowledge on the topic, but 1. probably something like the punycode used for internationalized domain name might help? 2. I don't think keyboard mapping is a good idea, as to some less computer-savvy Chinese-speaking users, it's often that their only way to write Chinese into computer is by handwriting and handwriting doesn't seem to be something supported by keyboard mapping. 2016/01/05 13:33 "Sean Leonard" : > Hi Unicode list, I am looking for feedback on this proposal, specifically > a standard specification to map between (presumably) Unicode text strings > and octet strings. > > A "password" is defined as an arbitrary octet string in a number of > protocols and formats. This has worked for basic cases where the "password" > is just ASCII, but there are interoperability issues when characters beyond > ASCII get involved. My observation is that a lot of security folks get > hand-wavy about the Unicode stuff, which is why there is little > standardization in this area. > > Recently in the IETF, application/pkcs8-encrypted is proposed for the PKCS > #8 EncryptedPrivateKeyInfo type. For purposes of our discussion, the format > takes as input an opaque octet string (any octet in the range 00h-FFh, of > any length), and executes various specified algorithms; the result is a > decrypted private key. The most common algorithm is PBKDF2, but any > algorithm can be used (including, for example, a raw symmetric encryption > algorithm such as AES-256). > > PKCS #8 punts on the issue of character encoding. It says that ASCII or > UTF-8 could be used, but doesn?t enforce anything in particular. PKCS #12 > specifies UTF-16LE with a terminating NULL character (00h 00h). > > In the application/pkcs8-encrypted registration, I thought it might be > wise to allow senders and receivers to specify how input (whether user > input or otherwise) gets mapped to the octet string, since it's not part of > the format. Originally my concern at that time was to reflect IANA > character sets, rather than profiles of Unicode. > > These days, however, most user agents are Unicode-enabled and will accept > user input in Unicode. Therefore, issue is less about legacy character > sets, and more about how to take the Unicode input and get a consistent and > reasonable stream of bits out on both ends. For example: should the > password be case folded, converted to NFKC, encoded in UTF-8 vs. UTF-16BE, > etc.? Constraining or transforming the input would be helpful for disparate > systems to agree on these things. > > > Thank you, > > Sean > > PS I read the "Unicode in passwords" thread. 
It's relevant. An alternative > or addition to proposing a mapping to/from Unicode, might be to have a > "keyboard-mapping" or "keyboard-layout" parameter, that specifies the > suggested layout of the keyboard (or input device) used for password input, > preferably by deferring to some international standard on the topic. Such a > parameter could influence the initial user input method, but it doesn't > answer the question of how to turn the key presses into specific bits > (Unicode-based or otherwise). > > ********** > The relevant part of the template (most recent proposal, today) is: > *** > Optional parameters: > > password-mapping: > When the private key encryption algorithm incorporates a "password" that > is an octet string, a mapping between user input and the octet string is > desirable. PKCS #5 [RFC2898] Section 3 recommends "that applications follow > some common text encoding rules"; it then suggests, but does not recommend, > ASCII and UTF-8. This parameter specifies the charset that a recipient > SHOULD attempt first when mapping user input to the octet string. It has > similar semantics as the charset parameter from text/plain, except that it > only applies to the user?s input of the password. There is no default value. > > The following special values are defined: > *pkcs12 = UTF-16LE with U+0000 NULL terminator (PKCS #12-style) > *precis = PRECIS password profile, i.e., OpaqueString from Section 4 of > RFC 7613 (always UTF-8) > *precis-XXX = PRECIS profile as named XXX in the IANA PRECIS Profiles > Registry > *hex = hexadecimal input: the input is mapped to 0-9, A-F, and then > converted directly to octets. If there are an odd number of hex digits, the > final digit 0 is appended, or an error condition may be raised. Compare > with Annex M.4 of IEEE 802.11-2012. > *dtmf = The characters "0"-"9", "A"-"D", "*", and "#", which map to > their corresponding ASCII codes. (This is to support restricted-input > devices, i.e., telephones and telephone-like equipment.) > > Otherwise, the value of this parameter is a charset, from the Character > Sets Registry . > *** > > The relevant part of the original template (proposed 2015-11-04) is: > *** > Optional parameters: > charset: When the private key encryption algorithm incorporates a > ?password" that is an octet string, a mapping between user input and the > octet string is desirable. PKCS #5 [RFC2898] Section 3 recommends "that > applications follow some common text encoding rules"; it then suggests, but > does not recommend, ASCII and UTF-8. This parameter specifies the charset > that a recipient SHOULD attempt first when mapping user input to the octet > string. It has the same semantics as the charset parameter from text/plain, > except that it only applies to the user?s input of the password. There is > no default value. > > ualg: When the charset is a Unicode-based encoding, this parameter is a > space-delimited list of Unicode algorithms that a recipient SHOULD first > attempt to apply to the Unicode user input in succession, in order to > derive the octet string. The list of algorithm keywords is defined by > [UNICODE]. ?Tailored operations? are operations that are sensitive to > language, which must be provided as an input parameter. If a tailored > operation is called for, the exclamation mark followed by the [BCP47] > language tag specifies the language. 
For example, "toNFD > toNFKC_Casefold!tr" first applies Normalization Form D, followed by > Normalization Form KC with Case Folding in the Turkish language, according > to [UNICODE] and [UAX31]. The default value of this parameter is empty, and > leaves the matter of whether to normalize, case fold, or apply other > transformations unspecified. > > > The latest template is here: > > http://mailarchive.ietf.org/arch/msg/precis/Qil9mc5AtqxXp8OXllp0lAwYts4 > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From duerst at it.aoyama.ac.jp Tue Jan 5 02:26:45 2016 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Tue, 5 Jan 2016 17:26:45 +0900 Subject: Unicode in the Curriculum? In-Reply-To: References: <567331D2.1000007@gmail.com> <59AD8E6F-F67A-4FFD-B364-6AC9E555D23D@lboro.ac.uk> <6370df3a44124747bb580b2760e9d144@EX13D08UWB002.ant.amazon.com> Message-ID: <568B7E45.2020401@it.aoyama.ac.jp> I agree to a certain extent with Julian. There are extremely many subjects industry surely would like computer science students to learn in college, and internationalization/Unicode is only one of them. On the other hand, I think that universities teach about integer and floating point representation for numbers, and likewise, they should teach about ASCII and Unicode for text representation. I personally have given a full course on internationalization/Unicode topics only once, as a guest lecturer at the University of Linz in the 1990ies. In that same aera, I also once gave a course about computer topics for Japanology students, which of course included character encodings, but also complete beginner stuff such as use of Web search engines. Otherwise, I integrate Unicode and internationalization subjects in my courses where possible. As an example, in my C programming course, there's an exercise where students use the same C program with different source encodings, execution encodings, and terminal settings, getting some understanding for character count vs. byte count, repertoire of different encodings, and so on. This kind of stuff is a bit easier to do here in Japan, where "ASCII isn't enough" doesn't have to be explained at great length, and where multiple encodings (mostly UTF-8 and Shift_JIS) are still in use. Regards, Martin. On 2016/01/01 03:58, Julian Bradfield wrote: > On 2015-12-31, Andre Schappo wrote: > >> I have been hitting my head against the Academic Brick Wall for >> years WRT getting IT i18n and Unicode on the curriculum and I am >> losing. I did teach a final year elective module on IT i18n but a >> few months ago my University dropped it. I am continually puzzled by >> the lack of interest University Computer Science departments have in >> i18n. I appear to be a solitary UK University Computer Science voice >> when it comes to i18n. > > Well, I'd say that it's not the business of Computer Science degrees > to teach specific technical skills. It's our business to help people > learn about the fundamentals of the subject, so that they can acquire > any specific skill on demand, and use that skill competently. In those > areas where we do teach specific skills (e.g. machine learning > techniques) we teach those that have some intellectual content to > them. (This is why we don't teach programming languages as such - we > teach a programming language as a means of learning a programming > paradigm.) 
> > In my experience so far, using Unicode and doing i18n is not very > interesting (killingly boring, actually) from a purely CS technical > point of view, unless you happen to be one of the small minority who > enjoys script and font layout issues - the interesting bits of doing > i18n are in producing linguistically and culturally appropriate > messages, and that's where one should bring in experts, not expect > typical software developers to be able to do it. > > If you still have the materials for your course, it would be > interesting to see how you managed to get an interesting (and > examinable!) course out of i18n. > > I do in fact mention Unicode and i18n in my introductory programming > course (which is not for CS students), but all I say is "you should > know it's there, and if you become a competent programmer, then you > can read the manuals and tutorials to learn what you need". > From jknappen at web.de Tue Jan 5 03:10:40 2016 From: jknappen at web.de (=?UTF-8?Q?=22J=C3=B6rg_Knappen=22?=) Date: Tue, 5 Jan 2016 10:10:40 +0100 Subject: Aw: Re: Turned Capital letter L (pointing to the left, with serifs) In-Reply-To: <568B4FB3.2080708@ix.netcom.com> References: <36275E73002B46CFAE7F2D3A81917380@UserPC> <041CC0E9-0FE5-4A57-A61F-529C9D96B58E@evertype.com> <568AA3BA.1030201@ix.netcom.com> <568ACED4.3010505@ix.netcom.com> <7F2DF0906BEC42219F6825A4D225926A@UserPC> , <568B4FB3.2080708@ix.netcom.com> Message-ID: An HTML attachment was scrubbed... URL: From frederic.grosshans at gmail.com Tue Jan 5 03:22:16 2016 From: frederic.grosshans at gmail.com (=?UTF-8?B?RnLDqWTDqXJpYyBHcm9zc2hhbnM=?=) Date: Tue, 05 Jan 2016 09:22:16 +0000 Subject: Turned Capital letter L (pointing to the left, with serifs) In-Reply-To: References: <36275E73002B46CFAE7F2D3A81917380@UserPC> <041CC0E9-0FE5-4A57-A61F-529C9D96B58E@evertype.com> <568AA3BA.1030201@ix.netcom.com> <568ACED4.3010505@ix.netcom.com> <7F2DF0906BEC42219F6825A4D225926A@UserPC> <568B4FB3.2080708@ix.netcom.com> Message-ID: Le mar. 5 janv. 2016 10:13, "J?rg Knappen" a ?crit : > I have looked up some printed sources and I agree with Michael Everson and > Fr?d?ric Grosshans that the > beast in question is a variant of the greek letter tau (capital or > lowercase). > The identification to ? is from Asmus Freytag, not me. I have proposed another identity (TURNED ANGLE), and I only start to be convinced by the ? identification Fr?d?ric -------------- next part -------------- An HTML attachment was scrubbed... URL: From jknappen at web.de Tue Jan 5 04:07:23 2016 From: jknappen at web.de (=?UTF-8?Q?=22J=C3=B6rg_Knappen=22?=) Date: Tue, 5 Jan 2016 11:07:23 +0100 Subject: Aw: Re: Turned Capital letter L (pointing to the left, with serifs) In-Reply-To: References: <36275E73002B46CFAE7F2D3A81917380@UserPC> <041CC0E9-0FE5-4A57-A61F-529C9D96B58E@evertype.com> <568AA3BA.1030201@ix.netcom.com> <568ACED4.3010505@ix.netcom.com> <7F2DF0906BEC42219F6825A4D225926A@UserPC> , <568B4FB3.2080708@ix.netcom.com>, Message-ID: An HTML attachment was scrubbed... 
URL: From asmus-inc at ix.netcom.com Tue Jan 5 08:04:48 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Tue, 5 Jan 2016 06:04:48 -0800 Subject: Turned Capital letter L (pointing to the left, with serifs) In-Reply-To: References: <36275E73002B46CFAE7F2D3A81917380@UserPC> <041CC0E9-0FE5-4A57-A61F-529C9D96B58E@evertype.com> <568AA3BA.1030201@ix.netcom.com> <568ACED4.3010505@ix.netcom.com> <7F2DF0906BEC42219F6825A4D225926A@UserPC> <568B4FB3.2080708@ix.netcom.com> Message-ID: <568BCD80.20707@ix.netcom.com> On 1/5/2016 1:22 AM, Frédéric Grosshans wrote: > > > On Tue 5 Jan 2016 at 10:13, "Jörg Knappen" wrote: > > I have looked up some printed sources and I agree with Michael > Everson and Frédéric Grosshans that the > beast in question is a variant of the Greek letter tau (capital or > lowercase). > > > The identification to τ is from Asmus Freytag, not me. Mine is a concurring opinion based on ME's suggestion, but corroborated, in my view, by the systematic notational conventions and not merely informed by visual similarity. A./ > I have proposed another identity (TURNED ANGLE), and I only start to > be convinced by the τ identification > > Frédéric -------------- next part -------------- An HTML attachment was scrubbed... URL: From markus.icu at gmail.com Tue Jan 5 10:26:42 2016 From: markus.icu at gmail.com (Markus Scherer) Date: Tue, 5 Jan 2016 08:26:42 -0800 Subject: Unicode password mapping for crypto standard In-Reply-To: <568B54F8.5000802@seantek.com> References: <568B54F8.5000802@seantek.com> Message-ID: I would specify that UTF-8 must be used, without mapping. US-ASCII is a proper subset, so it need not be mentioned explicitly, nor distinguished in the protocol. Mappings would require that all implementations carry the relevant data and are up to date with recent versions of Unicode, or else previously-unassigned code points will cause failures. As long as a user types the same password the same way, or with IMEs that produce the same output, they are fine. Strange variants might improve password security. markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Tue Jan 5 10:26:52 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 5 Jan 2016 17:26:52 +0100 Subject: Turned Capital letter L (pointing to the left, with serifs) In-Reply-To: References: <36275E73002B46CFAE7F2D3A81917380@UserPC> <041CC0E9-0FE5-4A57-A61F-529C9D96B58E@evertype.com> <568AA3BA.1030201@ix.netcom.com> Message-ID: And given the context of use in the document, where it is a measurement of time in seconds (it is a mean daily time drift, if you don't read German), some variant of T/Tau is certainly the best option. The other variables in the additive formula were also related to time and were also based on "t", so the formula used various variants of the T/Tau letter. Intuitively, when reading the formula and description, I undoubtedly pronounced it "tau" (there was no other occurrence of the tau letter in the formula, but the fact that it used a bold capital may be related to the fact that the mean daily time drift in this formula is nearly constant, with very tiny variations that the formula wants to take into account in a differential). 
Traditionally, constants or near-constants are set in bold capital letters, and it was done to contrast with the true time "t", which is obviously not constant (|dt / dTau| is largely above 1 most of the time, except in very few short periods of the year, but the formula is not interested in finding/predicting those events; it wants to estimate how the geocentric time evolves over long periods through the years in order to compute calendars). The discovery of the cursive variant of pi is interesting but largely too far off, both graphically (it is a single curved stroke like a turned "J", but here the "7"-shaped letter clearly uses two strokes, like Tau) and semantically (pi would be related to an angle measurement, not to time; even if the formula is related to the pseudo-elliptic revolution of Earth around the Sun, it would not be coherent with the additive differential formula cumulating with time "t"). In summary, for me it's just a bold capital Greek letter Tau (in cursive/italic style, like "t", because it is a true variable and not a symbol like the differential operator). The printer however chose to use a decorative variant of the bold digit 7 to represent it, because it had it in its collection of metal fonts (e.g. for titling on cover pages, where titles/headings are customarily set in such decorative bold font styles). Maybe if you read the rest of the text, including the presentation, you will discover it more completely or even spelled out explicitly in sentences. But we have no audio records to confirm it: the reader has to interpret it, but it is easier to read and understand if you just identify it as "Tau" rather than "T" or, worse, as "7". 2016-01-04 19:41 GMT+01:00 Michael Everson : > On 4 Jan 2016, at 16:54, Asmus Freytag (t) > wrote: > > > > On 1/4/2016 7:49 AM, Michael Everson wrote: > >> Excellent! > >> Looks like a candidate character for encoding. I'm sure I have some > examples of good font designs for the old character in one of my books. > > > > Admitting that a Greek letter inherently makes more sense than an "et" > as a variable name, I would still need to understand why "pi" would make a > sensible mnemonic choice for the variable in Gauss' treatise, before being > confident that we've made the correct identification. The more so, as > the use of non-cursive pi for "perihelion" in the same work is clearly > mnemonic. > > Certainly it does look more like a very common variant of 'tau' than 'pi' > > Michael Everson * http://www.evertype.com/ > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bortzmeyer at nic.fr Tue Jan 5 10:37:05 2016 From: bortzmeyer at nic.fr (Stephane Bortzmeyer) Date: Tue, 5 Jan 2016 17:37:05 +0100 Subject: Unicode password mapping for crypto standard In-Reply-To: <568B54F8.5000802@seantek.com> References: <568B54F8.5000802@seantek.com> Message-ID: <20160105163705.GA4941@nic.fr> On Mon, Jan 04, 2016 at 09:30:32PM -0800, Sean Leonard wrote a message of 120 lines which said: > how to take the Unicode input and get a consistent and reasonable > stream of bits out on both ends. For example: should the password be > case folded, converted to NFKC, encoded in UTF-8 vs. UTF-16BE, etc.? 
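As a small illustration of why that question matters, consider a Python sketch (the strings are invented; NFC is shown only as one possible choice of normalization form, not as a statement of what any given standard requires): the "same" password typed with a precomposed versus a decomposed accent produces different octets unless some normalization is agreed on.

import unicodedata

typed_a = "caf\u00e9"    # precomposed é (U+00E9)
typed_b = "cafe\u0301"   # "e" followed by COMBINING ACUTE ACCENT (U+0301)

# Different code point sequences, hence different UTF-8 octet strings.
assert typed_a.encode("utf-8") != typed_b.encode("utf-8")

# Normalizing both sides to one agreed form (NFC here) before encoding
# makes the octet strings identical.
norm_a = unicodedata.normalize("NFC", typed_a).encode("utf-8")
norm_b = unicodedata.normalize("NFC", typed_b).encode("utf-8")
assert norm_a == norm_b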
From raymond at almanach.co.uk Tue Jan 5 12:20:07 2016 From: raymond at almanach.co.uk (Raymond Mercier) Date: Tue, 5 Jan 2016 18:20:07 -0000 Subject: Turned Capital letter L (pointing to the left, with serifs) In-Reply-To: <568BCD80.20707@ix.netcom.com> References: <36275E73002B46CFAE7F2D3A81917380@UserPC> <041CC0E9-0FE5-4A57-A61F-529C9D96B58E@evertype.com> <568AA3BA.1030201@ix.netcom.com> <568ACED4.3010505@ix.netcom.com> <7F2DF0906BEC42219F6825A4D225926A@UserPC> <568B4FB3.2080708@ix.netcom.com> <568BCD80.20707@ix.netcom.com> Message-ID: <395E59FB896A41FBA00C1E782F83E3B6@UserPC> I have looked at both the collected works of Gauss and at the English version of the Theoria Motus, in order to see what a later editor made of this symbol. In the Werke the symbol ?7? continues to be used : C F Gauss, Werke, Vol. 7, ed. E J Schering, Gotha, 1871; ? 77, M = N + n?7? ? ?. In the translation the ?7? is replaced by the lower case tau. Theory of the motion of the heavenly bodies moving about the sun in conic sections: a translation of Gauss's "Theoria motus." With an appendix. By Charles Henry Davis, Boston : Little, Brown and company, 1857; ? 77, M = N + n? ? ?. So this seems to settle the matter of the identity, and just leaves one to puzzle over the German use of this sign for tau. Raymond -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: carlfriedrichgau07gaus 100.jpg Type: image/jpeg Size: 64903 bytes Desc: not available URL: From A.Schappo at lboro.ac.uk Wed Jan 6 06:09:33 2016 From: A.Schappo at lboro.ac.uk (Andre Schappo) Date: Wed, 6 Jan 2016 12:09:33 +0000 Subject: Unicode in the Curriculum? In-Reply-To: <568AA4EB.5070606@ix.netcom.com> References: <568AA4EB.5070606@ix.netcom.com> Message-ID: On 4 Jan 2016, at 16:59, Asmus Freytag (t) wrote: > On 1/4/2016 6:44 AM, Elizabeth J. Pyatt wrote: >> Like some others on the list, I believe Unicode should be mentioned at different points in a programming curriculum, particularly at the time when ASCII would be taught. > ASCII shouldn't be taught, perhaps? I really like the idea of questioning whether or not ASCII should even be taught. Wherever in a programming curriculum, text processing/transmission/storage/presentation/encoding is taught, then it should be Unicode text. ASCII, along with, ISO-8859 ISO-2022 GB2312 ?etc? should be consigned to ?and finally, the legacy character sets/encodings... Maybe ASCII should now be flagged as deprecated https://twitter.com/andreschappo/status/684706421712228352 Andr? Schappo From corbett.dav at husky.neu.edu Wed Jan 6 08:42:21 2016 From: corbett.dav at husky.neu.edu (David Corbett) Date: Wed, 6 Jan 2016 09:42:21 -0500 Subject: HENTAIGANA LETTER E-1 Message-ID: Is there a difference between HENTAIGANA LETTER E-1 in L2/15-343 and U+1B001 HIRAGANA LETTER ARCHAIC YE? From kenwhistler at att.net Wed Jan 6 09:43:41 2016 From: kenwhistler at att.net (Ken Whistler) Date: Wed, 6 Jan 2016 07:43:41 -0800 Subject: Unicode in the Curriculum? In-Reply-To: References: <568AA4EB.5070606@ix.netcom.com> Message-ID: <568D362D.2020409@att.net> Actually, ASCII should *not* be ignored or deprecated. We *love* ASCII. The issue is just making sure that students understand that the *true name* of "ASCII" is "UTF-8". It is just the very first 128 values that open into the entire world of Unicode characters. 
It is a mind trick to play on young programmers: when you learn "ASCII", you are just playing on the bunny slope at the UTF-8 ski resort. Slap on your snowboard and practice -- get out there onto the 2-, 3- and 4-byte slopes with the experts! --Ken On 1/6/2016 4:09 AM, Andre Schappo wrote: > On 4 Jan 2016, at 16:59, Asmus Freytag (t) wrote: > >> ASCII shouldn't be taught, perhaps? > I really like the idea of questioning whether or not ASCII should even be taught. > > Wherever in a programming curriculum, text processing/transmission/storage/presentation/encoding is taught, then it should be Unicode text. > > ASCII, along with, ISO-8859 ISO-2022 GB2312 ?etc? should be consigned to > > ?and finally, the legacy character sets/encodings... > > Maybe ASCII should now be flagged as deprecated https://twitter.com/andreschappo/status/684706421712228352 > > Andr? Schappo > > > > From everson at evertype.com Wed Jan 6 10:22:12 2016 From: everson at evertype.com (Michael Everson) Date: Wed, 6 Jan 2016 16:22:12 +0000 Subject: HENTAIGANA LETTER E-1 In-Reply-To: References: Message-ID: <584B230B-BFBF-4A93-95B3-6F1E463E6181@evertype.com> On 6 Jan 2016, at 14:42, David Corbett wrote: > > Is there a difference between HENTAIGANA LETTER E-1 in L2/15-343 and > U+1B001 HIRAGANA LETTER ARCHAIC YE? No, there is not. The former would be unified with it. Michael Everson * http://www.evertype.com/ From Shawn.Steele at microsoft.com Wed Jan 6 12:59:22 2016 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Wed, 6 Jan 2016 18:59:22 +0000 Subject: Unicode in the Curriculum? In-Reply-To: <568D362D.2020409@att.net> References: <568AA4EB.5070606@ix.netcom.com> <568D362D.2020409@att.net> Message-ID: +1 :) -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Ken Whistler Sent: Wednesday, January 6, 2016 7:44 AM To: Andre Schappo Cc: unicode at unicode.org Subject: Re: Unicode in the Curriculum? Actually, ASCII should *not* be ignored or deprecated. We *love* ASCII. The issue is just making sure that students understand that the *true name* of "ASCII" is "UTF-8". It is just the very first 128 values that open into the entire world of Unicode characters. It is a mind trick to play on young programmers: when you learn "ASCII", you are just playing on the bunny slope at the UTF-8 ski resort. Slap on your snowboard and practice -- get out there onto the 2-, 3- and 4-byte slopes with the experts! --Ken On 1/6/2016 4:09 AM, Andre Schappo wrote: > On 4 Jan 2016, at 16:59, Asmus Freytag (t) wrote: > >> ASCII shouldn't be taught, perhaps? > I really like the idea of questioning whether or not ASCII should even be taught. > > Wherever in a programming curriculum, text processing/transmission/storage/presentation/encoding is taught, then it should be Unicode text. > > ASCII, along with, ISO-8859 ISO-2022 GB2312 .etc. should be consigned > to > > .and finally, the legacy character sets/encodings... > > Maybe ASCII should now be flagged as deprecated > https://twitter.com/andreschappo/status/684706421712228352 > > Andr? 
Schappo > > > > From gwalla at gmail.com Wed Jan 6 16:42:42 2016 From: gwalla at gmail.com (Garth Wallace) Date: Wed, 6 Jan 2016 14:42:42 -0800 Subject: HENTAIGANA LETTER E-1 In-Reply-To: <584B230B-BFBF-4A93-95B3-6F1E463E6181@evertype.com> References: <584B230B-BFBF-4A93-95B3-6F1E463E6181@evertype.com> Message-ID: On Wed, Jan 6, 2016 at 8:22 AM, Michael Everson wrote: > On 6 Jan 2016, at 14:42, David Corbett wrote: >> >> Is there a difference between HENTAIGANA LETTER E-1 in L2/15-343 and >> U+1B001 HIRAGANA LETTER ARCHAIC YE? > > No, there is not. The former would be unified with it. > > Michael Everson * http://www.evertype.com/ > They never took that out? I pointed it out back in July and Ken Lunde passed it along in his official feedback AIUI: . I could have sworn they took it out after that. It's a very clear duplicate. From asmus-inc at ix.netcom.com Wed Jan 6 17:19:09 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Wed, 6 Jan 2016 15:19:09 -0800 Subject: Unicode in the Curriculum? In-Reply-To: References: <568AA4EB.5070606@ix.netcom.com> <568D362D.2020409@att.net> Message-ID: <568DA0ED.2050804@ix.netcom.com> An HTML attachment was scrubbed... URL: From Shawn.Steele at microsoft.com Wed Jan 6 17:27:25 2016 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Wed, 6 Jan 2016 23:27:25 +0000 Subject: Unicode in the Curriculum? In-Reply-To: <568DA0ED.2050804@ix.netcom.com> References: <568AA4EB.5070606@ix.netcom.com> <568D362D.2020409@att.net> <568DA0ED.2050804@ix.netcom.com> Message-ID: Then it should be UTF-8. Learning to do something in a non-Unicode code page and then redoing it for UTF-8 or UTF-16 merely leads to conversion problems, incompatibilities, and other nonsense. If someone ?needs? to not use UTF-16 for whatever reason, then they should use UTF-8. The ?advanced? training should be the other non-Unicode code pages. Teach them right the first time. They?ll never use a code page. -Shawn From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Asmus Freytag (t) Sent: January 6, 2016 3:19 PM To: unicode at unicode.org Subject: Re: Unicode in the Curriculum? On 1/6/2016 10:59 AM, Shawn Steele wrote: +1 :) I'm not going to join the happy chorus here. The "bunny" slope for most people is their own native language... A./ -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Ken Whistler Sent: Wednesday, January 6, 2016 7:44 AM To: Andre Schappo Cc: unicode at unicode.org Subject: Re: Unicode in the Curriculum? Actually, ASCII should *not* be ignored or deprecated. We *love* ASCII. The issue is just making sure that students understand that the *true name* of "ASCII" is "UTF-8". It is just the very first 128 values that open into the entire world of Unicode characters. It is a mind trick to play on young programmers: when you learn "ASCII", you are just playing on the bunny slope at the UTF-8 ski resort. Slap on your snowboard and practice -- get out there onto the 2-, 3- and 4-byte slopes with the experts! --Ken On 1/6/2016 4:09 AM, Andre Schappo wrote: On 4 Jan 2016, at 16:59, Asmus Freytag (t) wrote: ASCII shouldn't be taught, perhaps? I really like the idea of questioning whether or not ASCII should even be taught. Wherever in a programming curriculum, text processing/transmission/storage/presentation/encoding is taught, then it should be Unicode text. ASCII, along with, ISO-8859 ISO-2022 GB2312 .etc. should be consigned to .and finally, the legacy character sets/encodings... 
Maybe ASCII should now be flagged as deprecated https://twitter.com/andreschappo/status/684706421712228352 Andr? Schappo -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmus-inc at ix.netcom.com Wed Jan 6 17:32:33 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Wed, 6 Jan 2016 15:32:33 -0800 Subject: Unicode in the Curriculum? In-Reply-To: References: <568AA4EB.5070606@ix.netcom.com> <568D362D.2020409@att.net> <568DA0ED.2050804@ix.netcom.com> Message-ID: <568DA411.5000506@ix.netcom.com> On 1/6/2016 3:27 PM, Shawn Steele wrote: > > Then it should be UTF-8. Learning to do something in a non-Unicode > code page and then redoing it for UTF-8 or UTF-16 merely leads to > conversion problems, incompatibilities, and other nonsense. > Agreed. But so does teaching people that it's OK to use ASCII-fallbacks, because a few of their characters are not available on the bunny slope. > > If someone ?needs? to not use UTF-16 for whatever reason, then they > should use UTF-8. The ?advanced? training should be the other > non-Unicode code pages. > I think any training in non-Unicode character sets is beyond a standard curriculum, except perhaps History of Computing or Digital Archaeology :) > > Teach them right the first time. They?ll never use a code page. > +1 A./ > > -Shawn > > *From:*Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of > *Asmus Freytag (t) > *Sent:* January 6, 2016 3:19 PM > *To:* unicode at unicode.org > *Subject:* Re: Unicode in the Curriculum? > > On 1/6/2016 10:59 AM, Shawn Steele wrote: > > +1 :) > > > I'm not going to join the happy chorus here. > > The "bunny" slope for most people is their own native language... > > A./ > > -----Original Message----- > > From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Ken Whistler > > Sent: Wednesday, January 6, 2016 7:44 AM > > To: Andre Schappo > > Cc:unicode at unicode.org > > Subject: Re: Unicode in the Curriculum? > > Actually, ASCII should *not* be ignored or deprecated. > > We *love* ASCII. The issue is just making sure that students understand that the *true name* of "ASCII" is "UTF-8". It is just the very first 128 values that open into the entire world of Unicode characters. > > It is a mind trick to play on young programmers: when you learn "ASCII", you are just playing on the bunny slope at the UTF-8 ski resort. Slap on your snowboard and practice -- get out there onto the 2-, 3- and 4-byte slopes with the experts! > > --Ken > > On 1/6/2016 4:09 AM, Andre Schappo wrote: > > On 4 Jan 2016, at 16:59, Asmus Freytag (t) wrote: > > ASCII shouldn't be taught, perhaps? > > I really like the idea of questioning whether or not ASCII should even be taught. > > Wherever in a programming curriculum, text processing/transmission/storage/presentation/encoding is taught, then it should be Unicode text. > > ASCII, along with, ISO-8859 ISO-2022 GB2312 .etc. should be consigned > > to > > .and finally, the legacy character sets/encodings... > > Maybe ASCII should now be flagged as deprecated > > https://twitter.com/andreschappo/status/684706421712228352 > > Andr? Schappo > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Shawn.Steele at microsoft.com Wed Jan 6 17:36:16 2016 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Wed, 6 Jan 2016 23:36:16 +0000 Subject: Unicode in the Curriculum? 
In-Reply-To: <568DA411.5000506@ix.netcom.com> References: <568AA4EB.5070606@ix.netcom.com> <568D362D.2020409@att.net> <568DA0ED.2050804@ix.netcom.com> <568DA411.5000506@ix.netcom.com> Message-ID: ? I think any training in non-Unicode character sets is beyond a standard curriculum, except perhaps History of Computing or Digital Archaeology :) One could only hope. -------------- next part -------------- An HTML attachment was scrubbed... URL: From eik at iki.fi Thu Jan 7 00:50:26 2016 From: eik at iki.fi (Erkki I Kolehmainen) Date: Thu, 7 Jan 2016 08:50:26 +0200 Subject: Unicode in the Curriculum? In-Reply-To: <568DA0ED.2050804@ix.netcom.com> References: <568AA4EB.5070606@ix.netcom.com> <568D362D.2020409@att.net> <568DA0ED.2050804@ix.netcom.com> Message-ID: <001b01d14917$b0cc75c0$12656140$@fi> +1 I cannot but agree with Asmus. Sincerely, Erkki L?hett?j?: Unicode [mailto:unicode-bounces at unicode.org] Puolesta Asmus Freytag (t) L?hetetty: 7. tammikuuta 2016 01:19 Vastaanottaja: unicode at unicode.org Aihe: Re: Unicode in the Curriculum? On 1/6/2016 10:59 AM, Shawn Steele wrote: +1 :) I'm not going to join the happy chorus here. The "bunny" slope for most people is their own native language... A./ -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Ken Whistler Sent: Wednesday, January 6, 2016 7:44 AM To: Andre Schappo Cc: unicode at unicode.org Subject: Re: Unicode in the Curriculum? Actually, ASCII should *not* be ignored or deprecated. We *love* ASCII. The issue is just making sure that students understand that the *true name* of "ASCII" is "UTF-8". It is just the very first 128 values that open into the entire world of Unicode characters. It is a mind trick to play on young programmers: when you learn "ASCII", you are just playing on the bunny slope at the UTF-8 ski resort. Slap on your snowboard and practice -- get out there onto the 2-, 3- and 4-byte slopes with the experts! --Ken On 1/6/2016 4:09 AM, Andre Schappo wrote: On 4 Jan 2016, at 16:59, Asmus Freytag (t) wrote: ASCII shouldn't be taught, perhaps? I really like the idea of questioning whether or not ASCII should even be taught. Wherever in a programming curriculum, text processing/transmission/storage/presentation/encoding is taught, then it should be Unicode text. ASCII, along with, ISO-8859 ISO-2022 GB2312 .etc. should be consigned to .and finally, the legacy character sets/encodings... Maybe ASCII should now be flagged as deprecated https://twitter.com/andreschappo/status/684706421712228352 Andr? Schappo -------------- next part -------------- An HTML attachment was scrubbed... URL: From mpsuzuki at hiroshima-u.ac.jp Thu Jan 7 09:56:38 2016 From: mpsuzuki at hiroshima-u.ac.jp (suzuki toshiya) Date: Fri, 8 Jan 2016 00:56:38 +0900 Subject: [Unicode] Re: HENTAIGANA LETTER E-1 In-Reply-To: <90baabfa798243e6b41e4f3d6502ed3c@PS1PR04MB0953.apcprd04.prod.outlook.com> References: <584B230B-BFBF-4A93-95B3-6F1E463E6181@evertype.com> <90baabfa798243e6b41e4f3d6502ed3c@PS1PR04MB0953.apcprd04.prod.outlook.com> Message-ID: <568E8AB6.5010307@hiroshima-u.ac.jp> Hi, I'm not a representative of the experts working for the proposal from Japan NB, but I could explain something. 1) "They never took that out?" I'm not sure who you mean "they" (UTC? JNB?), but it seems that no official document asking for the response from JNB is submitted in WG2. If UTC sends something officially, JNB would response something, I believe. 2) Difference in HENTAIGANA LETTER E-1 and U+1B001. 
U+1B001 is a character designed to note an ancient (and extinct in modern Japanese language) pronunciation YE. When standard kana was defined about 100 years ago, the pronunciation YE was already merged to E. Some scholars planned to use a few kana-like characters to note such pronunciation (to discuss about the ancient Japanese language pronunciation), and used some hentaigana- like glyphs for such purpose. As far as I know, there is no wide consensus that the glyph looking like U+1B001 was historically used to note YE mainly, when YE and E were distinctively used in Japanese language. On the other hand, JNB's proposal does not include any ancient/extinct pronunciation, Their phonetic coverage is exactly same with modern Japanese language. So, the glyph looking like U+1B001 is not designed to note the pronunciation YE. The motivation why JNB proposed hentaigana would be just because of their shape differences. Therefore, U+1B001 and HENTAIGANA E-1 could be said as differently designed, their designed usages are different. Please do not think JNB hentaigana experts overlooked U+1B001 and proposed a duplicated encoding. They ought to have known it but proposed. However, some WG2 experts suggested to unify them because of the shape similarity. I'm not sure whether 2 glyphs are indistinctively similar for hentaigana scholars, but I accept with that some people are hard to distinguish. I cannot distinguish some Latin and Greek alphabets when they are displayed as single isolated character. Regards, mpsuzuki Garth Wallace wrote: > On Wed, Jan 6, 2016 at 8:22 AM, Michael Everson wrote: >> On 6 Jan 2016, at 14:42, David Corbett wrote: >>> Is there a difference between HENTAIGANA LETTER E-1 in L2/15-343 and >>> U+1B001 HIRAGANA LETTER ARCHAIC YE? >> No, there is not. The former would be unified with it. >> >> Michael Everson * http://www.evertype.com/ >> > > They never took that out? I pointed it out back in July and Ken Lunde > passed it along in his official feedback AIUI: > . I could have > sworn they took it out after that. It's a very clear duplicate. From gwalla at gmail.com Thu Jan 7 14:39:50 2016 From: gwalla at gmail.com (Garth Wallace) Date: Thu, 7 Jan 2016 12:39:50 -0800 Subject: [Unicode] Re: HENTAIGANA LETTER E-1 In-Reply-To: <568E8AB6.5010307@hiroshima-u.ac.jp> References: <584B230B-BFBF-4A93-95B3-6F1E463E6181@evertype.com> <90baabfa798243e6b41e4f3d6502ed3c@PS1PR04MB0953.apcprd04.prod.outlook.com> <568E8AB6.5010307@hiroshima-u.ac.jp> Message-ID: On Thu, Jan 7, 2016 at 7:56 AM, suzuki toshiya wrote: > Hi, > > I'm not a representative of the experts working for the > proposal from Japan NB, but I could explain something. > > 1) "They never took that out?" I'm not sure who you mean > "they" (UTC? JNB?), but it seems that no official document > asking for the response from JNB is submitted in WG2. > If UTC sends something officially, JNB would response > something, I believe. I meant the JNB. I thought they had removed that character from the later revised proposals that were posted on the UTC document register, but I checked and I had apparently been mistaken. The issue is only raised in passing in a footnote in Mr. Lunde's feedback. > 2) Difference in HENTAIGANA LETTER E-1 and U+1B001. > > U+1B001 is a character designed to note an ancient (and > extinct in modern Japanese language) pronunciation YE. > > When standard kana was defined about 100 years ago, > the pronunciation YE was already merged to E. 
> Some scholars planned to use a few kana-like characters > to note such pronunciation (to discuss about the ancient > Japanese language pronunciation), and used some hentaigana- > like glyphs for such purpose. As far as I know, there is > no wide consensus that the glyph looking like U+1B001 was > historically used to note YE mainly, when YE and E were > distinctively used in Japanese language. AIUI they simply reused an existing hentaigana to make the distinction, rather than making a new kana that just happened to look exactly like it. > On the other hand, JNB's proposal does not include any > ancient/extinct pronunciation, Their phonetic coverage > is exactly same with modern Japanese language. So, > the glyph looking like U+1B001 is not designed to note > the pronunciation YE. The motivation why JNB proposed > hentaigana would be just because of their shape differences. > > Therefore, U+1B001 and HENTAIGANA E-1 could be said as > differently designed, their designed usages are different. > Please do not think JNB hentaigana experts overlooked > U+1B001 and proposed a duplicated encoding. They ought to > have known it but proposed. It's not unknown for a single character to have more than one pronunciation in different contexts. > However, some WG2 experts suggested to unify them because > of the shape similarity. I'm not sure whether 2 glyphs are > indistinctively similar for hentaigana scholars, but I > accept with that some people are hard to distinguish. > I cannot distinguish some Latin and Greek alphabets when > they are displayed as single isolated character. We're not talking about about different scripts, though. Hentaigana are obsolete hiragana (eliminated from modern written Japanese by a spelling reform) but they are still hiragana. Latin and Greek, on the other hand, are clearly separate but related scripts. From mpsuzuki at hiroshima-u.ac.jp Fri Jan 8 08:55:36 2016 From: mpsuzuki at hiroshima-u.ac.jp (suzuki toshiya) Date: Fri, 8 Jan 2016 23:55:36 +0900 Subject: [Unicode] Re: HENTAIGANA LETTER E-1 In-Reply-To: <5c73403150964b0a802b31df2ab5fa76@PS1PR04MB0953.apcprd04.prod.outlook.com> References: <584B230B-BFBF-4A93-95B3-6F1E463E6181@evertype.com> <90baabfa798243e6b41e4f3d6502ed3c@PS1PR04MB0953.apcprd04.prod.outlook.com> <568E8AB6.5010307@hiroshima-u.ac.jp> <5c73403150964b0a802b31df2ab5fa76@PS1PR04MB0953.apcprd04.prod.outlook.com> Message-ID: <568FCDE8.3020307@hiroshima-u.ac.jp> Garth Wallace wrote: > On Thu, Jan 7, 2016 at 7:56 AM, suzuki toshiya > wrote: >> Hi, >> >> I'm not a representative of the experts working for the >> proposal from Japan NB, but I could explain something. >> >> 1) "They never took that out?" I'm not sure who you mean >> "they" (UTC? JNB?), but it seems that no official document >> asking for the response from JNB is submitted in WG2. >> If UTC sends something officially, JNB would response >> something, I believe. > > I meant the JNB. I thought they had removed that character from the > later revised proposals that were posted on the UTC document register, > but I checked and I had apparently been mistaken. > > The issue is only raised in passing in a footnote in Mr. Lunde's feedback. I think HENTAIGANA LETTER E-1 is intentionally proposed to be coded separately, and no official document is sent to JNB, so it is still kept as it was before. >> 2) Difference in HENTAIGANA LETTER E-1 and U+1B001. >> >> U+1B001 is a character designed to note an ancient (and >> extinct in modern Japanese language) pronunciation YE. 
>> >> When standard kana was defined about 100 years ago, >> the pronunciation YE was already merged to E. >> Some scholars planned to use a few kana-like characters >> to note such pronunciation (to discuss about the ancient >> Japanese language pronunciation), and used some hentaigana- >> like glyphs for such purpose. As far as I know, there is >> no wide consensus that the glyph looking like U+1B001 was >> historically used to note YE mainly, when YE and E were >> distinctively used in Japanese language. > > AIUI they simply reused an existing hentaigana to make the > distinction, rather than making a new kana that just happened to look > exactly like it. It is difficult (for me) to judge U+1B001 has same identity with the hentaigana before kana standardization with similar appearance. The rationale to encode U+1B001 was justified by its unique phonetic value, so its character name is YE. It is normative. Some people may think they can identify the hentaigana by their glyph shapes only, but others may have different view. As the first proposal (L2/15-193) prioritized the (modern) phonetic value as the first key to identify the glyph, I think some user community would want to identify the glyph by the phonetic value. I don't say it is the best solution, but I say they have their own rationale. >> On the other hand, JNB's proposal does not include any >> ancient/extinct pronunciation, Their phonetic coverage >> is exactly same with modern Japanese language. So, >> the glyph looking like U+1B001 is not designed to note >> the pronunciation YE. The motivation why JNB proposed >> hentaigana would be just because of their shape differences. >> >> Therefore, U+1B001 and HENTAIGANA E-1 could be said as >> differently designed, their designed usages are different. >> Please do not think JNB hentaigana experts overlooked >> U+1B001 and proposed a duplicated encoding. They ought to >> have known it but proposed. > > It's not unknown for a single character to have more than one > pronunciation in different contexts. Is it easy to distinguish the contexts how the "unified U+1B001" should be pronounced (some case, it must be YE, some case, it must be E, some case, both of YE/E are acceptable)? I don't have good connection with the users community of U+1B001, so I cannot estimate which is easier (less troublesome for existing user communities) in separation or unification. Do you have any connection with the user community of U+1B001? >> However, some WG2 experts suggested to unify them because >> of the shape similarity. I'm not sure whether 2 glyphs are >> indistinctively similar for hentaigana scholars, but I >> accept with that some people are hard to distinguish. >> I cannot distinguish some Latin and Greek alphabets when >> they are displayed as single isolated character. > > We're not talking about about different scripts, though. Hentaigana > are obsolete hiragana (eliminated from modern written Japanese by a > spelling reform) but they are still hiragana. Latin and Greek, on the > other hand, are clearly separate but related scripts. I'm afraid that the counting how many scripts in the set of modern hiragana, U+1B001 and JNB proposal could depend on the people. Some people may count only 1, some people may count 2, some people may count 3. If there is stable consensus already, it could be used as the rational to unify, but, I don't think so. Anyway, Latin and Greek were not good example, I'm sorry. 
Regards, mpsuzuki From lists+unicode at seantek.com Sat Jan 9 17:27:04 2016 From: lists+unicode at seantek.com (Sean Leonard) Date: Sat, 9 Jan 2016 15:27:04 -0800 Subject: Unicode password mapping for crypto standard In-Reply-To: <20160105163705.GA4941@nic.fr> References: <568B54F8.5000802@seantek.com> <20160105163705.GA4941@nic.fr> Message-ID: <56919748.2000906@seantek.com> On 1/5/2016 8:37 AM, Stephane Bortzmeyer wrote: > On Mon, Jan 04, 2016 at 09:30:32PM -0800, > Sean Leonard wrote > a message of 120 lines which said: > >> how to take the Unicode input and get a consistent and reasonable >> stream of bits out on both ends. For example: should the password be >> case folded, converted to NFKC, encoded in UTF-8 vs. UTF-16BE, etc.? > There is already a standard on that, RFC 7613 "Preparation, > Enforcement, and Comparison of Internationalized Strings Representing > Usernames and Passwords" > and I suggest we use it and do not reinvent the wheel. > Hello (sorry for my delayed response): Yes, I am aware of PRECIS. I actually asked the PRECIS mailing list a couple of months ago but got no feedback. PRECIS is an overarching framework; it doesn't specify mappings in particular. So merely saying "PRECIS!" is not enough. In my proposal, the parameter "password-mapping" can take two relevant PRECIS forms: *precis *precis-XXX (where XXX is a registered profile name) In the first form, the mapping is defined by the OpaqueString profile, /as amended from time to time/. This is the PRECIS password profile but it doesn't specify a version or anything so additional characters may be admitted in the future or treated differently, as the standards get updated (including the Unicode standard). It is meant to be "living". In the second form, it's PRECIS but is fixed to the specific profile name. An interesting use case might be the recently registered "Nickname" class [RFC7700] and . In that profile, spaces are stripped and characters are treated case-insensitively with Unicode Default Case Folding (among other things). In applications where the encryption key is derived from a user handle, this might be a relevant profile to name. Compare with UsernameCaseMapped, etc. Sean From lists+unicode at seantek.com Sat Jan 9 17:30:45 2016 From: lists+unicode at seantek.com (Sean Leonard) Date: Sat, 9 Jan 2016 15:30:45 -0800 Subject: Unicode password mapping for crypto standard In-Reply-To: References: <568B54F8.5000802@seantek.com> Message-ID: <56919825.6080307@seantek.com> On 1/5/2016 8:26 AM, Markus Scherer wrote: > I would specify that UTF-8 must be used, without mapping. > US-ASCII is a proper subset, so need not be mentioned explicitly, nor > distinguished in the protocol. > Mappings would require that all implementations carry relevant data, > and are up to date to recent versions of Unicode, or else > previously-unassigned code points will cause failures. > As long as a user types the same password the same way, or with IMEs > that produce the same output, they are fine. Strange variants might > improve password security. Right. In PRECIS, UTF-8 is enforced. However as you point out, the issue is that "strange variants" exist, as well as different IMEs and different keyboard/keystroke combinations. A case in point is that 0xFF is not a valid UTF-8 octet. However, nothing constrains the underlying technology not to use 0xFF, so there should be a way for a user (or process) to force the use of specific octet strings as inputs. 
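
(A quick Python check of that point, purely illustrative: 0xFF can never occur in well-formed UTF-8, so an arbitrary octet string cannot in general be round-tripped through a Unicode string without some extra convention.)

raw = bytes([0x70, 0x77, 0xFF, 0x64])   # arbitrary octets, e.g. key material
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    print("not valid UTF-8:", err)
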
That is why the "password-mapping" parameter is proposed as a hint rather than a strict rule. Also as pointed out, PKCS#8 encrypted blobs are used within PKCS #12, which has its own Unicode mapping (based on UTF-16LE). Sean From albrecht.dreiheller at siemens.com Mon Jan 11 07:22:53 2016 From: albrecht.dreiheller at siemens.com (Dreiheller, Albrecht) Date: Mon, 11 Jan 2016 13:22:53 +0000 Subject: AW: Unicode in the Curriculum? In-Reply-To: References: <568AA4EB.5070606@ix.netcom.com> <568D362D.2020409@att.net> <568DA0ED.2050804@ix.netcom.com> Message-ID: <3E10480FE4510343914E4312AB46E74212B75C4A@DEFTHW99EH5MSX.ww902.siemens.net> From: Unicode [mailto:unicode-bounces at unicode.org] Im Auftrag von Shawn Steele Date: Donnerstag, 7. Januar 2016 00:27 To: Asmus Freytag (t); unicode at unicode.org Subject: RE: Unicode in the Curriculum? Then it should be UTF-8. Learning to do something in a non-Unicode code page and then redoing it for UTF-8 or UTF-16 merely leads to conversion problems, incompatibilities, and other nonsense. If someone ?needs? to not use UTF-16 for whatever reason, then they should use UTF-8. The ?advanced? training should be the other non-Unicode code pages. Teach them right the first time. They?ll never use a code page. -Shawn They'll never use a code page for encoding, I agree, but ? When setting up a requirement specification for a font manufacturer for a new font for Chinese (both simplified and traditional), Japanese or Korean, there is no easy way to define the character repertoire without refering to the code pages like GB2312, Big-5, JIS, etc. A.D. -------------- next part -------------- An HTML attachment was scrubbed... URL: From public at khwilliamson.com Mon Jan 11 16:42:37 2016 From: public at khwilliamson.com (Karl Williamson) Date: Mon, 11 Jan 2016 15:42:37 -0700 Subject: Trying to understand Line_Break property apparent discrepancy Message-ID: <56942FDD.9070805@khwilliamson.com> It appears that http://www.unicode.org/Public/8.0.0/ucd/auxiliary/LineBreakTest.txt is testing a tailoring rather than the default line break algorithm, contrary to its heading "# Default Line Break Test". And http://www.unicode.org/Public/UCD/latest/ucd/auxiliary/LineBreakTest.html follows along. For example, the default algorithm as shown in http://www.unicode.org/reports/tr14/#Table2 follows LB25, which is an approximation of the desired behavior. But the test and html don't follow this. I suspect they are looking for the tailoring described in http://www.unicode.org/reports/tr14/#Examples example 7. For example, the test file tests for, and the html says that a class CL code point followed by a class PO one is an unconditional line break opportunity, based on rule 999. (which is the same as LB31 in TR14) Whereas, http://www.unicode.org/reports/tr14/#Table2 says that a class CL code point followed by a class PO one is an "indirect break opportunity B % A is equivalent to B ? A and B SP+ ? A; in other words, do not break before A, unless one or more spaces follow B." This is by LB25 and LB18. There is a discrepancy here, which could be resolved either by changing the tests and html to follow LB25, or documenting that these are for something above and beyond the default algorithm. 
(There may also be other discrepancies that I haven't stumbled against) From public at khwilliamson.com Mon Jan 11 17:32:47 2016 From: public at khwilliamson.com (Karl Williamson) Date: Mon, 11 Jan 2016 16:32:47 -0700 Subject: Redundancy in TR14 Message-ID: <56943B9F.90701@khwilliamson.com> Example 7 in http://www.unicode.org/reports/tr14/#Examples has these two rules NU × (NU | SY | IS) NU (NU | SY | IS)* × (NU | SY | IS | CL | CP ) It appears to me that the first rule generates a subset of what the 2nd rule generates, and so is useless. It could be hence removed for simplicity, unless I'm missing something or there is a typo and it is meant to generate something else. From public at khwilliamson.com Mon Jan 11 18:16:56 2016 From: public at khwilliamson.com (Karl Williamson) Date: Mon, 11 Jan 2016 17:16:56 -0700 Subject: Trying to understand Line_Break property apparent discrepancy In-Reply-To: <56942FDD.9070805@khwilliamson.com> References: <56942FDD.9070805@khwilliamson.com> Message-ID: <569445F8.5090000@khwilliamson.com> On 01/11/2016 03:42 PM, Karl Williamson wrote: > It appears that > http://www.unicode.org/Public/8.0.0/ucd/auxiliary/LineBreakTest.txt is > testing a tailoring rather than the default line break algorithm, > contrary to its heading "# Default Line Break Test". And > http://www.unicode.org/Public/UCD/latest/ucd/auxiliary/LineBreakTest.html follows > along. > > For example, the default algorithm as shown in > http://www.unicode.org/reports/tr14/#Table2 follows LB25, which is an > approximation of the desired behavior. But the test and html don't > follow this. I suspect they are looking for the tailoring described in > http://www.unicode.org/reports/tr14/#Examples example 7. > > For example, the test file tests for, and the html says that a class CL > code point followed by a class PO one is an unconditional line break > opportunity, based on rule 999. (which is the same as LB31 in TR14) > > Whereas, http://www.unicode.org/reports/tr14/#Table2 says that a class > CL code point followed by a class PO one is an > > "indirect break opportunity B % A is equivalent to B × A and B > SP+ ÷ A; in other words, do not break before A, unless one or more > spaces follow B." This is by LB25 and LB18. > > There is a discrepancy here, which could be resolved either by changing > the tests and html to follow LB25, or documenting that these are for > something above and beyond the default algorithm. (There may also be > other discrepancies that I haven't stumbled against) > > > > Ooops. I didn't see this statement in the html file: "The Line Break tests use tailoring of numbers described in Example 7 of Section 8.2 Examples of Customization. They also differ from the results produced by a pair table implementation in sequences like: ZW SP CL." This explains everything. Please disregard the earlier email from me.
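
For anyone inspecting the test file by hand, here is a small Python sketch (the helper name is mine; it assumes the published notation of the UCD break-test files, where "÷" marks an allowed break, "×" a prohibited one, the hex fields are code points, and "#" starts a comment):

def parse_test_line(line):
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    text, breaks = [], []
    for field in data.split():
        if field in ("\u00f7", "\u00d7"):      # "÷" or "×"
            breaks.append(field == "\u00f7")   # True = break opportunity here
        else:
            text.append(chr(int(field, 16)))
    # breaks has len(text) + 1 entries: one before each character and one
    # at end of text.
    return "".join(text), breaks

print(parse_test_line("\u00d7 0041 \u00d7 0020 \u00f7 0042 \u00f7\t# A SP B"))
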
Among the many uses of code pages, this thread was focusing on training for computer scientists. If enlarging the subject to cover font design and possibly keyboard input as well is really useful, then from a German POV it might be interesting to look up the discussion at http://www.typografie.info/3/topic/26274-liste-unbedingt-notwendiger-zeichen/ Retrieved January 7, 2016. For *IT students* (and other people as well), the day they encounter their first ?U+?, it is straightforward either to look up some pieces of information about Unicode, since they have already a strong experience of the internet; or at least if they don?t (and anyway), to use the Contact form to submit their questions. While the interest *on the whole* won?t be missing, the actual problem is oversolliciting and misdirecting the interest through the entertainment and advertising industries. The attention as a limited resource is even uselessly threatened through the side-effects of consumption (food, ?). Checking these problems is a matter of on-going efforts. I just would like to complete the discussion on that side. [By this occasion I apologize for my last and previous e-mails; I hope I got some skill to stop bothering uselessly, and to hopefully focus on the topics I?m able to do some useful work in. Soon I should send a link FWIW.] Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Mon Jan 11 23:55:04 2016 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Tue, 12 Jan 2016 06:55:04 +0100 Subject: Redundancy in TR14 In-Reply-To: <56943B9F.90701@khwilliamson.com> References: <56943B9F.90701@khwilliamson.com> Message-ID: Looks that way to me too. Can you submit this as feedback? {phone} On Jan 12, 2016 00:39, "Karl Williamson" wrote: > Example 7 in http://www.unicode.org/reports/tr14/#Examples > > has these two rules > > NU ? (NU | SY | IS) > > NU (NU | SY | IS)* ? (NU | SY | IS | CL | CP ) > > It appears to me that the first rule generates a subset of what the 2nd > rule generates, and so is useless. It could be hence removed for > simplicity, unless I'm missing something or there is a typo and it is meant > to generate something else. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From public at khwilliamson.com Tue Jan 12 00:25:46 2016 From: public at khwilliamson.com (Karl Williamson) Date: Mon, 11 Jan 2016 23:25:46 -0700 Subject: Redundancy in TR14 In-Reply-To: References: <56943B9F.90701@khwilliamson.com> Message-ID: <56949C6A.4000306@khwilliamson.com> On 01/11/2016 10:55 PM, Mark Davis ?? wrote: > Looks that way to me too. Can you submit this as feedback? will do > > {phone} > > On Jan 12, 2016 00:39, "Karl Williamson" > wrote: > > Example 7 in http://www.unicode.org/reports/tr14/#Examples > > has these two rules > > NU ? (NU | SY | IS) > > NU (NU | SY | IS)* ? (NU | SY | IS | CL | CP ) > > It appears to me that the first rule generates a subset of what the > 2nd rule generates, and so is useless. It could be hence removed > for simplicity, unless I'm missing something or there is a typo and > it is meant to generate something else. 
> From drott at google.com Wed Jan 13 05:25:56 2016 From: drott at google.com (=?UTF-8?Q?Dominik_R=C3=B6ttsches?=) Date: Wed, 13 Jan 2016 13:25:56 +0200 Subject: Additional ZWJ prefixes in ZWJ emoji sequences page Message-ID: Hi, if I am not mistaken, there are a couple of additional, probably unintentional ZWJ prefixes in field count 1,2,3 and 4,5,6 in http://www.unicode.org/emoji/charts/emoji-zwj-sequences.html >From a hexdump of the page: 00008dd0 74 72 3e 0a 3c 74 72 3e 0a 3c 74 64 20 63 6c 61 |tr>..1.U+1F469 U| 00008e00 2b 32 30 30 44 20 55 2b 32 37 36 34 20 55 2b 46 |+200D U+2764 U+F| 00008e10 45 30 46 20 55 2b 32 30 30 44 20 55 2b 31 46 34 |E0F U+200D U+1F4| 00008e20 38 42 20 55 2b 32 30 30 44 20 55 2b 31 46 34 36 |8B U+200D U+1F46| 00008e30 38 3c 2f 74 64 3e 0a 3c 74 64 20 63 6c 61 73 73 |8........| So, after the U+003E '>', there is the e2 80 8d sequence of a ZWJ there in field 1. Perhaps someone could fix that. Thanks, Dominik From mark at macchiato.com Wed Jan 13 09:51:38 2016 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Wed, 13 Jan 2016 16:51:38 +0100 Subject: Additional ZWJ prefixes in ZWJ emoji sequences page In-Reply-To: References: Message-ID: You're right. It's between the closing > and the following ??? character \u003e *\u200d* \U0001f469 We'll see why that spurious character is there in the HTML. Mark On Wed, Jan 13, 2016 at 12:25 PM, Dominik R?ttsches wrote: > Hi, > > if I am not mistaken, there are a couple of additional, probably > unintentional ZWJ prefixes in field count 1,2,3 and 4,5,6 in > > http://www.unicode.org/emoji/charts/emoji-zwj-sequences.html > > From a hexdump of the page: > > 00008dd0 74 72 3e 0a 3c 74 72 3e 0a 3c 74 64 20 63 6c 61 |tr>.. cla| > > 00008de0 73 73 3d 27 72 63 68 61 72 73 27 3e 31 3c 2f 74 > |ss='rchars'>1 > 00008df0 64 3e 0a 3c 74 64 3e 55 2b 31 46 34 36 39 20 55 > |d>.U+1F469 U| > > 00008e00 2b 32 30 30 44 20 55 2b 32 37 36 34 20 55 2b 46 |+200D U+2764 > U+F| > > 00008e10 45 30 46 20 55 2b 32 30 30 44 20 55 2b 31 46 34 |E0F U+200D > U+1F4| > > 00008e20 38 42 20 55 2b 32 30 30 44 20 55 2b 31 46 34 36 |8B U+200D > U+1F46| > > 00008e30 38 3c 2f 74 64 3e 0a 3c 74 64 20 63 6c 61 73 73 |8. class| > > 00008e40 3d 27 63 68 61 72 73 27 3e e2 80 8d f0 9f 91 a9 > |='chars'>.......| > > > So, after the U+003E '>', there is the e2 80 8d sequence of a ZWJ > there in field 1. > > Perhaps someone could fix that. > > Thanks, > > Dominik > -------------- next part -------------- An HTML attachment was scrubbed... URL: From gwalla at gmail.com Wed Jan 13 15:39:26 2016 From: gwalla at gmail.com (Garth Wallace) Date: Wed, 13 Jan 2016 13:39:26 -0800 Subject: [Unicode] Re: HENTAIGANA LETTER E-1 In-Reply-To: <568FCDE8.3020307@hiroshima-u.ac.jp> References: <584B230B-BFBF-4A93-95B3-6F1E463E6181@evertype.com> <90baabfa798243e6b41e4f3d6502ed3c@PS1PR04MB0953.apcprd04.prod.outlook.com> <568E8AB6.5010307@hiroshima-u.ac.jp> <5c73403150964b0a802b31df2ab5fa76@PS1PR04MB0953.apcprd04.prod.outlook.com> <568FCDE8.3020307@hiroshima-u.ac.jp> Message-ID: On Fri, Jan 8, 2016 at 6:55 AM, suzuki toshiya wrote: > Garth Wallace wrote: >> On Thu, Jan 7, 2016 at 7:56 AM, suzuki toshiya >> wrote: >>> Hi, >>> >>> I'm not a representative of the experts working for the >>> proposal from Japan NB, but I could explain something. >>> >>> 1) "They never took that out?" I'm not sure who you mean >>> "they" (UTC? JNB?), but it seems that no official document >>> asking for the response from JNB is submitted in WG2. 
>>> If UTC sends something officially, JNB would response >>> something, I believe. >> >> I meant the JNB. I thought they had removed that character from the >> later revised proposals that were posted on the UTC document register, >> but I checked and I had apparently been mistaken. >> >> The issue is only raised in passing in a footnote in Mr. Lunde's feedback. > > I think HENTAIGANA LETTER E-1 is intentionally proposed > to be coded separately, and no official document is > sent to JNB, so it is still kept as it was before. > >>> 2) Difference in HENTAIGANA LETTER E-1 and U+1B001. >>> >>> U+1B001 is a character designed to note an ancient (and >>> extinct in modern Japanese language) pronunciation YE. >>> >>> When standard kana was defined about 100 years ago, >>> the pronunciation YE was already merged to E. >>> Some scholars planned to use a few kana-like characters >>> to note such pronunciation (to discuss about the ancient >>> Japanese language pronunciation), and used some hentaigana- >>> like glyphs for such purpose. As far as I know, there is >>> no wide consensus that the glyph looking like U+1B001 was >>> historically used to note YE mainly, when YE and E were >>> distinctively used in Japanese language. >> >> AIUI they simply reused an existing hentaigana to make the >> distinction, rather than making a new kana that just happened to look >> exactly like it. > > It is difficult (for me) to judge U+1B001 has same identity > with the hentaigana before kana standardization with similar > appearance. The rationale to encode U+1B001 was justified by > its unique phonetic value, so its character name is YE. It > is normative. Some people may think they can identify the > hentaigana by their glyph shapes only, but others may have > different view. As the first proposal (L2/15-193) prioritized > the (modern) phonetic value as the first key to identify the > glyph, I think some user community would want to identify the > glyph by the phonetic value. I don't say it is the best > solution, but I say they have their own rationale. The rationale for U+1B001, AIUI, was that it was used in some modern scholarly works about the history of the Japanese language to distinguish between /e/ and /je/ before they merged in the modern language. I don't know if historically that distinction existed in writing. The character name is normative. But the pronunciation is not, and I don't think the Unicode name should be taken to mean that it can only be used when a particular pronunciation is intended. Spelling and pronunciation are outside of Unicode's scope. >>> On the other hand, JNB's proposal does not include any >>> ancient/extinct pronunciation, Their phonetic coverage >>> is exactly same with modern Japanese language. So, >>> the glyph looking like U+1B001 is not designed to note >>> the pronunciation YE. The motivation why JNB proposed >>> hentaigana would be just because of their shape differences. >>> >>> Therefore, U+1B001 and HENTAIGANA E-1 could be said as >>> differently designed, their designed usages are different. >>> Please do not think JNB hentaigana experts overlooked >>> U+1B001 and proposed a duplicated encoding. They ought to >>> have known it but proposed. >> >> It's not unknown for a single character to have more than one >> pronunciation in different contexts. > > Is it easy to distinguish the contexts how the "unified U+1B001" > should be pronounced (some case, it must be YE, some case, it > must be E, some case, both of YE/E are acceptable)? 
I don't have > good connection with the users community of U+1B001, so I cannot > estimate which is easier (less troublesome for existing user > communities) in separation or unification. Do you have any > connection with the user community of U+1B001? I do not. For that matter, I'm not a member of the UTC. I've only read Nozomu Kat?'s original proposal and some of the documents that followed. >>> However, some WG2 experts suggested to unify them because >>> of the shape similarity. I'm not sure whether 2 glyphs are >>> indistinctively similar for hentaigana scholars, but I >>> accept with that some people are hard to distinguish. >>> I cannot distinguish some Latin and Greek alphabets when >>> they are displayed as single isolated character. >> >> We're not talking about about different scripts, though. Hentaigana >> are obsolete hiragana (eliminated from modern written Japanese by a >> spelling reform) but they are still hiragana. Latin and Greek, on the >> other hand, are clearly separate but related scripts. > > I'm afraid that the counting how many scripts in the set > of modern hiragana, U+1B001 and JNB proposal could depend > on the people. Some people may count only 1, some people > may count 2, some people may count 3. If there is stable > consensus already, it could be used as the rational to unify, > but, I don't think so. Anyway, Latin and Greek were not > good example, I'm sorry. You're right, it's unclear, though at least in Unicode terms I don't think you can really count 3. U+1B001 has the script property "hiragana", but that still leaves the question of whether hentaigana should be considered a separate script from hiragana. The proposal summary for L2/15-239 does say it's for a new script, named "hentaigana". However, elsewhere in that document it says "In year 1900, Japanese government selected one phonogram for each phonetic value and announced not to use other phonograms in elementary education. Afterward, the selected phonograms are called ?HIRAGANA? and others are called ?HENTAIGANA?, the meaning is variants of a HIRAGANA." Also, the original proposal was to encode them as Standard Variation Sequences of hiragana, which I think implies that the JNB, at least at that time, considered them to be variants of hiragana and not something other than hiragana. AIUI, and correct me if I'm wrong, hentaigana is a retronym; at the time they were in regular use they were used in combination with and interchangeably with the modern set of hiragana, and did not have an identity as a distinct set until the spelling reform of 1900. I believe that in Unicode, characters that were once used in a script but were later made obsolete are usually still considered part of the same script as the surviving set. That has been the case for Latin, at least. 
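
(A tiny Python check of the normative name under discussion, assuming a Python build whose unicodedata tables include Unicode 6.0 or later:)

import unicodedata
ch = "\U0001B001"
print(f"U+{ord(ch):05X} {unicodedata.name(ch)}")   # U+1B001 HIRAGANA LETTER ARCHAIC YE
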
From frederic.grosshans at gmail.com Thu Jan 14 06:04:49 2016 From: frederic.grosshans at gmail.com (=?UTF-8?Q?Fr=c3=a9d=c3=a9ric_Grosshans?=) Date: Thu, 14 Jan 2016 13:04:49 +0100 Subject: [Unicode] Re: HENTAIGANA LETTER E-1 In-Reply-To: References: <584B230B-BFBF-4A93-95B3-6F1E463E6181@evertype.com> <90baabfa798243e6b41e4f3d6502ed3c@PS1PR04MB0953.apcprd04.prod.outlook.com> <568E8AB6.5010307@hiroshima-u.ac.jp> <5c73403150964b0a802b31df2ab5fa76@PS1PR04MB0953.apcprd04.prod.outlook.com> <568FCDE8.3020307@hiroshima-u.ac.jp> Message-ID: <56978EE1.9070206@gmail.com> Le 13/01/2016 22:39, Garth Wallace a ?crit : > The rationale for U+1B001, AIUI, was that it was used in some modern > scholarly works about the history of the Japanese language to > distinguish between/e/ and/je/ before they merged in the modern > language. I don't know if historically that distinction existed in > writing. > > The character name is normative. But the pronunciation is not, and I > don't think the Unicode name should be taken to mean that it can only > be used when a particular pronunciation is intended. Spelling and > pronunciation are outside of Unicode's scope. Let us suppose *HENTAIGANA LETTER E-1 is to be unified with ?? U+1B001 HIRAGANA LETTER ARCHAIC YE. An annotation can be added to U+1B001 description to confirm this usage. If it is not enough, and an official HENTAIGANA name is desired for consistency, I think it is conceivable to add the following line in http://www.unicode.org/Public/UCD/latest/ucd/NameAliases.txt 1B001;HENTAIGANA LETTER E-1;alternate Fr?d?ric From charupdate at orange.fr Thu Jan 14 07:38:52 2016 From: charupdate at orange.fr (Marcel Schneider) Date: Thu, 14 Jan 2016 14:38:52 +0100 (CET) Subject: Unicode in the Curriculum? In-Reply-To: References: <568AA4EB.5070606@ix.netcom.com> <568D362D.2020409@att.net> <568DA0ED.2050804@ix.netcom.com> <568DA411.5000506@ix.netcom.com> Message-ID: <1262344245.11703.1452778732578.JavaMail.www@wwinf1n02> On January 7, 2016, at 00:39, Shawn Steele wrote: >> ? I think any training in non-Unicode character sets is beyond a standard curriculum, except perhaps History of Computing or Digital Archaeology :) > One could only hope. Since the topic widened to font design, one easily agrees that also in these curricula, Unicode is taught, and code pages are replaced with Unicode collections. Even the Multilingual European Subsets were originally declared to be an intermediate stage on the road towards the implementation of the whole UCS. I fully agree that code pages are to be relegated into the archives. *If* there is an exception for CJK fonts, it merely confirms the rule. Last fall we?ve seen the side effects of remnant code page use in the recognition of native languages in Northwest Territories. I apologize to all persons I?ve hurt. E.g. one may teach that Latin script is covered by the Unicode collections Basic Latin ? Latin-1 Supplement ? Latin Extended-A ? Latin Extended-B ? IPA Extensions ? Spacing Modifier Letters ? Combining Diacritical Marks ? Combining Diacritical Marks Extended ? Phonetic Extensions ? Phonetic Extensions Supplement ? Combining Diacritical Marks Supplement ? Latin Extended Additional ? General Punctuation ? Superscripts and Subscripts ? [most of] Currency Symbols ? Letterlike Symbols ? Number Forms ? Enclosed Alphanumerics ? Latin Extended-C ? Supplemental Punctuation ? Modifier Tone Letters ? Latin Extended-D ? Latin Extended-E ? Combining Half Marks ? Mathematical Alphanumeric Symbols ? Enclosed Alphanumeric Supplement, AFAIK. 
The more we add to the cart, the more the specified font will be useful?but the more it will be costly. Therefore cheaper fonts may restrict themselves to less collections or subsets of them, at risk of not covering e.g. U+2010 HYPHEN and U+02BC LETTER APOSTROPHE. I apologize again to all persons I?ve hurt in that other thread. In fact I felt that something is wrong, but above all I?was wrong myself. Looking for defaults on Unicode?s side was a big mistake. You are heroes. BTW, for keyboard input, there is strictly no problem on Windows. Typing the ??1,600?Latin characters +?punctuation is straightforward since we know how keyboard layout drivers work. There is mainly *one* long dead trans list, and almost every keyboard can have Kana on Right Alt, and Compose on Kana?+?Space. ISO/IEC?9995 should soon be revised to become ultimately fit for real mainstream computers. Microsoft didn?t wait for ISO/IEC?9995-2 to provide performative APIs, nor did Tavultesoft wait for ISO/IEC?9995-11 to provide performative UIs. Would it be possible to teach them too how a Unicode keyboard is made, and how KbdUTool works? And Keyman Developer? Perhaps in a lecture on C, or in a workshop on compilers, or in a lecture on UI design? Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From eric.muller at efele.net Sat Jan 16 09:00:17 2016 From: eric.muller at efele.net (Eric Muller) Date: Sat, 16 Jan 2016 07:00:17 -0800 Subject: The Chinese Typewriter: The Design and Science of East Asian Information Technology Message-ID: <569A5B01.9070701@efele.net> An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: typewriter.jpg Type: image/jpeg Size: 28061 bytes Desc: not available URL: From joe at unicode.org Sat Jan 16 12:30:51 2016 From: joe at unicode.org (Joe Becker) Date: Sat, 16 Jan 2016 10:30:51 -0800 Subject: The Chinese Typewriter: The Design and Science of East Asian Information Technology In-Reply-To: <569A5B01.9070701@efele.net> References: <569A5B01.9070701@efele.net> Message-ID: <569A8C5B.2080305@unicode.org> See, for starters ... https://www.youtube.com/watch?v=tdT-oFxc-C0 -- A Chinese Typewriter in Silicon Valley, Thomas S. Mullaney, Google Tech Talk, December 5, 2011 http://thechinesetypewriter.wordpress.com/ http://tsmullaney.com/ Zhou From c933103 at gmail.com Wed Jan 20 00:07:37 2016 From: c933103 at gmail.com (gfb hjjhjh) Date: Wed, 20 Jan 2016 14:07:37 +0800 Subject: Is it possible to choose rotational direction of vertical script if I want to force them to display horizontally? In-Reply-To: References: Message-ID: For instance, traditional Mongolian script write in vertical-lr mode (text run vertically from top to bottom, first line start on left), if you use css writing mode horizontal-tb (default) then you can force it horizontal by rotating each line of the text by 90 degree anticlockwise, and the resultant text would be ltr. However, I just read on a Chinese webpage http://www.zhihu.com/question/30727581 which claim there're a "traditional way" of writing Mongolian horizontally by rotating it 90 degree clockwise (despite I am not sure about what kind of tradition the webpage is referring to nor do i know is it legit.), is this achievable within current computer system like on webpage via css or in different word processors/office suite software? -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From everson at evertype.com Wed Jan 20 08:49:43 2016 From: everson at evertype.com (Michael Everson) Date: Wed, 20 Jan 2016 14:49:43 +0000 Subject: ISO 15924 updated Message-ID: <9D4E073B-B424-4747-9FCD-6428155C97BD@evertype.com> Two aliases have been added for the Jamo subset of Hangul, and for Han + Bopomofo. See http://www.unicode.org/iso15924/codechanges.html Michael Everson Registrar From quanxunzhen at gmail.com Wed Jan 20 01:08:06 2016 From: quanxunzhen at gmail.com (Xidorn Quan) Date: Wed, 20 Jan 2016 18:08:06 +1100 Subject: Is it possible to choose rotational direction of vertical script if I want to force them to display horizontally? In-Reply-To: References: Message-ID: On Wed, Jan 20, 2016 at 5:07 PM, gfb hjjhjh wrote: > However, I just read on a Chinese webpage > http://www.zhihu.com/question/30727581 which claim there're a "traditional > way" of writing Mongolian horizontally by rotating it 90 degree clockwise > (despite I am not sure about what kind of tradition the webpage is referring > to nor do i know is it legit.), It doesn't seem to me the post claims anything. AFAICS, it is just a question that, whether there exists any valid method to write Mongolian horizontally. > is this achievable within current computer > system like on webpage via css or in different word processors/office suite > software? It seems to be a question for www-style at w3.org, instead of the Unicode mailing list. - Xidorn From rwhlk142 at gmail.com Fri Jan 22 16:56:42 2016 From: rwhlk142 at gmail.com (Robert Wheelock) Date: Fri, 22 Jan 2016 17:56:42 -0500 Subject: Fwd: Drawing Souvenir-style letters for Hebrew script In-Reply-To: References: Message-ID: ---------- Forwarded message ---------- From: Robert Lloyd Wheelock Date: Wed, Jan 20, 2016 at 5:30 AM Subject: RE: Drawing Souvenir-style letters for Hebrew script To: Robert Lloyd Wheelock From: Robert Lloyd Wheelock Subject: RE: Drawing Souvenir-style letters for Hebrew script Hello! I tried my hands to design Hebrew letters using similarly-looking Latin/Roman ones in the font *Souvenir*, a playful retro round serif typeface. Souvenir's somewhat bending strokes lend nicely to create characters for Latin-Roman/Greek/Cyrillic, but present quite a challenge when designing characters for Hebrew/Arabic/Syro-Aramaic/Devanagari... . How would you employ softer, more rounded strokes to construct the 22 Hebrew letters with the 5 *sofith* (final forms for kaf/mem/nun/pe?/?adheh), so that they would harmonize well into the *Souvenir* font family?!?! Designing the vowel points and cantillation signs would be much easier. Shalom! Thank You! -- This mail was sent by Robert Lloyd Wheelock via The Open Siddur Project's contact form http://opensiddur.org/contact/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From as at signographie.de Sat Jan 23 05:58:31 2016 From: as at signographie.de (=?iso-8859-1?Q?Andreas_St=F6tzner?=) Date: Sat, 23 Jan 2016 12:58:31 +0100 Subject: Drawing Souvenir-style letters for Hebrew script In-Reply-To: References: Message-ID: <29C476FD-048C-45FC-B77F-54D49667EFBD@signographie.de> Am 22.01.2016 um 23:56 schrieb Robert Wheelock: > I tried my hands to design Hebrew letters using similarly-looking Latin/Roman ones in the font *Souvenir*, a playful retro round serif typeface. Souvenir's somewhat bending strokes lend nicely to create characters for Latin-Roman/Greek/Cyrillic, but present quite a challenge when designing characters for Hebrew/Arabic/Syro-Aramaic/Devanagari... . 
> > How would you employ softer, more rounded strokes to construct the 22 Hebrew letters with the 5 *sofith* (final forms for kaf/mem/nun/pe?/?adheh), so that they would harmonize well into the *Souvenir* font family?!?! this is surely an interesting matter but not one within the scope of the Unicode disc. list. You may wish to turn to the Typedrawers.com forum, where font design issues can get forwarded to a well-prepared audience. With kind regards, Andreas Stötzner _______________________________________________________________________________ Andreas Stötzner Gestaltung Signographie Fontentwicklung Haus des Buches Gerichtsweg 28, Raum 434 04103 Leipzig 0176-86823396 -------------- next part -------------- An HTML attachment was scrubbed... URL: From d3ck0r at gmail.com Sat Jan 30 08:40:23 2016 From: d3ck0r at gmail.com (J Decker) Date: Sat, 30 Jan 2016 06:40:23 -0800 Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers Message-ID: I do see that the code points D800-DFFF should not be encoded in any UTF format (UTF8/32)... UTF8 has a way to define any byte that might otherwise be used as an encoding byte. UTF16 has no way to define a code point that is D800-DFFF; this is an issue if I want to apply some sort of encryption algorithm and still have the result treated as text for transmission and encoding to other string systems. http://www.azillionmonkeys.com/qed/unicode.html lists Unicode private areas Area-A which is U-F0000:U-FFFFD and Area-B which is U-100000:U-10FFFD which will suffice for a workaround for my purposes.... For my purposes I will implement F0000-F0800 to be (code point minus D800 and then add F0000 (or vice versa)) and then encoded as a surrogate pair... it would have been super nice of unicode standards included a way to specify code point even if there isn't a language character assigned to that point. http://unicode.org/faq/utf_bom.html does say: "Q: Are there any 16-bit values that are invalid? A: Unpaired surrogates are invalid in UTFs. These include any value in the range D800 to DBFF not followed by a value in the range DC00 to DFFF, or any value in the range DC00 to DFFF not preceded by a value in the range D800 to DBFF " and "Q: How do I convert an unpaired UTF-16 surrogate to UTF-8? A different issue arises if an unpaired surrogate is encountered when converting ill-formed UTF-16 data. By represented such an unpaired surrogate on its own as a 3-byte sequence, the resulting UTF-8 data stream would become ill-formed. While it faithfully reflects the nature of the input, Unicode conformance requires that encoding form conversion always results in valid data stream. Therefore a converter must treat this as an error. " I did see these older messages...
(not that they talk about this much just more info) http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0204.html http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0209.html http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0210.html http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0201.html From doug at ewellic.org Sat Jan 30 15:05:52 2016 From: doug at ewellic.org (Doug Ewell) Date: Sat, 30 Jan 2016 14:05:52 -0700 Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers Message-ID: <5E72B7B357714F38A8B16215EFB36319@DougEwell> J Decker wrote: > UTF16 has no way to define a code point that is D800-DFFF; this is an > issue if I want to apply some sort of encryption algorithm and still > have the result treated as text for transmission and encoding to other > string systems. Unpaired surrogates are not valid Unicode text. If you want to encrypt data into 16-bit code units and have them treated as valid Unicode text, the encryption algorithm must not generate unpaired surrogates. This is not negotiable and not something you can be "partially" compliant on. See Unicode Conformance Requirement C1: "A process shall not interpret a high-surrogate code point or a low-surrogate code point as an abstract character." There's a reason this is "C1" and not farther down the list. It is fundamental to Unicode. > For my purposes I will implement F0000-F0800 to be (code point minus > D800 and then add F0000 (or vice versa)) and then encoded as a > surrogate pair... This is fine for a private implementation where you are sure no input will contain these PUA code points. Keep in mind that some people do use them -- for example, they are assigned in the ConScript Unicode Registry, which is unofficial and not affiliated with Unicode. > it would have been super nice of unicode standards > included a way to specify code point even if there isn't a language > character assigned to that point. It's not a question of whether a code point is assigned to a "language character." There are hundreds of thousands of unassigned code points that can be represented in any UTF, such as this one: ??, U+77777. But unpaired surrogates can *never* be assigned to a character. If they could, they would have failed in their basic purpose of extending UTF-16. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From chris.jacobs at xs4all.nl Sat Jan 30 15:29:35 2016 From: chris.jacobs at xs4all.nl (Chris Jacobs) Date: Sat, 30 Jan 2016 22:29:35 +0100 Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers In-Reply-To: <5E72B7B357714F38A8B16215EFB36319@DougEwell> References: <5E72B7B357714F38A8B16215EFB36319@DougEwell> Message-ID: Doug Ewell schreef op 2016-01-30 22:05: > J Decker wrote: > >> UTF16 has no way to define a code point that is D800-DFFF; this is an >> issue if I want to apply some sort of encryption algorithm and still >> have the result treated as text for transmission and encoding to other >> string systems. This is not an issue at all. You don't have to restrict the input to text to be able to generate an output that can be treated as text. Just, as a last step, apply e.g. UUENCODE or Base64. Look how PGP solves this. 
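A minimal JavaScript sketch of that last armoring step, assuming the preceding encryption stage has already produced a plain byte array (the variable names are invented for illustration):

    // bytes: a Uint8Array holding the encrypted output
    var armored = btoa(String.fromCharCode.apply(null, bytes)); // plain ASCII, safe to store or send as text
    var restored = new Uint8Array(atob(armored).split("").map(function (c) { return c.charCodeAt(0); }));
    // (for large buffers, build the intermediate binary string in chunks instead of one apply() call)

Once the bytes are armored this way the stored value is ordinary text, so no surrogate or normalization questions arise.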
Chris From doug at ewellic.org Sat Jan 30 15:46:39 2016 From: doug at ewellic.org (Doug Ewell) Date: Sat, 30 Jan 2016 14:46:39 -0700 Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers In-Reply-To: References: <5E72B7B357714F38A8B16215EFB36319@DougEwell> Message-ID: <04EF7083723A40ACAADA05B703602249@DougEwell> Chris Jacobs wrote: >>> UTF16 has no way to define a code point that is D800-DFFF; this is >>> an issue if I want to apply some sort of encryption algorithm and >>> still have the result treated as text for transmission and encoding >>> to other string systems. > > This is not an issue at all. You don't have to restrict the input to > text to be able to generate an output that can be treated as text. I gathered that J wanted to generate arbitrary output that could be interpreted as UTF-16 code units. I admit to being less than 100% sure of this. Certainly there is no shortage of algorithms to map arbitrary byte input to text output, usually limited to some subset of ASCII. One interesting approach for the Unicode era was Markus Scherer's "Base16k" concept, at https://sites.google.com/site/markusicu/unicode/base16k . -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From Shawn.Steele at microsoft.com Sat Jan 30 18:45:18 2016 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Sun, 31 Jan 2016 00:45:18 +0000 Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers In-Reply-To: References: Message-ID: Why do you need illegal unicode code points? -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of J Decker Sent: Saturday, January 30, 2016 6:40 AM To: unicode at unicode.org Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers I do see that the code points D800-DFFF should not be encoded in any UTF format (UTF8/32)... UTF8 has a way to define any byte that might otherwise be used as an encoding byte. UTF16 has no way to define a code point that is D800-DFFF; this is an issue if I want to apply some sort of encryption algorithm and still have the result treated as text for transmission and encoding to other string systems. http://www.azillionmonkeys.com/qed/unicode.html lists Unicode private areas Area-A which is U-F0000:U-FFFFD and Area-B which is U-100000:U-10FFFD which will suffice for a workaround for my purposes.... For my purposes I will implement F0000-F0800 to be (code point minus D800 and then add F0000 (or vice versa)) and then encoded as a surrogate pair... it would have been super nice of unicode standards included a way to specify code point even if there isn't a language character assigned to that point. http://unicode.org/faq/utf_bom.html does say: "Q: Are there any 16-bit values that are invalid? A: Unpaired surrogates are invalid in UTFs. These include any value in the range D800 to DBFF not followed by a value in the range DC00 to DFFF, or any value in the range DC00 to DFFF not preceded by a value in the range D800 to DBFF " and "Q: How do I convert an unpaired UTF-16 surrogate to UTF-8? A different issue arises if an unpaired surrogate is encountered when converting ill-formed UTF-16 data. By represented such an unpaired surrogate on its own as a 3-byte sequence, the resulting UTF-8 data stream would become ill-formed. While it faithfully reflects the nature of the input, Unicode conformance requires that encoding form conversion always results in valid data stream. Therefore a converter must treat this as an error. " I did see these older messages... 
(not that they talk about this much just more info) http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0204.html http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0209.html http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0210.html http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0201.html From d3ck0r at gmail.com Sat Jan 30 20:28:03 2016 From: d3ck0r at gmail.com (J Decker) Date: Sat, 30 Jan 2016 18:28:03 -0800 Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers In-Reply-To: References: Message-ID: On Sat, Jan 30, 2016 at 4:45 PM, Shawn Steele wrote: > Why do you need illegal unicode code points? This originated from learning Javascript; which is internally UTF-16. Playing with localStorage, some browsers use a sqlite3 database to store values. The database is UTF-8 so there must be a valid conversion between the internal UTF-16 and UTF-8 localStorage (and reverse). I wanted to obfuscate the data stored for a certain application; and cover all content that someone might send. Having slept on this, I realized that even if hieroglyphics were stored, if I pulled out the character using codePointAt() and applied a 20 bit random value to it using XOR it could end up as a normal character, and I wouldn't know I had to use a 20 bit value... so every character would have to use a 20 bit mask (which could end up with a value that's D800-DFFF). I've reconsidered and think for ease of implementation to just mask every UTF-16 character (not codepoint) with a 10 bit value, This will result in no character changing from BMP space to surrogate-pair or vice-versa. Thanks for the feedback. (sorry if I've used some terms inaccurately) > > -----Original Message----- > From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of J Decker > Sent: Saturday, January 30, 2016 6:40 AM > To: unicode at unicode.org > Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers > > I do see that the code points D800-DFFF should not be encoded in any UTF format (UTF8/32)... > > UTF8 has a way to define any byte that might otherwise be used as an encoding byte. > > UTF16 has no way to define a code point that is D800-DFFF; this is an issue if I want to apply some sort of encryption algorithm and still have the result treated as text for transmission and encoding to other string systems. > > http://www.azillionmonkeys.com/qed/unicode.html lists Unicode > private areas Area-A which is U-F0000:U-FFFFD and Area-B which is U-100000:U-10FFFD which will suffice for a workaround for my purposes.... > > For my purposes I will implement F0000-F0800 to be (code point minus > D800 and then add F0000 (or vice versa)) and then encoded as a surrogate pair... it would have been super nice of unicode standards included a way to specify code point even if there isn't a language character assigned to that point. > > http://unicode.org/faq/utf_bom.html > does say: "Q: Are there any 16-bit values that are invalid? > > A: Unpaired surrogates are invalid in UTFs. These include any value in the range D800 to DBFF not followed by a value in the range DC00 to DFFF, or any value in the range DC00 to DFFF not preceded by a value in the range D800 to DBFF " > > and "Q: How do I convert an unpaired UTF-16 surrogate to UTF-8? > > A different issue arises if an unpaired surrogate is encountered when converting ill-formed UTF-16 data. By represented such an unpaired surrogate on its own as a 3-byte sequence, the resulting UTF-8 data stream would become ill-formed. 
While it faithfully reflects the nature of the input, Unicode conformance requires that encoding form conversion always results in valid data stream. Therefore a converter must treat this as an error. " > > > > I did see these older messages... (not that they talk about this much just more info) http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0204.html > http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0209.html > http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0210.html > http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0201.html From prosfilaes at gmail.com Sat Jan 30 21:20:14 2016 From: prosfilaes at gmail.com (David Starner) Date: Sun, 31 Jan 2016 03:20:14 +0000 Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers In-Reply-To: References: Message-ID: Obfuscate is right. It might conceivably be better than nothing, but at its best it will stop someone for an hour or so. Why not run it through a standard encryption protocol and if necessary use one of the options mentioned before to turn it into valid text? On Sat, Jan 30, 2016, 6:31 PM J Decker wrote: > On Sat, Jan 30, 2016 at 4:45 PM, Shawn Steele > wrote: > > Why do you need illegal unicode code points? > > This originated from learning Javascript; which is internally UTF-16. > Playing with localStorage, some browsers use a sqlite3 database to > store values. The database is UTF-8 so there must be a valid > conversion between the internal UTF-16 and UTF-8 localStorage (and > reverse). I wanted to obfuscate the data stored for a certain > application; and cover all content that someone might send. Having > slept on this, I realized that even if hieroglyphics were stored, if I > pulled out the character using codePointAt() and applied a 20 bit > random value to it using XOR it could end up as a normal character, > and I wouldn't know I had to use a 20 bit value... so every character > would have to use a 20 bit mask (which could end up with a value > that's D800-DFFF). > > I've reconsidered and think for ease of implementation to just mask > every UTF-16 character (not codepoint) with a 10 bit value, This will > result in no character changing from BMP space to surrogate-pair or > vice-versa. > > Thanks for the feedback. > (sorry if I've used some terms inaccurately) > > > > > -----Original Message----- > > From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of J Decker > > Sent: Saturday, January 30, 2016 6:40 AM > > To: unicode at unicode.org > > Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair > specifiers > > > > I do see that the code points D800-DFFF should not be encoded in any UTF > format (UTF8/32)... > > > > UTF8 has a way to define any byte that might otherwise be used as an > encoding byte. > > > > UTF16 has no way to define a code point that is D800-DFFF; this is an > issue if I want to apply some sort of encryption algorithm and still have > the result treated as text for transmission and encoding to other string > systems. > > > > http://www.azillionmonkeys.com/qed/unicode.html lists Unicode > > private areas Area-A which is U-F0000:U-FFFFD and Area-B which is > U-100000:U-10FFFD which will suffice for a workaround for my purposes.... > > > > For my purposes I will implement F0000-F0800 to be (code point minus > > D800 and then add F0000 (or vice versa)) and then encoded as a surrogate > pair... it would have been super nice of unicode standards included a way > to specify code point even if there isn't a language character assigned to > that point. 
> > > > http://unicode.org/faq/utf_bom.html > > does say: "Q: Are there any 16-bit values that are invalid? > > > > A: Unpaired surrogates are invalid in UTFs. These include any value in > the range D800 to DBFF not followed by a value in the range DC00 to DFFF, > or any value in the range DC00 to DFFF not preceded by a value in the range > D800 to DBFF " > > > > and "Q: How do I convert an unpaired UTF-16 surrogate to UTF-8? > > > > A different issue arises if an unpaired surrogate is encountered when > converting ill-formed UTF-16 data. By represented such an unpaired > surrogate on its own as a 3-byte sequence, the resulting UTF-8 data stream > would become ill-formed. While it faithfully reflects the nature of the > input, Unicode conformance requires that encoding form conversion always > results in valid data stream. Therefore a converter must treat this as an > error. " > > > > > > > > I did see these older messages... (not that they talk about this much > just more info) > http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0204.html > > http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0209.html > > http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0210.html > > http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0201.html > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Shawn.Steele at microsoft.com Sun Jan 31 02:21:01 2016 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Sun, 31 Jan 2016 08:21:01 +0000 Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers In-Reply-To: References: Message-ID: Typically XOR?ing a constant isn?t really considered worth messing with. It?s somewhat trivial to figure out the key to un-XOR. On Sat, Jan 30, 2016, 6:31 PM J Decker > wrote: On Sat, Jan 30, 2016 at 4:45 PM, Shawn Steele > wrote: > Why do you need illegal unicode code points? This originated from learning Javascript; which is internally UTF-16. Playing with localStorage, some browsers use a sqlite3 database to store values. The database is UTF-8 so there must be a valid conversion between the internal UTF-16 and UTF-8 localStorage (and reverse). I wanted to obfuscate the data stored for a certain application; and cover all content that someone might send. Having slept on this, I realized that even if hieroglyphics were stored, if I pulled out the character using codePointAt() and applied a 20 bit random value to it using XOR it could end up as a normal character, and I wouldn't know I had to use a 20 bit value... so every character would have to use a 20 bit mask (which could end up with a value that's D800-DFFF). I've reconsidered and think for ease of implementation to just mask every UTF-16 character (not codepoint) with a 10 bit value, This will result in no character changing from BMP space to surrogate-pair or vice-versa. Thanks for the feedback. (sorry if I've used some terms inaccurately) > > -----Original Message----- > From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of J Decker > Sent: Saturday, January 30, 2016 6:40 AM > To: unicode at unicode.org > Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers > > I do see that the code points D800-DFFF should not be encoded in any UTF format (UTF8/32)... > > UTF8 has a way to define any byte that might otherwise be used as an encoding byte. 
> > UTF16 has no way to define a code point that is D800-DFFF; this is an issue if I want to apply some sort of encryption algorithm and still have the result treated as text for transmission and encoding to other string systems. > > http://www.azillionmonkeys.com/qed/unicode.html lists Unicode > private areas Area-A which is U-F0000:U-FFFFD and Area-B which is U-100000:U-10FFFD which will suffice for a workaround for my purposes.... > > For my purposes I will implement F0000-F0800 to be (code point minus > D800 and then add F0000 (or vice versa)) and then encoded as a surrogate pair... it would have been super nice of unicode standards included a way to specify code point even if there isn't a language character assigned to that point. > > http://unicode.org/faq/utf_bom.html > does say: "Q: Are there any 16-bit values that are invalid? > > A: Unpaired surrogates are invalid in UTFs. These include any value in the range D800 to DBFF not followed by a value in the range DC00 to DFFF, or any value in the range DC00 to DFFF not preceded by a value in the range D800 to DBFF " > > and "Q: How do I convert an unpaired UTF-16 surrogate to UTF-8? > > A different issue arises if an unpaired surrogate is encountered when converting ill-formed UTF-16 data. By represented such an unpaired surrogate on its own as a 3-byte sequence, the resulting UTF-8 data stream would become ill-formed. While it faithfully reflects the nature of the input, Unicode conformance requires that encoding form conversion always results in valid data stream. Therefore a converter must treat this as an error. " > > > > I did see these older messages... (not that they talk about this much just more info) http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0204.html > http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0209.html > http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0210.html > http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0201.html -------------- next part -------------- An HTML attachment was scrubbed... URL: From d3ck0r at gmail.com Sun Jan 31 03:27:19 2016 From: d3ck0r at gmail.com (J Decker) Date: Sun, 31 Jan 2016 01:27:19 -0800 Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers In-Reply-To: References: Message-ID: On Sun, Jan 31, 2016 at 12:21 AM, Shawn Steele wrote: > Typically XOR?ing a constant isn?t really considered worth messing with. > It?s somewhat trivial to figure out the key to un-XOR. > obviously. It's not constant, nor is it stored anywhere in the code or data. > > > On Sat, Jan 30, 2016, 6:31 PM J Decker wrote: > > On Sat, Jan 30, 2016 at 4:45 PM, Shawn Steele > wrote: >> Why do you need illegal unicode code points? > > This originated from learning Javascript; which is internally UTF-16. > Playing with localStorage, some browsers use a sqlite3 database to > store values. The database is UTF-8 so there must be a valid > conversion between the internal UTF-16 and UTF-8 localStorage (and > reverse). I wanted to obfuscate the data stored for a certain > application; and cover all content that someone might send. Having > slept on this, I realized that even if hieroglyphics were stored, if I > pulled out the character using codePointAt() and applied a 20 bit > random value to it using XOR it could end up as a normal character, > and I wouldn't know I had to use a 20 bit value... so every character > would have to use a 20 bit mask (which could end up with a value > that's D800-DFFF). 
> > I've reconsidered and think for ease of implementation to just mask > every UTF-16 character (not codepoint) with a 10 bit value, This will > result in no character changing from BMP space to surrogate-pair or > vice-versa. > > Thanks for the feedback. > (sorry if I've used some terms inaccurately) > >> >> -----Original Message----- >> From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of J Decker >> Sent: Saturday, January 30, 2016 6:40 AM >> To: unicode at unicode.org >> Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers >> >> I do see that the code points D800-DFFF should not be encoded in any UTF >> format (UTF8/32)... >> >> UTF8 has a way to define any byte that might otherwise be used as an >> encoding byte. >> >> UTF16 has no way to define a code point that is D800-DFFF; this is an >> issue if I want to apply some sort of encryption algorithm and still have >> the result treated as text for transmission and encoding to other string >> systems. >> >> http://www.azillionmonkeys.com/qed/unicode.html lists Unicode >> private areas Area-A which is U-F0000:U-FFFFD and Area-B which is >> U-100000:U-10FFFD which will suffice for a workaround for my purposes.... >> >> For my purposes I will implement F0000-F0800 to be (code point minus >> D800 and then add F0000 (or vice versa)) and then encoded as a surrogate >> pair... it would have been super nice of unicode standards included a way to >> specify code point even if there isn't a language character assigned to that >> point. >> >> http://unicode.org/faq/utf_bom.html >> does say: "Q: Are there any 16-bit values that are invalid? >> >> A: Unpaired surrogates are invalid in UTFs. These include any value in the >> range D800 to DBFF not followed by a value in the range DC00 to DFFF, or any >> value in the range DC00 to DFFF not preceded by a value in the range D800 to >> DBFF " >> >> and "Q: How do I convert an unpaired UTF-16 surrogate to UTF-8? >> >> A different issue arises if an unpaired surrogate is encountered when >> converting ill-formed UTF-16 data. By represented such an unpaired surrogate >> on its own as a 3-byte sequence, the resulting UTF-8 data stream would >> become ill-formed. While it faithfully reflects the nature of the input, >> Unicode conformance requires that encoding form conversion always results in >> valid data stream. Therefore a converter must treat this as an error. " >> >> >> >> I did see these older messages... (not that they talk about this much just >> more info) http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0204.html >> http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0209.html >> http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0210.html >> http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0201.html From chris.jacobs at xs4all.nl Sun Jan 31 10:31:45 2016 From: chris.jacobs at xs4all.nl (Chris Jacobs) Date: Sun, 31 Jan 2016 17:31:45 +0100 Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers In-Reply-To: References: Message-ID: <645d8c72efe9658e2e6580f95600fdcb@xs4all.nl> J Decker schreef op 2016-01-31 03:28: > I've reconsidered and think for ease of implementation to just mask > every UTF-16 character (not codepoint) with a 10 bit value, This will > result in no character changing from BMP space to surrogate-pair or > vice-versa. > > Thanks for the feedback. So you are still trying to handle the unarmed output as plaintext. 
Do you realize that if a string in the output is replaced by a canonical equivalent one this may mess up things because the originals are not canonical equivalent? From d3ck0r at gmail.com Sun Jan 31 11:56:11 2016 From: d3ck0r at gmail.com (J Decker) Date: Sun, 31 Jan 2016 09:56:11 -0800 Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers In-Reply-To: <645d8c72efe9658e2e6580f95600fdcb@xs4all.nl> References: <645d8c72efe9658e2e6580f95600fdcb@xs4all.nl> Message-ID: On Sun, Jan 31, 2016 at 8:31 AM, Chris Jacobs wrote: > > > J Decker schreef op 2016-01-31 03:28: >> >> I've reconsidered and think for ease of implementation to just mask >> every UTF-16 character (not codepoint) with a 10 bit value, This will >> result in no character changing from BMP space to surrogate-pair or >> vice-versa. >> >> Thanks for the feedback. > > > So you are still trying to handle the unarmed output as plaintext. > Do you realize that if a string in the output is replaced by a canonical > equivalent > one this may mess up things because the originals are not canonical > equivalent? > I see ... things like mentioned here http://websec.github.io/unicode-security-guide/character-transformations/ From chris.jacobs at xs4all.nl Sun Jan 31 12:07:57 2016 From: chris.jacobs at xs4all.nl (Chris Jacobs) Date: Sun, 31 Jan 2016 19:07:57 +0100 Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers In-Reply-To: References: <645d8c72efe9658e2e6580f95600fdcb@xs4all.nl> Message-ID: <41a4f348541db2b8a0ad5b191c3a1c64@xs4all.nl> J Decker schreef op 2016-01-31 18:56: > On Sun, Jan 31, 2016 at 8:31 AM, Chris Jacobs > wrote: >> >> >> J Decker schreef op 2016-01-31 03:28: >>> >>> I've reconsidered and think for ease of implementation to just mask >>> every UTF-16 character (not codepoint) with a 10 bit value, This >>> will >>> result in no character changing from BMP space to surrogate-pair or >>> vice-versa. >>> >>> Thanks for the feedback. >> >> >> So you are still trying to handle the unarmed output as plaintext. >> Do you realize that if a string in the output is replaced by a >> canonical >> equivalent >> one this may mess up things because the originals are not canonical >> equivalent? >> > I see ... things like mentioned here > http://websec.github.io/unicode-security-guide/character-transformations/ Yes especially the part about normalization. This would not only spoil the normalized string, but also, as the string can have a different length, for anything after that your ever-changing xor-values may go out of sync. From Shawn.Steele at microsoft.com Sun Jan 31 13:52:32 2016 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Sun, 31 Jan 2016 19:52:32 +0000 Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers In-Reply-To: <41a4f348541db2b8a0ad5b191c3a1c64@xs4all.nl> References: <645d8c72efe9658e2e6580f95600fdcb@xs4all.nl> <41a4f348541db2b8a0ad5b191c3a1c64@xs4all.nl> Message-ID: It should be understood that any algorithm that changes the Unicode character data to non-character data is therefore binary, and not Unicode. It's inappropriate to shove binary data into unicode streams because stuff will break. 
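A quick JavaScript illustration of that length change, in engines that provide String.prototype.normalize (the sample string is arbitrary):

    var decomposed = "e\u0301";                      // "e" + U+0301 COMBINING ACUTE ACCENT: 2 UTF-16 code units
    var composed = decomposed.normalize("NFC");      // U+00E9: 1 code unit
    console.log(decomposed.length, composed.length); // 2 1
    // If any layer normalizes the stored text, a scheme that XORs a running key over the
    // original code units can no longer line its key positions up with what it reads back.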
https://blogs.msdn.microsoft.com/shawnste/2005/09/26/avoid-treating-binary-data-as-a-string/ -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Chris Jacobs Sent: Sunday, January 31, 2016 10:08 AM To: J Decker Cc: unicode at unicode.org Subject: Re: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers J Decker schreef op 2016-01-31 18:56: > On Sun, Jan 31, 2016 at 8:31 AM, Chris Jacobs > wrote: >> >> >> J Decker schreef op 2016-01-31 03:28: >>> >>> I've reconsidered and think for ease of implementation to just mask >>> every UTF-16 character (not codepoint) with a 10 bit value, This >>> will result in no character changing from BMP space to >>> surrogate-pair or vice-versa. >>> >>> Thanks for the feedback. >> >> >> So you are still trying to handle the unarmed output as plaintext. >> Do you realize that if a string in the output is replaced by a >> canonical equivalent one this may mess up things because the >> originals are not canonical equivalent? >> > I see ... things like mentioned here > http://websec.github.io/unicode-security-guide/character-transformatio > ns/ Yes especially the part about normalization. This would not only spoil the normalized string, but also, as the string can have a different length, for anything after that your ever-changing xor-values may go out of sync. From verdy_p at wanadoo.fr Sun Jan 31 15:49:26 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 31 Jan 2016 22:49:26 +0100 Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers In-Reply-To: References: <645d8c72efe9658e2e6580f95600fdcb@xs4all.nl> <41a4f348541db2b8a0ad5b191c3a1c64@xs4all.nl> Message-ID: I also agree. To transport binary data over a plain-text format there are other common types, including Base64, Quoted-Printable (and you can also compress the binary data before this transformation, using Gzip, deflate... for example in MIME for emails; or compress it after this transformation only over the transport channel like in HTTP which natively supports transparent 8-bit streams, this solution being generally more performant). There's no reliable way to preserve the exact binary encoding of texts using invalid UTF sequences (including unpaired surrogates in UTF-16, or isolated surrogate code points and other non-characters in other UTFs, or forbidden byte values or restricted byte sequence in UTF-8) without using a binary envelope (which cannot preserve the same encoding of valid UTF sequences). Even by using another encoding scheme/encoding form or legacy charset mapped with Unicode (including GB and HKCS charsets), you will fail each time due to the canonical equivalences and the existing conforming conversions between all UTFs which are made to preserve the identity of characters, not the equality of their binary encodings. In summary, what you need is: - a transport-syntax (see HTTP for example) to allow decoding your envelope, and - a separate media-type (see HTTP and MIME for example, don't choose any one in "text/*", but in "binary/*" or possibly "application/*") or some filesystem convention or standards for file types (such as file name extensions in common Unix/Linux filesystems or FTP, or external metadata streams for file attributes such as in MacOS, or VMS, or even in NTFS and almost all HTTP-based filesystems) for your chosen binary encoding encapsulated in a text-compatible format. 
If your encoded document does not match exactly the strict text encoding conformances, it cannot be declared and handled at all as if it was valid text. You have to handle it as an opaque BLOB (as if they were data for a bitmap image or executable code, or a PKI encryption key, or a data signature such as SHA or an encrypted stream such as DES). Basic filesystems for Unix/Linux or FAT treat all their files as unrestricted blobs (that's why they use a separate data to represent its actual type to decode it with specific algorithms, the most common being filename extensions to determine the envelope format, then using internal data structures in this envelope such as MPEG, OGG, or XML with schemas validation, or ZIP archives embedding mutiple structured streams with some conventions) All these options are out of scope of the Unicode standard which is not made to transport and preserve the binary encodings, but is made purposely to allow transparent conversions between all conforming UTFs of valid text only (nothing else) and to support canonical equivalences as much as possible in "Unicode-conforming process", so that they'll be able to choose between these wellknown and standardized text representations. 2016-01-31 20:52 GMT+01:00 Shawn Steele : > It should be understood that any algorithm that changes the Unicode > character data to non-character data is therefore binary, and not Unicode. > It's inappropriate to shove binary data into unicode streams because stuff > will break. > > https://blogs.msdn.microsoft.com/shawnste/2005/09/26/avoid-treating-binary-data-as-a-string/ > > > -----Original Message----- > From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Chris > Jacobs > Sent: Sunday, January 31, 2016 10:08 AM > To: J Decker > Cc: unicode at unicode.org > Subject: Re: Encoding/Use of pontial unpaired UTF-16 surrogate pair > specifiers > > > > J Decker schreef op 2016-01-31 18:56: > > On Sun, Jan 31, 2016 at 8:31 AM, Chris Jacobs > > wrote: > >> > >> > >> J Decker schreef op 2016-01-31 03:28: > >>> > >>> I've reconsidered and think for ease of implementation to just mask > >>> every UTF-16 character (not codepoint) with a 10 bit value, This > >>> will result in no character changing from BMP space to > >>> surrogate-pair or vice-versa. > >>> > >>> Thanks for the feedback. > >> > >> > >> So you are still trying to handle the unarmed output as plaintext. > >> Do you realize that if a string in the output is replaced by a > >> canonical equivalent one this may mess up things because the > >> originals are not canonical equivalent? > >> > > I see ... things like mentioned here > > http://websec.github.io/unicode-security-guide/character-transformatio > > ns/ > > Yes especially the part about normalization. > This would not only spoil the normalized string, but also, as the string > can have a different length, for anything after that your ever-changing > xor-values may go out of sync. > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL:
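For completeness, a minimal JavaScript sketch of the route suggested above, using the Web Crypto API where it is available (the function names are invented and error handling is omitted): encrypt the text as bytes, armor those bytes as Base64 as sketched earlier in the thread, and store/label the result as opaque binary data rather than as Unicode text.

    // Encrypt a JS string into bytes with AES-GCM; the returned bytes are then Base64-armored for storage.
    function encryptForStorage(key, plainText) {
      var iv = crypto.getRandomValues(new Uint8Array(12));      // fresh nonce per message
      var data = new TextEncoder().encode(plainText);           // well-formed UTF-8 bytes
      return crypto.subtle.encrypt({ name: "AES-GCM", iv: iv }, key, data)
        .then(function (cipher) { return { iv: iv, cipher: new Uint8Array(cipher) }; });
    }

    function decryptFromStorage(key, iv, cipherBytes) {
      return crypto.subtle.decrypt({ name: "AES-GCM", iv: iv }, key, cipherBytes)
        .then(function (plain) { return new TextDecoder().decode(plain); });
    }

    // key: e.g. from crypto.subtle.generateKey({ name: "AES-GCM", length: 256 }, false, ["encrypt", "decrypt"])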