From mark at macchiato.com  Thu Oct 1 00:01:12 2015
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Thu, 1 Oct 2015 07:01:12 +0200
Subject: Unicode in passwords
In-Reply-To: <000601d0fbff$42881070$c7983150$@gmail.com>
References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <000601d0fbff$42881070$c7983150$@gmail.com>
Message-ID:

I've heard some concerns, mostly around the UI for people typing in passwords: they get frustrated when they have to type their password on different devices:

1. A device may not have keyboard mappings with all the keys for their language.
2. The keyboard mappings across devices vary in where they put keys, especially for minority script characters using some pattern of shift/alt/option/etc. So the pattern of keys that they use on one may be different than on another.
3. People are often 'blind' to the characters being entered: they just see a dot, for example. If the keyboards for their language are not standard, then that makes it difficult.
4. Even if they see, for an instant, the character they type, if the device doesn't have a font for their language's characters, it may be just a box.
5. Even if those are not true, the glyph may not be distinctive enough if the size is too small.

Mark

*« Il meglio è l'inimico del bene »*

On Thu, Oct 1, 2015 at 6:11 AM, Jonathan Rosenne wrote:

> For languages such as Java, passwords should be handled as byte arrays
> rather than strings. This may make it difficult to apply normalization.
>
> Jonathan Rosenne
>
> *From:* Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of *Clark S. Cox III
> *Sent:* Thursday, October 01, 2015 2:16 AM
> *To:* Hans Åberg
> *Cc:* unicode at unicode.org; John O'Conner
> *Subject:* Re: Unicode in passwords
>
> On 2015/09/30, at 13:29, Hans Åberg wrote:
>
> On 30 Sep 2015, at 18:33, John O'Conner wrote:
>
> Can you recommend any documents to help me understand potential issues (if
> any) for password policies and validation methods that allow characters
> from more "exotic" portions of the Unicode space?
>
> On UNIX computers, one computes a hash (like SHA-256), which is then used
> to authenticate the password up to a high probability. The hash is stored
> in the open, but it is not known how to compute the password from the hash,
> so knowing the hash does not easily allow authentication.
>
> So if the password is
>
> … normalized and then …
>
> encoded in say UTF-8 and then hashed, it would seem to take care of most
> problems.
>
> You really wouldn't want "Schlüssel" and "Schlüssel" being different
> passwords, would you? (Assuming that my mail client and/or OS is not
> interfering, the first is NFC, while the second is NFD.)

From marc at keyman.com  Thu Oct 1 00:19:35 2015
From: marc at keyman.com (Marc Durdin)
Date: Thu, 1 Oct 2015 05:19:35 +0000
Subject: Unicode in passwords
In-Reply-To:
References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <000601d0fbff$42881070$c7983150$@gmail.com>
Message-ID: <1CEDD746887FFF4B834688E7AF5FDA5A82323850@federation.tavultesoft.local>

That's a good list. A few other things I've seen:

1. Even if the user sees the character for an instant, complex script characters can be very puzzling as they appear differently and "out of order" when isolated.
2. The number of dots corresponds to the number of code points, which is misleading with complex scripts or advanced input methods: you won't necessarily see one dot per keystroke; in some cases, typing a character may replace a dot with another dot or even delete a dot.
3. Directionality can be frustrating. I've had to assist in situations where a user has set a new Windows password using a custom keyboard, and then been unable to log in, e.g. with Remote Desktop, or even with the standard Windows login screen.

iOS, for example, doesn't even allow the user to select a different input method for password boxes; it seems to always be Latin script only (even if you've removed all your Latin script keyboards from Settings).

Marc

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Mark Davis ☕️
Sent: Thursday, 1 October 2015 3:01 PM
To: Jonathan Rosenne
Cc: Unicode Public
Subject: Re: Unicode in passwords

I've heard some concerns, mostly around the UI for people typing in passwords: they get frustrated when they have to type their password on different devices:
1. A device may not have keyboard mappings with all the keys for their language.
2. The keyboard mappings across devices vary in where they put keys, especially for minority script characters using some pattern of shift/alt/option/etc. So the pattern of keys that they use on one may be different than on another.
3. People are often 'blind' to the characters being entered: they just see a dot, for example. If the keyboards for their language are not standard, then that makes it difficult.
4. Even if they see, for an instant, the character they type, if the device doesn't have a font for their language's characters, it may be just a box.
5. Even if those are not true, the glyph may not be distinctive enough if the size is too small.

Mark

« Il meglio è l'inimico del bene »

On Thu, Oct 1, 2015 at 6:11 AM, Jonathan Rosenne wrote:

For languages such as Java, passwords should be handled as byte arrays rather than strings. This may make it difficult to apply normalization.

Jonathan Rosenne

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Clark S. Cox III
Sent: Thursday, October 01, 2015 2:16 AM
To: Hans Åberg
Cc: unicode at unicode.org; John O'Conner
Subject: Re: Unicode in passwords

On 2015/09/30, at 13:29, Hans Åberg wrote:

On 30 Sep 2015, at 18:33, John O'Conner wrote:

Can you recommend any documents to help me understand potential issues (if any) for password policies and validation methods that allow characters from more "exotic" portions of the Unicode space?

On UNIX computers, one computes a hash (like SHA-256), which is then used to authenticate the password up to a high probability. The hash is stored in the open, but it is not known how to compute the password from the hash, so knowing the hash does not easily allow authentication.

So if the password is

… normalized and then …

encoded in say UTF-8 and then hashed, it would seem to take care of most problems.

You really wouldn't want "Schlüssel" and "Schlüssel" being different passwords, would you? (Assuming that my mail client and/or OS is not interfering, the first is NFC, while the second is NFD.)
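A minimal sketch, in Java (the language Jonathan mentions), of the normalize-then-hash flow Hans and Clark describe above. The specific choices (NFC, UTF-8, raw SHA-256, and the PasswordHashSketch class name) are illustrative assumptions, not anything proposed in the thread; a production system would add a salt and a deliberately slow password-hashing function, and going through String here cuts against the byte-array handling Jonathan recommends.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.text.Normalizer;
import java.util.Arrays;

public class PasswordHashSketch {
    // Normalize to NFC so that "Schlüssel" entered in NFC and in NFD
    // hashes identically, then encode as UTF-8 and digest.
    static byte[] hashPassword(String password) throws Exception {
        String nfc = Normalizer.normalize(password, Normalizer.Form.NFC);
        byte[] utf8 = nfc.getBytes(StandardCharsets.UTF_8);
        return MessageDigest.getInstance("SHA-256").digest(utf8);
    }

    public static void main(String[] args) throws Exception {
        String nfcForm = "Schl\u00FCssel";   // U+00FC, precomposed
        String nfdForm = "Schlu\u0308ssel";  // u + U+0308 COMBINING DIAERESIS
        System.out.println(Arrays.equals(hashPassword(nfcForm),
                                         hashPassword(nfdForm)));  // prints: true
    }
}

Without the normalization step, the two inputs above would produce different digests even though they are canonically equivalent.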
From richard.wordingham at ntlworld.com  Thu Oct 1 02:33:22 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 1 Oct 2015 08:33:22 +0100
Subject: Unicode in passwords
In-Reply-To:
References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <000601d0fbff$42881070$c7983150$@gmail.com>
Message-ID: <20151001083322.5440cc2a@JRWUBU2>

On Thu, 1 Oct 2015 07:01:12 +0200
Mark Davis ☕️ wrote:

> I've heard some concerns, mostly around the UI for people typing in
> passwords: they get frustrated when they have to type their
> password on different devices:
>
> 1. A device may not have keyboard mappings with all the keys for
> their language.

The typographers will probably give English as an example! Where's the en dash key?

> 2. The keyboard mappings across devices vary in where they put keys,
> especially for minority script characters using some pattern of
> shift/alt/option/etc. So the pattern of keys that they use on one
> may be different than on another.

Even ASCII can have problems. A password containing '#' and '|' can't be entered when a physical US keyboard (102 keys) is interpreted using a mapping for a British keyboard (103 keys). (There seem to be different conventions as to which key is missing.)

Richard.

From richard.wordingham at ntlworld.com  Thu Oct 1 02:42:28 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 1 Oct 2015 08:42:28 +0100
Subject: UAX #29, Unicode Text Segmentation, update to improve Mongolian word segmentation
In-Reply-To: <560C4E6D.4050004@unicode.org>
References: <560C4E6D.4050004@unicode.org>
Message-ID: <20151001084228.63589572@JRWUBU2>

On Wed, 30 Sep 2015 14:04:45 -0700 announcements at unicode.org wrote:

> For further background on this issue and possible ways to address it,
> see PRI #308, /Property Change for U+202F NARROW NO-BREAK SPACE (NNBSP)/.

Is this the announcement of PRI #308?

Richard.

From mathias at qiwi.be  Thu Oct 1 02:59:22 2015
From: mathias at qiwi.be (Mathias Bynens)
Date: Thu, 1 Oct 2015 09:59:22 +0200
Subject: Unicode in passwords
In-Reply-To: <1CEDD746887FFF4B834688E7AF5FDA5A82323850@federation.tavultesoft.local>
References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <000601d0fbff$42881070$c7983150$@gmail.com> <1CEDD746887FFF4B834688E7AF5FDA5A82323850@federation.tavultesoft.local>
Message-ID: <0D7A9A89-4B53-421F-BCEF-CFD975B5F11B@qiwi.be>

> On 1 Oct 2015, at 07:19, Marc Durdin wrote:
>
> 2. The number of dots corresponds to the number of code points, which is
> misleading with complex scripts or advanced input methods: you won't
> necessarily see one dot per keystroke; in some cases, typing a character
> may replace a dot with another dot or even delete a dot.

Lots of systems have a bug where supplementary code points show up as two dots instead of one, due to UTF-16 being used internally. OS X is an example. Demo (open in your browser): data:text/html,

From mark at macchiato.com  Thu Oct 1 05:18:47 2015
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Thu, 1 Oct 2015 12:18:47 +0200
Subject: Unicode in passwords
In-Reply-To: <20151001083322.5440cc2a@JRWUBU2>
References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <000601d0fbff$42881070$c7983150$@gmail.com> <20151001083322.5440cc2a@JRWUBU2>
Message-ID:

As to #1, my note needs some clarification.
For characters that don't typically occur on *any* keyboards, people don't typically use those in their passwords, so switching between different devices doesn't matter. (One caveat would be where the password dialog permits selection from a palette. That way it is independent of device.)

The problem comes in where someone uses (as I do) a Mac, a Windows box, a Chromebook, and an Android tablet & phone. The Mac makes it easy to type an em-dash, to use your example. It is slightly less easy on Android, a real pain on Windows, and I haven't even tried on a Chromebook (maybe easy, maybe not, just haven't tried). So for me to use an em-dash in a password would just be opening up to annoyance.

I just had a quick look, and it appears that on the latest systems we have data for in CLDR, em-dash is typeable (somehow) on:

- all of the Android keyboards
- 85% of the OS X keyboards
- 27% of the Chrome OS keyboards
- 9% of the Windows keyboards

http://www.unicode.org/cldr/charts/28/keyboards/chars2keyboards.html

It's even somewhat uglier in the case where I'm typing a password on a borrowed/public computing device (although typing a password on such a device may not be exactly a great idea from a security standpoint!).

Mark

*« Il meglio è l'inimico del bene »*

On Thu, Oct 1, 2015 at 9:33 AM, Richard Wordingham <richard.wordingham at ntlworld.com> wrote:

> On Thu, 1 Oct 2015 07:01:12 +0200
> Mark Davis ☕️ wrote:
>
> > I've heard some concerns, mostly around the UI for people typing in
> > passwords: they get frustrated when they have to type their
> > password on different devices:
> >
> > 1. A device may not have keyboard mappings with all the keys for
> > their language.
>
> The typographers will probably give English as an example! Where's
> the en dash key?
>
> > 2. The keyboard mappings across devices vary in where they put keys,
> > especially for minority script characters using some pattern of
> > shift/alt/option/etc. So the pattern of keys that they use on one
> > may be different than on another.
>
> Even ASCII can have problems. A password containing '#' and '|' can't
> be entered when a physical US keyboard (102 keys) is interpreted using
> a mapping for a British keyboard (103 keys). (There seem to be
> different conventions as to which key is missing.)
>
> Richard.

From A.Schappo at lboro.ac.uk  Thu Oct 1 07:56:48 2015
From: A.Schappo at lboro.ac.uk (Andre Schappo)
Date: Thu, 1 Oct 2015 12:56:48 +0000
Subject: Unicode in passwords
In-Reply-To: <20151001083322.5440cc2a@JRWUBU2>
References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <000601d0fbff$42881070$c7983150$@gmail.com> <20151001083322.5440cc2a@JRWUBU2>
Message-ID: <7BD7361C-AAFD-40D9-9D14-5B41295EAE6D@lboro.ac.uk>

On 1 Oct 2015, at 08:33, Richard Wordingham wrote:

> Even ASCII can have problems. A password containing '#' and '|' can't
> be entered when a physical US keyboard (102 keys) is interpreted using
> a mapping for a British keyboard (103 keys). (There seem to be
> different conventions as to which key is missing.)

I used to have a # in one of my passwords. It used to be fun finding where the # key was on a computer's default pre-login keyboard mapping, which frequently did not match what was printed on the physical keys. I became quite adept at it, and it certainly made for a more secure password because of the challenge of finding # on the keyboard.
I, personally, would really like to have a non-ASCII Unicode password. When choosing a non-ASCII Unicode password, I would test to make sure I could enter it on all the devices I use.

André Schappo

From richard.wordingham at ntlworld.com  Thu Oct 1 11:26:33 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 1 Oct 2015 17:26:33 +0100
Subject: NNBSP and Word Boundaries
Message-ID: <20151001172633.2a72f48f@JRWUBU2>

The background document for PRI #308 (Property Change for NNBSP), http://www.unicode.org/review/pri308/pri308-background.html , says,

"The only other widely noted use for U+202F NNBSP is for representation of the thin non-breaking space (espace fine insécable) regularly seen next to certain punctuation marks in French style typography. However, the word segmentation change for U+202F should have no impact in that context, as ExtendNumLet is explicitly for preventing breaks between letters, but does not prevent the identification of word boundaries next to punctuation marks."

Unfortunately, this isn't quite true. In the text fragment " dit : " (with an NNBSP before the colon), there would be internal word boundaries before 'd' and before and after ':', but the word isolated would be the four characters "dit" plus the trailing NNBSP.

One solution would be to replace NNBSP by U+2009 THIN SPACE, for with untailored line-breaking there would be no line break between it and the 't' or colon, but there would be a word break between the 't' and the thin space.

The problem is that characters with property ExtendNumLet can be the first or last character of a word as well as a character strictly within a word. In this respect, the property differs from characters with the property MidNumLet. The problem with using that property instead is that such characters, such as FULL STOP, may be flanked by letters or numbers within a word, but not both. The problem then arises with the Mongolian analogue of '4th' etc. - it is written digit, NNBSP, letters, and is a single word.

Richard.

From mark at macchiato.com  Fri Oct 2 02:25:01 2015
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Fri, 2 Oct 2015 09:25:01 +0200
Subject: NNBSP and Word Boundaries
In-Reply-To: <20151001172633.2a72f48f@JRWUBU2>
References: <20151001172633.2a72f48f@JRWUBU2>
Message-ID:

Like Andy, I'm hesitant about changing the gc of NNBSP, because of backwards compatibility concerns.

I'm also starting to think that scoping the wb change to Mongolian may not be a bad thing. We might want to explore what it would look like, since it would preserve the maximum compatibility for current use of NNBSP with French and other languages. (The use of NNBSP in French, although not all that common, I suspect would swamp (in terms of frequency of usage) the use with Mongolian, simply because the amount of text worldwide in French is so much greater.)

Context

The proposed WB change is from XX to EX

Old relevant props:
WB ; EX ; ExtendNumLet
WB ; LE ; ALetter
WB ; XX ; Other

Old rules with EX:
WB13a (AHLetter | Numeric | Katakana | ExtendNumLet) × ExtendNumLet
WB13b ExtendNumLet × (AHLetter | Numeric | Katakana)

====

Off the top of my head, perhaps something like:

We add:
WB ; ML ; Mongolian_Letter
WB ; NN ; NNBSP // maybe different name

We change the contents of LE and XX to move characters to the two new value sets. E.g., ML gets http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:scx=/Mong/:]&[:wb=ALetter:]

We change the "macro" AHLetter to (ALetter | Hebrew_Letter | Mongolian_Letter)

*At this point, all behaves the same; that is just a 'refactoring'.*

Now we can modify the behavior for sequences with NN adjacent to ML. We add:

WB13c Mongolian_Letter × NNBSP
WB13d NNBSP × Mongolian_Letter

*If* we want to also change behavior on the other side of the NNBSP, whenever the Mongolian_Letter and NNBSP occur in sequence, we add 2 additional rules (with the appropriate values for ..., like Numeric):

WB13c Mongolian_Letter NNBSP × (...)
WB13d (...) × NNBSP Mongolian_Letter

From lists+unicode at seantek.com  Fri Oct 2 23:46:58 2015
From: lists+unicode at seantek.com (Sean Leonard)
Date: Fri, 2 Oct 2015 21:46:58 -0700
Subject: Acquiring DIS 10646
Message-ID: <560F5DC2.4080507@seantek.com>

As part of yet more research, I would like to get a hold of DIS 10646, aka Draft International Standard ISO/IEC 10646.1 (circa 1990 or 1991). I understand that Draft 2 (10646.2) was accepted and therefore became ISO/IEC 10646-1:1993.

Therefore, I am looking for a copy (preferably free, preferably online) of DIS 10646. Maybe also the final one too. Does anyone know how to get it/them?

Thank you,

Sean

From michel at suignard.com  Sat Oct 3 00:28:35 2015
From: michel at suignard.com (Michel Suignard)
Date: Sat, 3 Oct 2015 05:28:35 +0000
Subject: Acquiring DIS 10646
In-Reply-To: <560F5DC2.4080507@seantek.com>
References: <560F5DC2.4080507@seantek.com>
Message-ID:

ISO never keeps previous versions of standards. You can look into the WG2 web site at dkuug.dk, which will give you some versions of these documents (Google or your favorite search engine will be your friend), although all that may disappear any day. If you tell me what you are looking for I can help you. Bear in mind that anything that ISO does is copyrighted. Therefore, forget about a free online version of DIS 10646 of whatever version you are looking for.

There is a reason that Unicode (all versions still visible, archive up to 2000 increasingly visible) is a much better source for references.

Michel

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Sean Leonard
Sent: Friday, October 2, 2015 9:47 PM
To: unicode at unicode.org
Subject: Acquiring DIS 10646

As part of yet more research, I would like to get a hold of DIS 10646, aka Draft International Standard ISO/IEC 10646.1 (circa 1990 or 1991). I understand that Draft 2 (10646.2) was accepted and therefore became ISO/IEC 10646-1:1993.

Therefore, I am looking for a copy (preferably free, preferably online) of DIS 10646. Maybe also the final one too. Does anyone know how to get it/them?

Thank you,

Sean

From lists+unicode at seantek.com  Sat Oct 3 10:15:55 2015
From: lists+unicode at seantek.com (Sean Leonard)
Date: Sat, 3 Oct 2015 08:15:55 -0700
Subject: Acquiring DIS 10646
In-Reply-To:
References: <560F5DC2.4080507@seantek.com>
Message-ID: <560FF12B.8010105@seantek.com>

Thanks.

Well, "DIS 10646" is the Draft International Standard, particularly Draft 1, from ~1990 or ~1991. (Sometimes it might have been called 10646.1.) Therefore it would likely only be in print form (or printed and scanned form). It's pretty old. What I understand is that Draft 1 got shot down because it was at variance with the nascent Unicode effort; Draft 2 was eventually adopted as ISO 10646:1993, and is equivalent to Unicode 1.1.
(10646-1:1993 plus Amendments 5 to 7 = Unicode 2.0.)

Sean

On 10/2/2015 10:28 PM, Michel Suignard wrote:
> ISO never keeps previous versions of standards. You can look into the wg2
> web site at dkuug.dk that will give you some versions of these documents
> (Google or your favorite search engine will be your friend) although all
> that may disappear any day. If you tell me what you are looking for I can
> help you. Bear in mind that anything that ISO does is copyrighted.
> Therefore, forget about a free online version of DIS 10646 of whatever
> version you are looking for.
> There is a reason that Unicode (all versions still visible, archive up to
> 2000 increasingly visible) is a much better source for references.
>
> Michel

From doug at ewellic.org  Sat Oct 3 13:00:12 2015
From: doug at ewellic.org (Doug Ewell)
Date: Sat, 3 Oct 2015 12:00:12 -0600
Subject: Acquiring DIS 10646
In-Reply-To:
References:
Message-ID: <5EE740EFB8CD44AB92F133A3A5B0D544@DougEwell>

Sean Leonard wrote:

> What I understand is that Draft 1 got shot down because it was at
> variance with the nascent Unicode effort;

If I remember correctly, Draft 1 looked a lot like an updated and expanded version of ISO 2022, much more than it did like today's Unicode/10646.

--
Doug Ewell | http://ewellic.org | Thornton, CO

From jsbien at mimuw.edu.pl  Sat Oct 3 13:24:29 2015
From: jsbien at mimuw.edu.pl (Janusz S. Bien)
Date: Sat, 03 Oct 2015 20:24:29 +0200
Subject: Acquiring DIS 10646
In-Reply-To: <5EE740EFB8CD44AB92F133A3A5B0D544@DougEwell>
References: <5EE740EFB8CD44AB92F133A3A5B0D544@DougEwell>
Message-ID: <20151003202429.31618900k3rm3hi5@mail.mimuw.edu.pl>

Quote/Cytat - Doug Ewell (Sat 03 Oct 2015 08:00:12 PM CEST):

> Sean Leonard wrote:
>
>> What I understand is that Draft 1 got shot down because it was at
>> variance with the nascent Unicode effort;
>
> If I remember correctly, Draft 1 looked a lot like an updated and
> expanded version of ISO 2022, much more than it did like today's
> Unicode/10646.

Rob Pike, Ken Thompson
Hello World
http://plan9.bell-labs.com/sys/doc/utf.html

The draft of ISO 10646 was not very attractive to us. It defined a sparse set of 32-bit characters, which would be hard to implement and have punitive storage requirements. Also, the draft attempted to mollify national interests by allocating 16-bit subspaces to national committees to partition individually. The suggested mode of use was to "flip" between separate national standards to implement the international standard.

Regards

Janusz

--
Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

From asmus-inc at ix.netcom.com  Sat Oct 3 14:28:50 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Sat, 3 Oct 2015 12:28:50 -0700
Subject: Acquiring DIS 10646
In-Reply-To: <560FF12B.8010105@seantek.com>
References: <560F5DC2.4080507@seantek.com> <560FF12B.8010105@seantek.com>
Message-ID: <56102C72.9010008@ix.netcom.com>

An HTML attachment was scrubbed...
From lists+unicode at seantek.com  Sat Oct 3 14:35:38 2015
From: lists+unicode at seantek.com (Sean Leonard)
Date: Sat, 3 Oct 2015 12:35:38 -0700
Subject: Acquiring DIS 10646
In-Reply-To: <20151003202429.31618900k3rm3hi5@mail.mimuw.edu.pl>
References: <5EE740EFB8CD44AB92F133A3A5B0D544@DougEwell> <20151003202429.31618900k3rm3hi5@mail.mimuw.edu.pl>
Message-ID: <56102E0A.6020503@seantek.com>

On 10/3/2015 11:24 AM, Janusz S. Bien wrote:
> Quote/Cytat - Doug Ewell (Sat 03 Oct 2015 08:00:12 PM CEST):
>
>> Sean Leonard wrote:
>>
>>> What I understand is that Draft 1 got shot down because it was at
>>> variance with the nascent Unicode effort;
>>
>> If I remember correctly, Draft 1 looked a lot like an updated and
>> expanded version of ISO 2022, much more than it did like today's
>> Unicode/10646.
>
> Rob Pike, Ken Thompson
> Hello World
>
> http://plan9.bell-labs.com/sys/doc/utf.html
>
> The draft of ISO 10646 was not very attractive to us. It defined a
> sparse set of 32-bit characters, which would be hard to implement and
> have punitive storage requirements. Also, the draft attempted to
> mollify national interests by allocating 16-bit subspaces to national
> committees to partition individually. The suggested mode of use was to
> "flip" between separate national standards to implement the
> international standard.

Yes, that's the one.

Sean

From lists+unicode at seantek.com  Sun Oct 4 07:30:53 2015
From: lists+unicode at seantek.com (Sean Leonard)
Date: Sun, 4 Oct 2015 05:30:53 -0700
Subject: Acquiring DIS 10646
In-Reply-To: <56102C72.9010008@ix.netcom.com>
References: <560F5DC2.4080507@seantek.com> <560FF12B.8010105@seantek.com> <56102C72.9010008@ix.netcom.com>
Message-ID: <56111BFD.4000703@seantek.com>

On 10/3/2015 12:28 PM, Asmus Freytag (t) wrote:
> On 10/3/2015 8:15 AM, Sean Leonard wrote:
>> Thanks.
>>
>> Well, "DIS 10646" is the Draft International Standard, particularly
>> Draft 1, from ~1990 or ~1991. (Sometimes it might have been called
>> 10646.1.) Therefore it would likely only be in print form (or printed
>> and scanned form). It's pretty old. What I understand is that Draft 1
>> got shot down because it was at variance with the nascent Unicode
>> effort; Draft 2 was eventually adopted as ISO 10646:1993, and is
>> equivalent to Unicode 1.1. (10646-1:1993 plus Amendments 5 to 7 =
>> Unicode 2.0.)
>
> Sean,
>
> you never explained your specific interest in this matter. Personal
> curiosity? An attempt to write the definitive history of character encoding?

A long time ago, in a galaxy far, far away....

(Okay, it really was not that long ago, and it was pretty close at hand since it was on this list)

I proposed adding C1 Control Pictures to Unicode. I am resurrecting that effort, but more slowly this time, with more research and input from implementers.

The requirement is that all glyphs for U+0000 - U+00FF be graphically distinct. Debuggers used to do this by referencing the graphemes in the hardware code page, such as Code Page 437, but we have come a long way from 1981, so displaying ♣ for 0x05 does not make much modern sense. Merely substituting one of the other legacy code pages in for 0x80 - 0x9F does not make sense either. The characters of Code Page 437 overlap with U+00A0 - U+00FF in that range, for example. (Windows-1252 is somewhat more defensible, but Windows-1252 has 5 unassigned code points so it would be incomplete.)
Sean

From richard.wordingham at ntlworld.com  Sun Oct 4 08:02:01 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sun, 4 Oct 2015 14:02:01 +0100
Subject: Deleting Lone Surrogates
Message-ID: <20151004140201.21c9f941@JRWUBU2>

In the absence of a specific tailoring, is the combination of a lone surrogate and a combining mark a user-perceived character? Does a lone surrogate constitute a user-perceived character?

The problem I have is that because of an application-specific bug, when I attempt to enter the sequence <U+1148F TIRHUTA LETTER VA, U+114BA TIRHUTA SIGN CANDRABINDU>, I appear to be getting the UTF-16 code unit sequence <D805, DC8F, D805, D805, DCBA>, which is being interpreted as the codepoint sequence <U+1148F, U+D805, U+114BA>.

(The problem seems to arise because I use a sequence of two key strokes to enter candrabindu, and the application or input mechanism has to undo the entry of a supplementary character entered in response to the first keystroke. I've reported the problem as Bug 94753.)

Because the lone surrogate is interpreted as the start of a user-perceived character, I can move the cursor to between U+1148F and U+D805. Then pressing the 'delete' key (as opposed to the 'rubout' key) will delete the U+D805. However, if the lone surrogate plus combining mark is a user-perceived character, then all I will be left with is <U+1148F>.

From mark at macchiato.com  Sun Oct 4 08:44:32 2015
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Sun, 4 Oct 2015 15:44:32 +0200
Subject: Deleting Lone Surrogates
In-Reply-To: <20151004140201.21c9f941@JRWUBU2>
References: <20151004140201.21c9f941@JRWUBU2>
Message-ID:

When I use http://unicode.org/cldr/utility/breaks.jsp, it does show the sequence ????? as just two grapheme clusters.

In #29 we are specifically not concerned about ill-formed text (or other degenerate cases). I suppose it would be possible to handle isolated surrogates in a different way (eg always breaking) if it represented a common problem, but someone would have to make a very good case for that.

Mark

*« Il meglio è l'inimico del bene »*

On Sun, Oct 4, 2015 at 3:02 PM, Richard Wordingham <richard.wordingham at ntlworld.com> wrote:

> In the absence of a specific tailoring, is the combination of a lone
> surrogate and a combining mark a user-perceived character? Does a lone
> surrogate constitute a user-perceived character?
>
> The problem I have is that because of an application-specific bug,
> when I attempt to enter the sequence <U+1148F TIRHUTA LETTER VA,
> U+114BA TIRHUTA SIGN CANDRABINDU>, I appear to be getting the UTF-16 code
> unit sequence <D805, DC8F, D805, D805, DCBA>, which is being interpreted as
> the codepoint sequence <U+1148F, U+D805, U+114BA>.
>
> (The problem seems to arise because I use a sequence of two key strokes
> to enter candrabindu, and the application or input mechanism has to undo
> the entry of a supplementary character entered in response to the first
> keystroke. I've reported the problem as Bug 94753.)
>
> Because the lone surrogate is interpreted as the start of a
> user-perceived character, I can move the cursor to between U+1148F and
> U+D805. Then pressing the 'delete' key (as opposed to the 'rubout'
> key) will delete the U+D805. However, if the lone surrogate plus
> combining mark is a user-perceived character, then all I will be left
> with is <U+1148F>.
> At present the offending application is treating
> Tirhuta combining marks as user-perceived characters, but I suspect the
> application has simply not caught up with Unicode Version 7 yet.
>
> Richard.

From verdy_p at wanadoo.fr  Sun Oct 4 08:56:42 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sun, 4 Oct 2015 15:56:42 +0200
Subject: Deleting Lone Surrogates
In-Reply-To: <20151004140201.21c9f941@JRWUBU2>
References: <20151004140201.21c9f941@JRWUBU2>
Message-ID:

IMHO, isolated surrogates are not valid starters for combining sequences; they must remain isolated: deleting this surrogate in your text editor should not delete the following combining mark, which is a separate cluster (even if that cluster is defective before the deletion, as it has NO base starter).

For default grapheme clusters, it would be helpful to add a rule to force a cluster break before and after any lone surrogate (i.e. for grapheme cluster breaking, treat any lone surrogate as if it were a control like NUL U+0000).

2015-10-04 15:02 GMT+02:00 Richard Wordingham <richard.wordingham at ntlworld.com>:

> In the absence of a specific tailoring, is the combination of a lone
> surrogate and a combining mark a user-perceived character? Does a lone
> surrogate constitute a user-perceived character?
>
> The problem I have is that because of an application-specific bug,
> when I attempt to enter the sequence <U+1148F TIRHUTA LETTER VA,
> U+114BA TIRHUTA SIGN CANDRABINDU>, I appear to be getting the UTF-16 code
> unit sequence <D805, DC8F, D805, D805, DCBA>, which is being interpreted as
> the codepoint sequence <U+1148F, U+D805, U+114BA>.
>
> (The problem seems to arise because I use a sequence of two key strokes
> to enter candrabindu, and the application or input mechanism has to undo
> the entry of a supplementary character entered in response to the first
> keystroke. I've reported the problem as Bug 94753.)
>
> Because the lone surrogate is interpreted as the start of a
> user-perceived character, I can move the cursor to between U+1148F and
> U+D805. Then pressing the 'delete' key (as opposed to the 'rubout'
> key) will delete the U+D805. However, if the lone surrogate plus
> combining mark is a user-perceived character, then all I will be left
> with is <U+1148F>. At present the offending application is treating
> Tirhuta combining marks as user-perceived characters, but I suspect the
> application has simply not caught up with Unicode Version 7 yet.
>
> Richard.

From markus.icu at gmail.com  Sun Oct 4 12:50:43 2015
From: markus.icu at gmail.com (Markus Scherer)
Date: Sun, 4 Oct 2015 10:50:43 -0700
Subject: Deleting Lone Surrogates
In-Reply-To:
References: <20151004140201.21c9f941@JRWUBU2>
Message-ID:

I would not spend any time specifying intricate rules for unpaired surrogates in 16-bit strings, or out-of-range values in 32-bit strings. Most processing will treat them like unassigned characters, like U+50005, with only default behaviors.

markus
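To make concrete what an unpaired surrogate in a 16-bit string looks like in practice, here is a small Java sketch; the helper name loneSurrogates is invented for illustration and does not come from any message above. It reports the index of each surrogate code unit that is not part of a well-formed pair, such as the stray 0xD805 in Richard's example.

import java.util.ArrayList;
import java.util.List;

public class LoneSurrogateScan {
    // Return the indices of unpaired surrogate code units in a UTF-16 string.
    static List<Integer> loneSurrogates(String s) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                i++;  // well-formed surrogate pair: skip both units
            } else if (Character.isSurrogate(c)) {
                indices.add(i);  // high with no low after it, or a stray low
            }
        }
        return indices;
    }

    public static void main(String[] args) {
        // VA (D805 DC8F), stray D805, CANDRABINDU (D805 DCBA)
        String s = "\uD805\uDC8F\uD805\uD805\uDCBA";
        System.out.println(loneSurrogates(s));  // prints: [2]
    }
}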
From verdy_p at wanadoo.fr  Sun Oct 4 13:53:25 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sun, 4 Oct 2015 20:53:25 +0200
Subject: Deleting Lone Surrogates
In-Reply-To:
References: <20151004140201.21c9f941@JRWUBU2>
Message-ID:

The default behavior for unassigned characters is to treat them like base characters, so if they are followed by a combining mark, it would create a default grapheme cluster, which is not appropriate here. Surrogates are not characters (so they cannot have any character properties), but they are assigned, and so don't have "default" properties (only meant for *unassigned* codepoints).

I still think that it is safer to treat them (for text segmentation purposes) as pure isolates, i.e. exactly like basic controls such as U+0000 NUL, or such as U+FFFD REPLACEMENT CHARACTER, which is typically used as a visible placeholder for various errors.

For normalisation purposes they should also have combining class 0 (i.e. acting as blockers against reorderings for canonical equivalences), and not be "transparent" (discarded and bypassed as if those surrogates were not present at all).

2015-10-04 19:50 GMT+02:00 Markus Scherer:

> I would not spend any time specifying intricate rules for unpaired
> surrogates in 16-bit strings, or out-of-range values in 32-bit strings.
> Most processing will treat them like unassigned characters, like U+50005,
> with only default behaviors.
> markus

From asmus-inc at ix.netcom.com  Sun Oct 4 14:21:11 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Sun, 4 Oct 2015 12:21:11 -0700
Subject: Acquiring DIS 10646
In-Reply-To: <56111BFD.4000703@seantek.com>
References: <560F5DC2.4080507@seantek.com> <560FF12B.8010105@seantek.com> <56102C72.9010008@ix.netcom.com> <56111BFD.4000703@seantek.com>
Message-ID: <56117C27.6070905@ix.netcom.com>

An HTML attachment was scrubbed...

From asmus-inc at ix.netcom.com  Sun Oct 4 14:30:23 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Sun, 4 Oct 2015 12:30:23 -0700
Subject: Deleting Lone Surrogates
In-Reply-To: <20151004140201.21c9f941@JRWUBU2>
References: <20151004140201.21c9f941@JRWUBU2>
Message-ID: <56117E4F.9010300@ix.netcom.com>

An HTML attachment was scrubbed...

From richard.wordingham at ntlworld.com  Sun Oct 4 14:30:25 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sun, 4 Oct 2015 20:30:25 +0100
Subject: Deleting Lone Surrogates
In-Reply-To:
References: <20151004140201.21c9f941@JRWUBU2>
Message-ID: <20151004203025.605f0ae6@JRWUBU2>

On Sun, 4 Oct 2015 15:44:32 +0200
Mark Davis ☕️ wrote:

> When I use http://unicode.org/cldr/utility/breaks.jsp, it does show
> the sequence ????? as just two grapheme clusters.

But that's the sequence <U+1148F, U+FFFD, U+114BA>, which has no lone surrogates at all! (I had to look at the raw email file to be sure of what the text was - my email client displays U+FFFD and malformed alleged UTF-8 the same.) I believe I would have a good chance of repairing that by replacing U+FFFD by nothing.

It's not even certain that the substitution to replace U+FFFD would work. With a more fully supported script in LibreOffice, I would have to switch 'CTL diacritic' matching off and hope that substitution replaced the shortest match. That currently works for replacing one Thai consonant by another.
To systematically replace a non-spacing Thai character by another, I have to resort to 'regular expression' search and replace. I must hope that they never choose to interpret the search as matching extended grapheme clusters.

Do all Unicode character properties extend to all codepoints? If not, how does one tell which do and which don't? If the Unicode segmentation algorithms do apply to sequences of codepoints, as opposed to merely to Unicode strings, then <U+D805, U+114BA> indeed is a legacy grapheme cluster. It's an extremely unhelpful one!

> In #29 we are specifically not concerned about ill-formed text (or
> other degenerate cases). I suppose it would be possible to handle
> isolated surrogates in a different way (eg always breaking) if it
> represented a common problem, but someone would have to make a very
> good case for that.

I suppose the argument will go that by using rare scripts or obsolete characters, one deserves all the problems that one gets. The only widely used script where one is likely to encounter lone surrogates is CJK, and they are less of a problem there.

Ideally, one shouldn't get isolated surrogates, but when one does, the mechanisms intended to prevent them occurring can make dealing with them difficult.

Richard.

From richard.wordingham at ntlworld.com  Sun Oct 4 14:38:02 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sun, 4 Oct 2015 20:38:02 +0100
Subject: Deleting Lone Surrogates
In-Reply-To:
References: <20151004140201.21c9f941@JRWUBU2>
Message-ID: <20151004203802.2189da64@JRWUBU2>

On Sun, 4 Oct 2015 10:50:43 -0700 Markus Scherer wrote:

> I would not spend any time specifying intricate rules for unpaired
> surrogates in 16-bit strings, or out-of-range values in 32-bit
> strings. Most processing will treat them like unassigned characters,
> like U+50005, with only default behaviors.

The core problem here is that many editors will not allow one to delete just a non-initial character from a grapheme cluster. I fear there may be editors that don't even allow one to delete the final character. This may not be a problem when one works with a small set of grapheme clusters, as in French or German, or possibly even Vietnamese, but becomes a problem when working with such a large set that the notion of them being user-perceived characters strains credulity.

A stray U+50005 before a combining mark would also be fiddly to get rid of, but even if the editor does not allow the entry of arbitrary scalar values, a user might fix the problem by creating an HTML file containing the character and then copying the character from the HTML file to a find and replace command. This trick is unlikely to work for a lone surrogate.

Richard.

From verdy_p at wanadoo.fr  Sun Oct 4 14:48:12 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sun, 4 Oct 2015 21:48:12 +0200
Subject: Deleting Lone Surrogates
In-Reply-To: <20151004203025.605f0ae6@JRWUBU2>
References: <20151004140201.21c9f941@JRWUBU2> <20151004203025.605f0ae6@JRWUBU2>
Message-ID:

2015-10-04 21:30 GMT+02:00 Richard Wordingham <richard.wordingham at ntlworld.com>:

> On Sun, 4 Oct 2015 15:44:32 +0200
> Mark Davis ☕️ wrote:
> > When I use http://unicode.org/cldr/utility/breaks.jsp, it does show
> > the sequence ????? as just two grapheme clusters.
>
> But that's the sequence <U+1148F, U+FFFD, U+114BA>, which has no lone
> surrogates at all! (I had to look at the raw email file to be sure of
> what the text was - my email client displays U+FFFD and malformed
> alleged UTF-8 the same.)

Mark just said that it was what was shown, i.e. the lone surrogate got treated as U+FFFD.

However, my opinion is that ????? (using U+FFFD substitution) gives 2 grapheme clusters; I would prefer a solution that gives 3 grapheme clusters, as if the lone surrogate were a line-break control, so that the third character (combining, but just after the lone surrogate) will not combine with it but will be handled as a defective combining sequence with no starter at all before it.

From richard.wordingham at ntlworld.com  Sun Oct 4 16:18:42 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sun, 4 Oct 2015 22:18:42 +0100
Subject: Deleting Lone Surrogates
In-Reply-To: <56117E4F.9010300@ix.netcom.com>
References: <20151004140201.21c9f941@JRWUBU2> <56117E4F.9010300@ix.netcom.com>
Message-ID: <20151004221842.1b90cdfe@JRWUBU2>

On Sun, 4 Oct 2015 12:30:23 -0700 "Asmus Freytag (t)" wrote:

> If you have a bug that doesn't let you enter a sequence without
> creating a lone surrogate followed by a combining mark, that's a
> bug...

Unfortunately, the bug appears to be in an ill-defined interface in which I have observed regression even within the BMP. We've discussed the ambiguity of 'delete one character' in the context of normalisation before on this list, and the surest solution seemed to be for the application to surrender some control of its 'backing store' to the input method. It's conceivable that the input methods that are compatible for the BMP are incompatible in the supplementary planes.

For now, I'm going to have to either work round the problem by using dead keys instead or be thankful that the application hasn't caught up with Unicode 7.0.

Richard.

From asmus-inc at ix.netcom.com  Sun Oct 4 16:29:16 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Sun, 4 Oct 2015 14:29:16 -0700
Subject: Deleting Lone Surrogates
In-Reply-To: <20151004203802.2189da64@JRWUBU2>
References: <20151004140201.21c9f941@JRWUBU2> <20151004203802.2189da64@JRWUBU2>
Message-ID: <56119A2C.3060907@ix.netcom.com>

An HTML attachment was scrubbed...

From richard.wordingham at ntlworld.com  Sun Oct 4 16:35:56 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sun, 4 Oct 2015 22:35:56 +0100
Subject: Deleting Lone Surrogates
In-Reply-To:
References: <20151004140201.21c9f941@JRWUBU2> <20151004203025.605f0ae6@JRWUBU2>
Message-ID: <20151004223556.770f2c68@JRWUBU2>

On Sun, 4 Oct 2015 21:48:12 +0200 Philippe Verdy wrote:

> 2015-10-04 21:30 GMT+02:00 Richard Wordingham <richard.wordingham at ntlworld.com>:
> > On Sun, 4 Oct 2015 15:44:32 +0200
> > Mark Davis ☕️ wrote:
> > > When I use http://unicode.org/cldr/utility/breaks.jsp, it does
> > > show the sequence ????? as just two grapheme clusters.
> > But that's the sequence <U+1148F, U+FFFD, U+114BA>, which has no
> > lone surrogates at all!
> Mark just said that it was what was shown, i.e. the lone surrogate got
> treated as U+FFFD.

That's not what the English says, and I'm surprised if that's what a literal translation into French means. I do half suspect that he actually tried to post a lone surrogate.

> However, my opinion is that ????? (using U+FFFD substitution) gives 2
> grapheme clusters; I would prefer a solution that gives 3 grapheme
> clusters, as if the lone surrogate were a line-break control, so that
> the third character (combining, but just after the lone surrogate)
> will not combine with it but will be handled as a defective combining
> sequence with no starter at all before it.
I'd much prefer to be able to delete the first character of a grapheme cluster. It's annoying to have to retype 4 characters because one's mistyped the first of the 4 characters in a grapheme cluster. Removing the restriction would be much more useful.

Richard.

From asmus-inc at ix.netcom.com  Sun Oct 4 17:34:13 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Sun, 4 Oct 2015 15:34:13 -0700
Subject: Deleting Lone Surrogates
In-Reply-To: <20151004223556.770f2c68@JRWUBU2>
References: <20151004140201.21c9f941@JRWUBU2> <20151004203025.605f0ae6@JRWUBU2> <20151004223556.770f2c68@JRWUBU2>
Message-ID: <5611A965.3010304@ix.netcom.com>

An HTML attachment was scrubbed...

From richard.wordingham at ntlworld.com  Sun Oct 4 17:54:30 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sun, 4 Oct 2015 23:54:30 +0100
Subject: NNBSP and Word Boundaries
In-Reply-To:
References: <20151001172633.2a72f48f@JRWUBU2>
Message-ID: <20151004235430.74bca234@JRWUBU2>

On Fri, 2 Oct 2015 09:25:01 +0200 Mark Davis ☕️ wrote:

> We add:
>
> WB13c Mongolian_Letter × NNBSP
> WB13d NNBSP × Mongolian_Letter
>
> *If* we want to also change behavior on the other side of the NNBSP,
> whenever the Mongolian_Letter and NNBSP occur in sequence, we add 2
> additional rules (with the appropriate values for ..., like Numeric)
>
> WB13c Mongolian_Letter NNBSP × (...)
> WB13d (...) × NNBSP Mongolian_Letter

I'll assume the last two are meant to be WB13e and WB13f.

We can achieve the effects down to the first WB13d simply by changing NNBSP from XX to MidNumLet. This would also provide a proper "espace fine" for French use within numbers (https://www.druide.com/enquetes/pour-des-espaces-ins%C3%A9cables-impeccables) to separate groups of 3 digits. This needs *no* extra rules.

Now for combined numbers and letters, we might consider adding the two rules:

WB12a Numeric MidNumLet × AHLetter
WB12b Numeric × MidNumLet AHLetter

I think we should go the whole hog, and instead have

WB12c (Numeric|AHLetter) MidNumLetQ × (Numeric|AHLetter)
WB12d (Numeric|AHLetter) × MidNumLetQ (Numeric|AHLetter)

Perhaps there are good reasons against them - I'm not aware of any. (I don't think it is wrong to treat "no.2" as a single word.) These rules would make the abbreviated names of a good many Thai forms (e.g. ??.?, a marriage certificate) into a single word.

WB12c and WB12d overlap with WB6, WB7, WB11 and WB12, which could be slightly simplified.

Richard.

From richard.wordingham at ntlworld.com  Sun Oct 4 18:14:13 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 5 Oct 2015 00:14:13 +0100
Subject: Deleting Lone Surrogates
In-Reply-To: <56119A2C.3060907@ix.netcom.com>
References: <20151004140201.21c9f941@JRWUBU2> <20151004203802.2189da64@JRWUBU2> <56119A2C.3060907@ix.netcom.com>
Message-ID: <20151005001413.74e6dae4@JRWUBU2>

On Sun, 4 Oct 2015 14:29:16 -0700 "Asmus Freytag (t)" wrote:

> On 10/4/2015 12:38 PM, Richard Wordingham wrote:
> The problem you are trying to solve is to allow editing on
> the code point level, or, if you will, the keystroke level.
> Generally, there will be a sweet spot for each language (and each
> user) with respect to what to erase or undo.
> For sequences that belong to a given language, you can pick the
> behavior that makes most sense in them, but for lone surrogates, by
> definition you are dealing with broken text that doesn't follow any
> conventions.

Who's 'you'? Customisation is frequently not available. In fact, I don't recall seeing it on offer.
> It should also be something that doesn't occur commonly. So, for all
> of those reasons, I see no particular problem with giving that a
> "generic" behavior, which could be that of deleting the entire
> combining sequence; especially if your interface normally deletes
> sequences as a unit.
> But in any case, the minimal requirement on an editor is that it lets
> you delete (and then retype) enough text to get it back to an
> uncorrupted state.

In the problem I hit, I would nearly be left with two options - never having CANDRABINDU, or always having it preceded by the lone surrogate. Whenever I enter CANDRABINDU, it is preceded by the lone surrogate. Consequently, the option of retyping the sequence is of no avail. Fortunately, in the application where I met the problem, the lone surrogates, and nothing else, get deleted when the file is saved. The problem could very easily be a lot worse.

----

> Catch-22 here. In filtering input to the dialog to prevent it from
> being used to corrupt text, you prevent it from being used to repair
> text.

Interesting. Not very different to having a very roll-stable aeroplane. If you ever do end up upside-down, you have a big problem.

Richard.

From asmus-inc at ix.netcom.com  Sun Oct 4 18:57:15 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Sun, 4 Oct 2015 16:57:15 -0700
Subject: Deleting Lone Surrogates
In-Reply-To: <20151005001413.74e6dae4@JRWUBU2>
References: <20151004140201.21c9f941@JRWUBU2> <20151004203802.2189da64@JRWUBU2> <56119A2C.3060907@ix.netcom.com> <20151005001413.74e6dae4@JRWUBU2>
Message-ID: <5611BCDB.6090903@ix.netcom.com>

An HTML attachment was scrubbed...

From richard.wordingham at ntlworld.com  Sun Oct 4 19:24:40 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 5 Oct 2015 01:24:40 +0100
Subject: Deleting Lone Surrogates
In-Reply-To: <5611A965.3010304@ix.netcom.com>
References: <20151004140201.21c9f941@JRWUBU2> <20151004203025.605f0ae6@JRWUBU2> <20151004223556.770f2c68@JRWUBU2> <5611A965.3010304@ix.netcom.com>
Message-ID: <20151005012440.11ca7b17@JRWUBU2>

On Sun, 4 Oct 2015 15:34:13 -0700 "Asmus Freytag (t)" wrote:

> On 10/4/2015 2:35 PM, Richard Wordingham wrote:
>> I'd much prefer to be able to delete the first character of a grapheme
>> cluster. It's annoying to have to retype 4 characters because one's
>> mistyped the first of the 4 characters in a grapheme cluster.
>> Removing the restriction would be much more useful.
> That makes sense for common typos, less so, for uncommon (hopefully)
> data corruption.

Allowing access within the cluster is generally useful. Providing more access just makes it easier to repair things. One problem is that there isn't a 'suspend shaping' option to allow one to see what one is doing. This matters when canonical combining classes are not available to sort out the ordering of components.

> For some languages, you'll be typing several keystrokes, even if it's
> a single code point; there seems to be limited desire to allow you to
> "edit" the keystrokes.

The creators of the application do not know how many keystrokes were used. A multi-platform application is not likely to take note of what keys were pressed even when this information is available.

> For other languages I would expect a UI design
> to cater to what local custom prefers.

Local custom? 'Local custom' is usually one of the following:

a) pen and ink, possibly with scraper.
b) typewriter and tippex
c) Hacked ASCII (and similar)

Only with complex ligatures would you not have access to each character. The only parallels to what happens now that I can think of that might count as 'custom' are:

1) European 8-bit codes, where letter plus diacritic is treated as a unit.
2) Korean, where one couldn't chop and change the individual jamo.
3) Thai, where a tone mark can severely restrict what scraping can do.

A UI design might respond to loud enough howls of user protest. You may recall Thai howls of protest when the ability to independently delete preposed vowels was lost. Thai may have some complex vowel symbols, but as far as the grapheme clusters go, *Thai* doesn't get more complicated than CVT (consonant, vowel (just one!) and tone). Some of the minority languages in the Thai script might be a bit more complicated.

I do recall SIL's split cursor, which attempted to address the difficulties of navigating through a stack of diacritics. I miss it, even though I never got to grips with all its subtleties.

What I believe is much more the case is that Unicode encourages 'one size fits all'. There are massive *translation* efforts for user interfaces. As to other parts of the text input/output, they are usually separate from the applications. The keyboard is almost totally independent of the application. Fonts are restricted to attempts to provide adequate coverage, but the ideal is that the user provides his own.

I think the LibreOffice search and replace interface says a lot. It has visible support for Japanese - they holler and may well add their own support into the core project - and there are some CTL options which make best sense from the point of view of the Arabic script. The limitations on editing are one of the few places where the UI is under the tight control of the programmers. By and large, they seem to be influenced by a few sources, such as the Unicode technical reports. Refutation awaited.

Now an attitude of 'one size fits all' does get things done. It might be a bit rough, but it's a lot better than nothing.

Richard.

From richard.wordingham at ntlworld.com  Sun Oct 4 19:29:05 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 5 Oct 2015 01:29:05 +0100
Subject: Deleting Lone Surrogates
In-Reply-To: <5611BCDB.6090903@ix.netcom.com>
References: <20151004140201.21c9f941@JRWUBU2> <20151004203802.2189da64@JRWUBU2> <56119A2C.3060907@ix.netcom.com> <20151005001413.74e6dae4@JRWUBU2> <5611BCDB.6090903@ix.netcom.com>
Message-ID: <20151005012905.6fcbb062@JRWUBU2>

On Sun, 4 Oct 2015 16:57:15 -0700 "Asmus Freytag (t)" wrote:

> On 10/4/2015 4:14 PM, Richard Wordingham wrote:
> respect to what to erase or undo.
>>> For sequences that belong to a given language, you can pick the
>>> behavior that makes most sense in them, but for lone surrogates, by
>>> definition you are dealing with broken text that doesn't follow any
>>> conventions.
>> Who's 'you'? Customisation is frequently not available. In fact, I
>> don't recall seeing it on offer.
> The UI developer.
> And there's nothing Unicode can do about lack of customizability.

Actually, there is. I believe suggestions and recommendations in the technical reports are quite influential.

Richard.
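As a sketch of the two deletion granularities debated above, assuming Java's built-in java.text.BreakIterator (which only approximates default extended grapheme clusters; ICU's BreakIterator tracks UAX #29 more closely): an editor could delete a whole cluster per keypress, or expose the code-point-level access Richard asks for. The class and method names are invented for illustration.

import java.text.BreakIterator;

public class BackspaceSketch {
    // Typical editor behavior: delete the whole final grapheme cluster.
    static String deleteLastCluster(String s) {
        if (s.isEmpty()) return s;
        BreakIterator b = BreakIterator.getCharacterInstance();
        b.setText(s);
        int start = b.preceding(s.length());
        return s.substring(0, start);
    }

    // Code-point-level access: remove just the final code point, so a
    // mistyped character inside a cluster can be fixed without retyping
    // the rest of the cluster.
    static String deleteLastCodePoint(String s) {
        if (s.isEmpty()) return s;
        int cp = s.codePointBefore(s.length());
        return s.substring(0, s.length() - Character.charCount(cp));
    }
}

Applied to base + two combining marks, deleteLastCluster removes all three code points at once, while deleteLastCodePoint peels off only the final mark.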
From naz at gassiep.com  Mon Oct 5 02:39:23 2015
From: naz at gassiep.com (Naz Gassiep)
Date: Mon, 5 Oct 2015 18:39:23 +1100
Subject: Proposals for Arabic honorifics
Message-ID: <5612292B.8040208@gassiep.com>

Hi all,

We are considering writing a proposal for Arabic honorifics which are missing from Unicode. There are already a few in there, notably U+FDFA and U+FDFB. There are two existing proposals, L2/14-147 and L2/14-152, which each propose additions. L2/14-147 proposes seventeen new characters and L2/14-152 proposes a further two. There are a few other characters that are not included in these proposals, and I was considering preparing a proposal of my own. I will work with a team of people who are willing to contribute time to this work.

We are considering two options:

1. Prepare an additional proposal for the characters that were missing from the existing spec and also from the two proposals mentioned above.
2. Prepare a collating proposal which rolls the two proposals, as well as the others that we feel are missing, into a single proposal.

Currently, we favour the second option. We would ensure that full descriptions, names, character properties, and detailed examples are provided for each character to substantiate its use in modern plain text. We would also suggest code points in line with the existing proposal L2/14-147.

We don't want to step on the toes of the original submitters, Roozbeh Pournader or Lateef Sagar Shaikh. We wish to be clear that we will draw on their existing proposals to the maximum extent possible to ensure that we do not submit a conflicting proposal, but a superset proposal that incorporates their proposals as well as the additional characters we have identified. We have evaluated these two, and a true superset proposal is possible such that no conflicts between either those two proposals or our own will materialize.

Are there any issues that we may face in preparing and submitting our proposal? Any guidance from this mailing list would be highly valued.

Many thanks,
- Naz.

From duerst at it.aoyama.ac.jp  Mon Oct 5 06:50:14 2015
From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=)
Date: Mon, 5 Oct 2015 20:50:14 +0900
Subject: Deleting Lone Surrogates
In-Reply-To: <56117E4F.9010300@ix.netcom.com>
References: <20151004140201.21c9f941@JRWUBU2> <56117E4F.9010300@ix.netcom.com>
Message-ID: <561263F6.1000308@it.aoyama.ac.jp>

On 2015/10/05 04:30, Asmus Freytag (t) wrote:
> On 10/4/2015 6:02 AM, Richard Wordingham wrote:
>> In the absence of a specific tailoring, is the combination of a lone
>> surrogate and a combining mark a user-perceived character? Does a lone
>> surrogate constitute a user-perceived character?
>
> In an editing interface, a lone surrogate should be a user perceived character,
> as otherwise you won't be able to manually delete it. Markus suggests that it be
> treated like an unassigned code point.

In an editing tool (of which an editing interface is a part), a lone surrogate should just be removed! Apparently, that's what happens in Richard's case, but only eventually.

Regards, Martin.
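A hedged sketch of what Martin's "just remove it" policy could look like, again in Java with an invented helper name. The U+FFFD branch shows the alternative of repairing visibly rather than silently; whether such removal should ever happen silently is debated later in the thread.

public class ScrubSketch {
    // Drop each unpaired surrogate, or replace it with U+FFFD so the
    // repair stays visible to the user instead of happening silently.
    static String scrubLoneSurrogates(String s, boolean markWithFFFD) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                out.append(c).append(s.charAt(++i));  // keep well-formed pairs
            } else if (Character.isSurrogate(c)) {
                if (markWithFFFD) out.append('\uFFFD');  // else: drop it
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}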
From samjnaa at gmail.com Mon Oct 5 07:14:52 2015 From: samjnaa at gmail.com (Shriramana Sharma) Date: Mon, 5 Oct 2015 17:44:52 +0530 Subject: Unicode in passwords In-Reply-To: <7BD7361C-AAFD-40D9-9D14-5B41295EAE6D@lboro.ac.uk> References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <000601d0fbff$42881070$c7983150$@gmail.com> <20151001083322.5440cc2a@JRWUBU2> <7BD7361C-AAFD-40D9-9D14-5B41295EAE6D@lboro.ac.uk> Message-ID: I recently came across this bug report where a filesystem encrypted with a Cyrillic script password could not be decrypted at boot time: https://bugzilla.redhat.com/show_bug.cgi?id=681250 -- Shriramana Sharma ???????????? ???????????? From marc.blanchet at viagenie.ca Mon Oct 5 07:30:58 2015 From: marc.blanchet at viagenie.ca (Marc Blanchet) Date: Mon, 05 Oct 2015 08:30:58 -0400 Subject: Unicode in passwords In-Reply-To: References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <000601d0fbff$42881070$c7983150$@gmail.com> <20151001083322.5440cc2a@JRWUBU2> <7BD7361C-AAFD-40D9-9D14-5B41295EAE6D@lboro.ac.uk> Message-ID: On 5 Oct 2015, at 8:14, Shriramana Sharma wrote: > I recently came across this bug report where a filesystem encrypted > with a Cyrillic script password could not be decrypted at boot time: > > https://bugzilla.redhat.com/show_bug.cgi?id=681250 And? From what I understand, this is related to the fact that the OS has two levels of boot/console/installation scripts and the first level is very basic regarding i18n (i.e. only us-ascii is guaranteed to work). Marc. > > > -- > Shriramana Sharma ???????????? > ???????????? From samjnaa at gmail.com Mon Oct 5 08:42:05 2015 From: samjnaa at gmail.com (Shriramana Sharma) Date: Mon, 5 Oct 2015 19:12:05 +0530 Subject: Unicode in passwords In-Reply-To: References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <000601d0fbff$42881070$c7983150$@gmail.com> <20151001083322.5440cc2a@JRWUBU2> <7BD7361C-AAFD-40D9-9D14-5B41295EAE6D@lboro.ac.uk> Message-ID: On 10/5/15, Marc Blanchet wrote: > On 5 Oct 2015, at 8:14, Shriramana Sharma wrote: > >> https://bugzilla.redhat.com/show_bug.cgi?id=681250 > > And? Well the OP did say: I'm researching potential problems and best practices for password policies that allow non-Latin-1 Unicode characters. The link seemed valid food for the research, as was offered FWIW. -- Shriramana Sharma ???????????? ???????????? From marc.blanchet at viagenie.ca Mon Oct 5 08:45:17 2015 From: marc.blanchet at viagenie.ca (Marc Blanchet) Date: Mon, 05 Oct 2015 09:45:17 -0400 Subject: Unicode in passwords In-Reply-To: References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <000601d0fbff$42881070$c7983150$@gmail.com> <20151001083322.5440cc2a@JRWUBU2> <7BD7361C-AAFD-40D9-9D14-5B41295EAE6D@lboro.ac.uk> Message-ID: On 5 Oct 2015, at 9:42, Shriramana Sharma wrote: > On 10/5/15, Marc Blanchet wrote: >> On 5 Oct 2015, at 8:14, Shriramana Sharma wrote: >> >>> https://bugzilla.redhat.com/show_bug.cgi?id=681250 >> >> And? > > Well the OP did say: > > > I'm researching potential problems and best practices for password > policies that allow non-Latin-1 Unicode characters. > > > The link seemed valid food for the research, as was offered FWIW. Sure, but roughly one could conclude from the bug report that only allowing us-ascii is safe, which may not be what could be "best practices" depending on the point of view. Marc.
> > -- > Shriramana Sharma ???????????? > ???????????? From samjnaa at gmail.com Mon Oct 5 09:47:03 2015 From: samjnaa at gmail.com (Shriramana Sharma) Date: Mon, 5 Oct 2015 20:17:03 +0530 Subject: Unicode in passwords In-Reply-To: References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <000601d0fbff$42881070$c7983150$@gmail.com> <20151001083322.5440cc2a@JRWUBU2> <7BD7361C-AAFD-40D9-9D14-5B41295EAE6D@lboro.ac.uk> Message-ID: I had hoped it would be obvious that my reply was directed not at the "best practices" part of the OP, but at the "potential problems" part of it... In any case, I have nothing further to say on this topic. -- Shriramana Sharma ???????????? ???????????? From verdy_p at wanadoo.fr Mon Oct 5 09:51:25 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 5 Oct 2015 16:51:25 +0200 Subject: Deleting Lone Surrogates In-Reply-To: <561263F6.1000308@it.aoyama.ac.jp> References: <20151004140201.21c9f941@JRWUBU2> <56117E4F.9010300@ix.netcom.com> <561263F6.1000308@it.aoyama.ac.jp> Message-ID: Not silently! Even if this removal is required to go on editing, the user must be notified, as it may occur in unedited parts of the file (and it may be the sign that the document is not fully plain text, so the user should not save the edited file). If this is caused by a quirk in the user input (a defect of the input mode or keyboard layout), there should be a notification. But for a general purpose editor that allows editing files including binary ones (e.g. Emacs), it is best to NOT drop those lone surrogates at all, and effectively treat them in isolation for ALL purposes (the DELETE key should not delete more than this lone surrogate; it may be necessary to adjust the cursor position after the deletion if the editor does not support placing the cursor in the middle of a combining sequence, but a LONE surrogate + a combining character should still be treated as two separate clusters, and the cursor or selection should be placeable between the lone surrogate and the combining mark). Note that file formats that contain binary parts and plain text parts do exist, e.g. media files that contain a final plain text section for metadata or for some XML data signature: it is safe to edit that final part in a text editor, provided that it does not silently change the encoding of the binary part. In summary, I do not like the idea of silently dropping lone surrogates in editors. If the editor needs it because it cannot safely handle binary parts, the notification will say to the user that he should not use that editor and choose something else, or it will allow the user to select another appropriate file encoding to edit the file safely. The user should not save the file blindly, as it will be corrupted silently. Doing otherwise would be a security issue. And this remark extends to all other protocols using plain text input; lone surrogates should not be dropped silently (unless explicitly requested, for example in a maintenance cleanup or repair): if the lone surrogate violates the further processing, the only safe option is to reject the whole text and report the error if text data is required but missing. 2015-10-05 13:50 GMT+02:00 Martin J. Dürst : > On 2015/10/05 04:30, Asmus Freytag (t) wrote: > >> On 10/4/2015 6:02 AM, Richard Wordingham wrote: >> >>> In the absence of a specific tailoring, is the combination of a lone >>> surrogate and a combining mark a user-perceived character?
Does a lone >>> surrogate constitute a user-perceived character? >>> >> >> In an editing interface, a lone surrogate should be a user-perceived >> character, >> as otherwise you won't be able to manually delete it. Markus suggests >> that it be >> treated like an unassigned code point. >> > > In an editing tool (of which an editing interface is a part), a lone > surrogate should just be removed! Apparently, that's what happens in > Richard's case, but only eventually. > > Regards, Martin. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From marc.blanchet at viagenie.ca Mon Oct 5 09:59:38 2015 From: marc.blanchet at viagenie.ca (Marc Blanchet) Date: Mon, 05 Oct 2015 10:59:38 -0400 Subject: Unicode in passwords In-Reply-To: References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <000601d0fbff$42881070$c7983150$@gmail.com> <20151001083322.5440cc2a@JRWUBU2> <7BD7361C-AAFD-40D9-9D14-5B41295EAE6D@lboro.ac.uk> Message-ID: On 5 Oct 2015, at 10:47, Shriramana Sharma wrote: > I had hoped it would be obvious that my reply was directed not at the "best > practices" part of the OP, but at the "potential problems" part of > it... Sure. My comment was also just informative, not targeted at your comment, but at the fact that "best practices" may not be "us-ascii" only if you want to be i18n. Marc. > In any case, I have nothing further to say on this topic. > > -- > Shriramana Sharma ???????????? > ???????????? From bortzmeyer at nic.fr Mon Oct 5 10:12:00 2015 From: bortzmeyer at nic.fr (Stephane Bortzmeyer) Date: Mon, 5 Oct 2015 17:12:00 +0200 Subject: Unicode in passwords In-Reply-To: <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> Message-ID: <20151005151200.GA7379@laperouse.bortzmeyer.org> On Wed, Sep 30, 2015 at 04:15:30PM -0700, Clark S. Cox III wrote a message of 73 lines which said: > You really wouldn't want "Schlüssel" and "Schlüssel" being different > passwords, would you? (assuming that my mail client and/or OS is not > interfering, the first is NFC, while the second is NFD) Hence RFC 7613, mentioned already here by Marc Blanchet, which you really must read if you're interested in Unicode passwords. In that case, the RFC is clear: NFC mandatory (and UTF-8 encoding). 4. Normalization Rule: Unicode Normalization Form C (NFC) MUST be applied to all characters. From doug at ewellic.org Mon Oct 5 10:24:57 2015 From: doug at ewellic.org (Doug Ewell) Date: Mon, 05 Oct 2015 08:24:57 -0700 Subject: Acquiring DIS 10646 Message-ID: <20151005082457.665a7a7059d7ee80bb4d670165c8327d.c32c0ba233.wbe@email03.secureserver.net> I too am puzzled as to what DIS 10646 and C1 control pictures have to do with each other. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From asmus-inc at ix.netcom.com Mon Oct 5 10:57:52 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Mon, 5 Oct 2015 08:57:52 -0700 Subject: Deleting Lone Surrogates In-Reply-To: References: <20151004140201.21c9f941@JRWUBU2> <56117E4F.9010300@ix.netcom.com> <561263F6.1000308@it.aoyama.ac.jp> Message-ID: <56129E00.40909@ix.netcom.com> On 10/5/2015 7:51 AM, Philippe Verdy wrote: > Not silently!
Even if this removal is required to go on editing, the > user must be notified, as it may occur in unedited parts of the > file (and it may be the sign that the document is not fully plain > text, so the user should not save the edited file). > If this is caused by a quirk in the user input (a defect of the input > mode or keyboard layout), there should be a notification. As long as we are discussing, as Richard is, recommendations for implementers, I fully agree with Philippe. Manually editing surrogate corruptions might be something that could be relegated to an "expert mode", but automatic correction without user confirmation ("May we clean up your file?") would indeed be spooky and dangerous. A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenwhistler at att.net Mon Oct 5 12:11:39 2015 From: kenwhistler at att.net (Ken Whistler) Date: Mon, 5 Oct 2015 10:11:39 -0700 Subject: Scope of Unicode Character Properties (was: Re: Deleting Lone Surrogates) In-Reply-To: <20151004203025.605f0ae6@JRWUBU2> References: <20151004140201.21c9f941@JRWUBU2> <20151004203025.605f0ae6@JRWUBU2> Message-ID: <5612AF4B.9040406@att.net> Section 3.5, Properties, of the standard attempts to address this. "Code point properties" are properties of the code points, per se, and clearly do have all code points (U+0000..U+10FFFF) in their scope. An example is the Surrogate code point property, which wouldn't make much sense if it didn't apply to surrogate code points! "Encoded character properties" are properties of the characters themselves -- attributes like Ideographic or Numeric_Value. For completeness, those are given *default* values for all reserved code points (and for noncharacter and PUA code points). In principle, the scope should be all Unicode scalar values: U+0000..U+D7FF, U+E000..U+10FFFF, because it doesn't make much sense to talk about character properties for code points that are ill-formed and which cannot ever actually represent a character. However, in practice, it is simplest to extend the *default* values of encoded character properties to the surrogate code points, so that in the cases where they occur in ill-formed text, APIs and applications have some hope of doing something useful, rather than just reacting exceptionally to featureless singularities embedded in text. Hence, the bullet in the text of the standard: * For each encoded character property there is a mapping from every code point to some value in the set of values associated with that property. There is nothing in the standard, as I read it, that imposes a conformance requirement on any process that would *require* it to interpret an isolated surrogate code point and give it a particular property value. However, it would be reasonable (and permitted) for an API to actually report a default value for a surrogate code point (i.e., treating it more or less like the reserved code point U+50005 that Markus mentioned). Such behavior in a character property API is likely to result in more graceful behavior than simply throwing exceptions. --Ken On 10/4/2015 12:30 PM, Richard Wordingham wrote: > Do all Unicode character properties extend to all codepoints? If not, > how does one tell which do and which don't? ... > Richard.
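Java's built-in Character class is one API that already takes this route: it reports a general category for every code point rather than throwing. A minimal sketch (getType(int) and isDefined(int) are standard java.lang.Character methods):

    public class PropertyDefaults {
        public static void main(String[] args) {
            // A surrogate code point has a perfectly well-defined code point
            // property (gc=Cs); an unassigned code point reports the default.
            System.out.println(Character.getType(0xD800) == Character.SURROGATE);   // true
            System.out.println(Character.getType(0x50005) == Character.UNASSIGNED); // true
            System.out.println(Character.isDefined(0x50005));                       // false
        }
    }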
From kenwhistler at att.net Mon Oct 5 14:32:45 2015 From: kenwhistler at att.net (Ken Whistler) Date: Mon, 5 Oct 2015 12:32:45 -0700 Subject: Why Nothing Ever Goes Away (was: Re: Acquiring DIS 10646) In-Reply-To: <20151005082457.665a7a7059d7ee80bb4d670165c8327d.c32c0ba233.wbe@email03.secureserver.net> References: <20151005082457.665a7a7059d7ee80bb4d670165c8327d.c32c0ba233.wbe@email03.secureserver.net> Message-ID: <5612D05D.7000407@att.net> On 10/5/2015 8:24 AM, Doug Ewell wrote: > I too am puzzled as to what DIS 10646 and C1 control pictures have to do > with each other. > What an *excellent* cue to start a riff on arcane Unicode history! First, let me explain what I think Sean Leonard's concern here is. 1. On 10/4/2015 5:30 AM, Sean wrote: "I proposed adding C1 Control Pictures to Unicode. ... The requirement is that all glyphs for U+0000 - U+00FF be graphically distinct." Ah, but Sean has noticed that of all the representative glyphs we use in the current code charts for C1 control codes, exactly *3* of them share an odd glyph. U+0080, U+0081, and U+0099 use the same dotted box with an "XXX" in it. That creates a conflict with the requirement that Sean has stated for glyphs for *graphic symbols for* control codes, presumably for addition to the 2400 Control Pictures block and some extensions elsewhere, each with a visually distinct representation. 2. The Unicode code charts are (deliberately) vague about U+0080, U+0081, and U+0099. All other C1 control codes have aliases to the ISO 6429 set of control functions, but in ISO 6429, those three control codes don't have any assigned functions (or names). And because the C1 aliases in the Unicode code charts are (deliberately) based on ISO 6429, U+0080, U+0081, and U+0099 are only identified as "<control>", with no alias in the charts, and with an arbitrary "XXX" box glyph. 3. Concerned about this gap, Sean did some due diligence research on the web, and turned up documentation pages such as: http://utopia.knoware.nl/users/eprebel/Communication/CharacterSets/Controls.html Pertinent to this discussion is the section for C1 on that page which (incorrectly) includes "DIS 10646" in the list of "Standards". More to the point, the entries for the 3 C1 code points in question are documented as: 08/00 ... PAD PADding character (only in DIS 10646) 08/01 ... HOP High Octet Preset (only in DIS 10646) ... 09/09 ... SGCI Single Graphic Character Introducer (only in DIS 10646) Aha! Hence the need to track down a copy of DIS 10646 (meaning in actuality, the appropriately numbered WG2 N666, "DIS 10646", dated November 4, 1990). That was actually what became DIS 1, the DIS that failed, the DIS that led to the *second* DIS 10646, which was the basis of the Unicode/10646 merger. But I digress... ;-) 4. O.k., so with that connection out of the way, I can proceed to the topic of this thread: Why Nothing Ever Goes Away. PAD, HOP, and SGCI were arcane, proposed architectural additions to the early drafts of 10646, from the days when 10646 was still slavishly following the ISO 2022 framework, and was avoiding C0 and C1 byte values in all representations, including single-, double-, triple-, and quadruple-byte forms for characters. HOP was one of those half-baked terminal protocol byte compression concoctions. The idea was that since some commonly used blocks of characters would require double-byte representation but would all share the same "high octet", you could send a HOP and then a bunch of low octets down the line. In effect, it was intended as a script switcher.
SGCI was complementary to that. It would let you introduce a sequence of multiple octets for a single character, without having to switch out of your high octet preset mode. PAD I forget the exact details of. Something to do with padding out character representations into fixed length. All of these were firmly rejected in the merger discussions and the failed DIS vote. Actually, they were down in the noise compared to major issues like CJK plane swapping and such, but there clearly was no need for 10646 to invent new control functions like these, and the early drafts of the Unicode Standard had nothing of the sort. So these were gone in DIS 1.2 for 10646. They were *never* published as part of ISO 10646-1:1993 (or any later edition). Nor were they ever published in an ISO control function standard. Nor were they ever published in the Unicode Standard, of course. They were never standard *anything* -- just ill-advised concept functions that later got dropped in the drafts. But wait! If these disappeared from any standard draft way back in 1991(!), why are we still talking about them? Why are they still documented on web pages for C1 control characters in 2015, 24 years later? Funny you should ask! The problem is that they went viral. And that in an age before anybody really knew what "going viral" even meant. ;-) The first problem is that a bunch of mnemonics for characters were published in an RFC. And those mnemonics included characters from early drafts of 10646. The notorious document in question is RFC 1345: Simonsen, K., "Character Mnemonics & Character Sets", June 1992. Go ahead, it is still there: https://tools.ietf.org/rfc/rfc1345.txt And that has entries for the non-existent control codes, which by the time RFC 1345 was published, had *already* been removed from the 10646 drafts. To wit: PA 0080 PADDING CHARACTER (PAD) HO 0081 HIGH OCTET PRESET (HOP) GC 0099 SINGLE GRAPHIC CHARACTER INTRODUCER (SGCI) RFC 1345 was, in turn, referenced by other important IETF documents, including the important RFC 2070, "Internationalization of the Hypertext Markup Language", which defines the syntax for character entity names. Entity names for PAD, HOP, and SGCI then found their way into Java and other implementations. They ended up referenced in tables supporting regular expressions. And so on. Somehow they had become the walking dead control functions. This came back around to the Unicode Standard about the time the U+1F514 BELL and U+0007 alias BELL name collision issue hit the fan. The UTC response to this problem was to augment the formal name aliases to include all widely used control function names and abbreviations, so that testing for name collisions in that name space would prevent any future BELL/BELL issues. See, in particular, the related PRI on this topic for Unicode 6.1.0: http://www.unicode.org/review/pri202/ which explicitly mentions U+0080, U+0081, and U+0099 and their aliases, because of a need for backward compatibility to then-existing usage in Perl 5. The outcome of that PRI was to add a bunch of formal name aliases, *including* ones for PAD, HOP, and SGCI (or SGC). To wit, from NameAliases.txt: ======================================================= # PADDING CHARACTER and HIGH OCTET PRESET represent # architectural concepts initially proposed for early # drafts of ISO/IEC 10646-1. They were never actually # approved or standardized: hence their designation # here as the "figment" type. 
Formal name aliases # (and corresponding abbreviations) for these code # points are included here because these names leaked # out from the draft documents and were published in # at least one RFC whose names for code points was # implemented in Perl regex expressions. 0080;PADDING CHARACTER;figment 0080;PAD;abbreviation 0081;HIGH OCTET PRESET;figment 0081;HOP;abbreviation # SINGLE GRAPHIC CHARACTER INTRODUCER is another # architectural concept from early drafts of ISO/IEC 10646-1 # which was never approved and standardized. 0099;SINGLE GRAPHIC CHARACTER INTRODUCER;figment 0099;SGC;abbreviation ============================================================= Because of stability guarantees, however, NameAliases.txt is a write-once, read-only, unerasable file. For better or for worse, we are now stuck forever with those name aliases for U+0080, U+0081, and U+0099, *even though* the relevant control functions were never, ever actually standardized or used anywhere. Think of them as just part of the arcane mysteries now: odd labels for the three code points, which (nearly) nobody understands. Another of the many Unicode just so stories. :-) --Ken From richard.wordingham at ntlworld.com Mon Oct 5 14:58:48 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 5 Oct 2015 20:58:48 +0100 Subject: Deleting Lone Surrogates In-Reply-To: References: <20151004140201.21c9f941@JRWUBU2> <56117E4F.9010300@ix.netcom.com> <561263F6.1000308@it.aoyama.ac.jp> Message-ID: <20151005205848.1b622a08@JRWUBU2> On Mon, 5 Oct 2015 16:51:25 +0200 Philippe Verdy wrote: > 2015-10-05 13:50 GMT+02:00 Martin J. D?rst : > > > In an editing tool (of which an editing interface is a part of), a > > lone surrogate should just be removed! Apparently, that's what > > happens in Richard's case, but only eventually. > Not silently ! Even if this removal is required to go on editing, > this must be notified to the user as it may occur in unedited parts > of the file (and it may be the sign that the document is not fully > plain text, so the user should not save the edited file) > If this is caused by a quirk in the user input (defect of the input > mode or keyboard layout), there should be a notification. The lone surrogates (as I surmise) in this case are caused by the user input being misinterpreted. The sequence of strings delivered to a program running X receiving the same sequence of keystrokes is U+1148F, U+114C0, U+0008, U+114BF, and I have no reason to doubt that the offending program is receiving the same sequence. My working hypothesis is that this is being simplified to U+1148F, U+D805, U+114BF; the presence of U+D805 is a program error. I can reproduce the problem in a previously empty file. Now, on Windows, old MS keyboards at least deliver supplementary characters in a pair of WM_CHAR messages. If one of these ligatures were corrupted so that only the first of the messages was delivered, it is not obvious to me how a program would readily detect the omission. It would only become obvious when the start of the next *character* was received. Richard. 
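Richard's last point can be made concrete: a consumer of UTF-16 code units (which is what successive WM_CHAR messages carry) cannot classify a high surrogate until the next unit, or the end of input, arrives. A sketch of that deferred check (the class below is hypothetical, not any Windows API):

    final class CodeUnitChecker {
        private int pendingHigh = -1; // a high surrogate waiting for its mate

        void feed(char unit) {
            if (pendingHigh >= 0) {
                if (Character.isLowSurrogate(unit)) {
                    // The pair completes a supplementary character.
                    int cp = Character.toCodePoint((char) pendingHigh, unit);
                    System.out.printf("character U+%04X%n", cp);
                    pendingHigh = -1;
                    return;
                }
                // Only now is the earlier unit known to be a lone surrogate.
                System.out.printf("lone surrogate U+%04X%n", pendingHigh);
                pendingHigh = -1;
            }
            if (Character.isHighSurrogate(unit)) {
                pendingHigh = unit;       // cannot decide yet
            } else if (Character.isLowSurrogate(unit)) {
                System.out.printf("lone surrogate U+%04X%n", (int) unit);
            } else {
                System.out.printf("character U+%04X%n", (int) unit);
            }
        }

        void endOfInput() {               // a trailing high surrogate is also lone
            if (pendingHigh >= 0) System.out.printf("lone surrogate U+%04X%n", pendingHigh);
            pendingHigh = -1;
        }
    }

Feeding it the units of U+1148F, then a bare U+D805, then U+114BF reports the lone U+D805 only when the first unit of U+114BF arrives, which is exactly the delay described above.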
From verdy_p at wanadoo.fr Mon Oct 5 17:37:31 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 6 Oct 2015 00:37:31 +0200 Subject: Why Nothing Ever Goes Away (was: Re: Acquiring DIS 10646) In-Reply-To: <5612D05D.7000407@att.net> References: <20151005082457.665a7a7059d7ee80bb4d670165c8327d.c32c0ba233.wbe@email03.secureserver.net> <5612D05D.7000407@att.net> Message-ID: 2015-10-05 21:32 GMT+02:00 Ken Whistler : > > On 10/5/2015 8:24 AM, Doug Ewell wrote: > >> I too am puzzled as to what DIS 10646 and C1 control pictures have to do >> with each other. >> >> > What an *excellent* cue to start a riff on arcane Unicode history! > > First, let me explain what I think Sean Leonard's concern here is. > > 1. On 10/4/2015 5:30 AM, Sean wrote: "I proposed adding C1 Control > Pictures to > Unicode. ... The requirement is that all glyphs for U+0000 - U+00FF be > graphically distinct." > > Ah, but Sean has noticed that of all the representative glyphs we use > in the current code charts for C1 control codes, exactly *3* of them share > an odd glyph. U+0080, U+0081, and U+0099 use the same dotted box > with an "XXX" in it. That creates a conflict with the requirement that > Sean has stated for glyphs for *graphic symbols for* control codes, > presumably for addition to the 2400 Control Pictures block and some > extensions elsewhere, each with a visually distinct representation. Good remark, but that does not mean that we really need to encode new code points for C1 control pictures. What is really needed is to change their representative glyph in charts: their dotted box should better include "0080", "0081" and "0099" in them rather than "XXX", if those C1 positions don't have any *agreed* ASCII-letters aliases (though their common abbreviations are listed in the English Wikipedia article as "PAD", "HOP", and "SGCI" respectively). https://en.wikipedia.org/wiki/C0_and_C1_control_codes Note this old L2 discussion note for their unspecified aliases by Ken Whistler: http://www.unicode.org/L2/L2011/11281-control-aliases.txt -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Mon Oct 5 17:57:23 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 6 Oct 2015 00:57:23 +0200 Subject: Why Nothing Ever Goes Away (was: Re: Acquiring DIS 10646) In-Reply-To: References: <20151005082457.665a7a7059d7ee80bb4d670165c8327d.c32c0ba233.wbe@email03.secureserver.net> <5612D05D.7000407@att.net> Message-ID: Also the aliases for C1 controls were formally registered in 1983 only for the two ranges U+0084..U+0097 and U+009B..U+009F for ISO 6429. So the abbreviation (and name) aliases given to: - U+0082 (BPH = BREAK PERMITTED HERE), - U+0083 (NBH = NO BREAK HERE), - U+0098 (SOS = START OF STRING) and - U+009A (SCI = SINGLE CHARACTER INTRODUCER) are also debatable (but they may have other sources than just ISO 6429, probably from IBM for its proprietary EBCDIC-based systems). In that case the same sources could have given names/abbreviations to U+0080, U+0081 and U+0099. The problem could be that their late mapping from EBCDIC to some ISO 8859-compatible encoding was still fuzzy before some date, or was also fuzzy within EBCDIC-based encodings themselves across their versions or in their implementations and applications on those systems. Anyway, those aliased names and abbreviations have been published by Unicode and should remain stable now.
2015-10-06 0:37 GMT+02:00 Philippe Verdy : > 2015-10-05 21:32 GMT+02:00 Ken Whistler : > >> >> On 10/5/2015 8:24 AM, Doug Ewell wrote: >> >>> I too am puzzled as to what DIS 10646 and C1 control pictures have to do >>> with each other. >>> >>> >> What an *excellent* cue to start a riff on arcane Unicode history! >> >> First, let me explain what I think Sean Leonard's concern here is. >> >> 1. On 10/4/2015 5:30 AM, Sean wrote: "I proposed adding C1 Control >> Pictures to >> Unicode. ... The requirement is that all glyphs for U+0000 - U+00FF be >> graphically distinct." >> >> Ah, but Sean has noticed that of all the representative glyphs we use >> in the current code charts for C1 control codes, exactly *3* of them share >> an odd glyph. U+0080, U+0081, and U+0099 use the same dotted box >> with an "XXX" in it. That creates a conflict with the requirement that >> Sean has stated for glyphs for *graphic symbols for* control codes, >> presumably for addition to the 2400 Control Pictures block and some >> extensions elsewhere, each with a visually distinct representation. > > > Good remark, but that does not mean that we really need to encode new code > points for C1 control pictures. > > What is really needed is to change their representative glyph in charts: > their dotted box should better include "0080", "0081" and "0099" in them > rather than "XXX", if those C1 positions don't have any *agreed* > ASCII-letters aliases (though their common abbreviations are listed in the > English Wikipedia article as "PAD", "HOP", and "SGCI" respectively). > > https://en.wikipedia.org/wiki/C0_and_C1_control_codes > > Note this old L2 discussion note for their unspecified aliases by Ken > Whistler: > > http://www.unicode.org/L2/L2011/11281-control-aliases.txt > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Mon Oct 5 18:26:16 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 6 Oct 2015 01:26:16 +0200 Subject: Scope of Unicode Character Properties (was: Re: Deleting Lone Surrogates) In-Reply-To: <5612AF4B.9040406@att.net> References: <20151004140201.21c9f941@JRWUBU2> <20151004203025.605f0ae6@JRWUBU2> <5612AF4B.9040406@att.net> Message-ID: 2015-10-05 19:11 GMT+02:00 Ken Whistler : > However, it would be reasonable (and permitted) for an API to actually > report a default value for a surrogate code point (i.e., treating it more > or less like the reserved code point U+50005 that Markus mentioned). > Unassigned (reserved) code points, when followed by an assigned combining mark, would still be treated as starters of a combining sequence by default. This is not (IMHO) desirable for lone surrogates, which should better be handled in isolation, independently of what follows them. My opinion is that they should be treated like new line controls, so that the combining mark after one will also be separated into a defective combining sequence without any starter (e.g. 000A 0302 creates two clusters; this should be the same for D800 0302. D800 will have no defined glyph to render, but the glyph for U+FFFD may be displayed, or just a ".notdef" tofu box). Now for break opportunities, those lone surrogates should not create a newline or paragraph break opportunity, but they may create a word break opportunity to allow their easy separation and selection by a double-click on this tofu in an editor; they may even create a syllable break opportunity before and after them to allow wrapping long lines there.
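One way to prototype that tailoring (to be clear: this is a hypothetical rule, not what UAX #29 currently specifies): refuse to let a combining mark attach to a surrogate, exactly as it cannot attach across a newline. A grossly simplified sketch in Java:

    // Hypothetical, simplified boundary test between two adjacent code
    // points: a nonspacing mark normally attaches to what precedes it,
    // but not to a newline -- and, under this tailoring, not to a lone
    // surrogate either, so D800 + U+0302 stays two separate clusters.
    static boolean boundaryBetween(int a, int b) {
        if (Character.getType(b) != Character.NON_SPACING_MARK) return true;
        return a == '\n' || Character.getType(a) == Character.SURROGATE;
    }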
Those adaptations, however, are not described at all in the annexes about text segmentation. So those surrogates (which are permanently assigned) could have their own code point properties more formally defined. In my opinion, handling them like U+0000 is much better than handling them like U+50005, which should stay reserved and be handled as a standard starter with default combining class 0. Also, those lone surrogates should be Bidi-neutral (imagine they occur in the middle of some Arabic text: they should probably not change the direction of the surrounding text and should not alter the embedding context). -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Mon Oct 5 19:08:26 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 6 Oct 2015 02:08:26 +0200 Subject: Unicode in passwords In-Reply-To: <20151005151200.GA7379@laperouse.bortzmeyer.org> References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <20151005151200.GA7379@laperouse.bortzmeyer.org> Message-ID: NFC is probably not the best choice for passwords. It should probably be NFKC. Look also at the recent proposed update of UAX #31, and consider the special case where an application does not want passwords to be case-significant, but accepts something other than just ASCII letters: it will then be necessary to apply some closure for NFKC. Finally, note that passwords are not necessarily single identifiers (whitespaces and word separators are accepted, but whitespaces should require special handling with trimming (at both ends) and compression of multiple occurrences). It would also be necessary to make sure that acceptable passwords at least begin with an XID_Start character. Maybe all this discussion could be a new section in UAX #31 to take into account the possible presence of whitespaces (for "pass phrases" which are not really "identifiers") in "Medial" positions: define a profile as described in UAX #31 to add whitespaces to "Medial" and remove them from excluded characters, and possibly extend the set of "Start" to more than just XID_Start (e.g. you could use some punctuation like '!' or a mathematical sign like '+', and possibly also accept non-decimal digits that are preserved after NFKC closure).
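For illustration, this kind of profile maps naturally onto ICU4J's Normalizer2, which exposes NFKC plus case folding as a single stable transform. A sketch only (the whitespace handling is the suggestion above, not anything mandated by UAX #31 or RFC 7613, and the class name is invented):

    import com.ibm.icu.text.Normalizer2;

    public final class PassphrasePrep {
        // Sketch only: trim both ends, compress internal whitespace runs,
        // then apply NFKC_Casefold (stable, unlike mapping to "lowercase").
        public static String prepare(String raw) {
            String compressed = raw.trim().replaceAll("\\s+", " ");
            return Normalizer2.getNFKCCasefoldInstance().normalize(compressed);
        }
    }

The case-folding step would of course only be applied when the application really wants passwords to be case-insensitive.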
URL: From verdy_p at wanadoo.fr Mon Oct 5 19:23:45 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 6 Oct 2015 02:23:45 +0200 Subject: Unicode in passwords In-Reply-To: References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <20151005151200.GA7379@laperouse.bortzmeyer.org> Message-ID: Also some people may want to use now emojis within their passwords or pass phrases (they are now very common on most smartphones and layouts for tactile screens or in instant messaging applications used on desktops, using mouse clicks or taps for selecting them). But I would not recommend them for encrypting bootable disks or in BIOS/UEFI boot environments without support for extended input methods and rich graphics to render them on basic text consoles, unless they are part of a national encoding standard and supported natively). For boot environments, you'll be limited by the local hardware support, but if there's such a support (keyboard or font), it may be helpful to include some extra symbols, to block remote accesses without this native support (e.g. on Japanese systems, you could use the extra keys found only on Japanese keyboards and you won't be able to control the system without the appropriate device recognized in the booting environment). 2015-10-06 2:08 GMT+02:00 Philippe Verdy : > NFC is probably not the best choice for passwords. It should probably be > NFKC > > Look also in the recent proposed update for UAX #31, and consider the > special case where an application does not want passwords to be > case-significant, but accepts using something else than just ASCII letters: > it will be then necessry to apply some closure for NFKC. > Finally note that passwords are not necessarily single identifiers > (whitespaces and word separators are accepted, but whitespaces should > require special handling with trimming (at both ends) and compression of > multiple occurences. It would also be necessay to make sure that acceptable > passwords at least begin with an XID_Start character. > > May be all this discussion could be a new section in UAX #31 to take into > account the possible presence of whitespaces (for "pass phrases" which are > not really "identifiers") in "Medial" positions : define a profile as > described in UAX #31 to add whitespaces in "Medial" and remove them from > excluded characters, and possibly extend the set of "Start" to more than > just XID_Start (e.g. you could use some punctuation like '!' or > mathematical sign like '+', and possibly also accept non-decimal digits > that are preserved after NFKC closure) > > > > 2015-10-05 17:12 GMT+02:00 Stephane Bortzmeyer : > >> On Wed, Sep 30, 2015 at 04:15:30PM -0700, >> Clark S. Cox III wrote >> a message of 73 lines which said: >> >> > You really wouldn?t want ?Schl?ssel? and ?Schl?ssel? being different >> > passwords, would you? (assuming that my mail client and/or OS is not >> > interfering, the first is NFC, while the second is NFD) >> >> Hence the RFC 7613, mentioned already here by Marc Blanchet, that you >> must really read if you're interesed in Unicode passwords. >> >> In that case, the RFC is clear: NFC mandatory (and UTF-8 encoding). >> >> 4. Normalization Rule: Unicode Normalization Form C (NFC) MUST be >> applied to all characters. >> >> > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From duerst at it.aoyama.ac.jp Mon Oct 5 22:37:58 2015 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Tue, 6 Oct 2015 12:37:58 +0900 Subject: Unicode in passwords In-Reply-To: References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <000601d0fbff$42881070$c7983150$@gmail.com> Message-ID: <56134216.3040401@it.aoyama.ac.jp> Some additional concerns: - Input methods for Chinese, Japanese,... need visual feedback to check that the correct Han character was selected. That may show (some parts of) the password to bystanders. - Length limitations of 8 bytes are few and far between these days, but they still exist. Even where they are gone, they may have been replaced with "safe" limitations, say e.g. 50 bytes. That may still be pretty restrictive for some languages when using UTF-8. - There may occasionally be different length limitations for different kinds of access with the same password. That can create very difficult situations where the length limitation cuts off part of a UTF-8 byte sequence. - Some interfaces try to estimate the 'quality' of a password on password creation. Short passwords, or passwords with only lower-case Latin letters, may be rejected, others labeled as 'medium safe', and so on. A password with lots of bytes may be labeled as 'excellent' even though it consists of characters all taken from the same small script, and thus has rather low entropy. Of course, there's the effect that at least for a while, the bad guys may think it's too bothersome to try non-ASCII passwords, so that may temporarily make them somewhat safer. Regards, Martin. On 2015/10/01 14:01, Mark Davis ☕️ wrote: > I've heard some concerns, mostly around the UI for people typing in > passwords; that they get frustrated when they have to type their password > on different devices: > > 1. A device may not have keyboard mappings with all the keys for their > language. > 2. The keyboard mappings across devices vary where they put keys, > especially for minority script characters using some pattern of > shift/alt/option/etc.. So the pattern of keys that they use on one may be > different than on another. > 3. People are often 'blind' to the characters being entered: they just > see a dot, for example. If the keyboards for their language are not > standard, then that makes it difficult. > 4. Even if they see, for an instant, the character they type, if the > device doesn't have a font for their language's characters, it may be just > a box. > 5. Even if those are not true, the glyph may not be distinctive enough > if the size is too small. > > > > Mark > > *« Il meglio è l'inimico del bene »* > > On Thu, Oct 1, 2015 at 6:11 AM, Jonathan Rosenne > wrote: > >> For languages such as Java, passwords should be handled as byte arrays >> rather than strings. This may make it difficult to apply normalization. >> >> >> >> Jonathan Rosenne >> >> >> >> *From:* Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of *Clark >> S. Cox III >> *Sent:* Thursday, October 01, 2015 2:16 AM >> *To:* Hans Åberg >> *Cc:* unicode at unicode.org; John O'Conner >> *Subject:* Re: Unicode in passwords >> >> >> >> >> >> On 2015/09/30, at 13:29, Hans Åberg wrote: >> >> >> >> >> >> On 30 Sep 2015, at 18:33, John O'Conner wrote: >> >> Can you recommend any documents to help me understand potential issues (if >> any) for password policies and validation methods that allow characters >> from more "exotic" portions of the Unicode space?
>> >> >> On UNIX computers, one computes a hash (like SHA-256), which is then used >> to authenticate the password up to a high probability. The hash is stored >> in the open, but it is not known how to compute the password from the hash, >> so knowing the hash does not easily allow authentication. >> >> So if the password is >> >> >> >> ? normalized and then ? >> >> >> >> encoded in say UTF-8 and then hashed, it would seem to take care of most >> problems. >> >> >> >> You really wouldn't want "Schlüssel" and "Schlüssel" being different >> passwords, would you? (assuming that my mail client and/or OS is not >> interfering, the first is NFC, while the second is NFD) >> > From duerst at it.aoyama.ac.jp Mon Oct 5 22:39:36 2015 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Tue, 6 Oct 2015 12:39:36 +0900 Subject: Unicode in passwords In-Reply-To: <000601d0fbff$42881070$c7983150$@gmail.com> References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <000601d0fbff$42881070$c7983150$@gmail.com> Message-ID: <56134278.6010508@it.aoyama.ac.jp> On 2015/10/01 13:11, Jonathan Rosenne wrote: > For languages such as Java, passwords should be handled as byte arrays rather than strings. This may make it difficult to apply normalization. Well, they should be received from the user interface as strings, then normalized, then converted to byte arrays using a well-defined single encoding. Somewhat tedious, but hopefully not difficult. Regards, Martin. From yoriyuki.yamagata at aist.go.jp Mon Oct 5 22:57:51 2015 From: yoriyuki.yamagata at aist.go.jp (Yoriyuki Yamagata) Date: Tue, 6 Oct 2015 12:57:51 +0900 Subject: Unicode in passwords In-Reply-To: References: Message-ID: <57223A23-037F-4925-B2B1-EA1F4930E3ED@aist.go.jp> Dear John, FYI, IETF is working on this issue. See Internet Draft https://tools.ietf.org/html/draft-ietf-precis-saslprepbis-17 based on PRECIS framework RFC 7564 https://tools.ietf.org/html/rfc7564 Best, > 2015/10/01 1:33?John O'Conner ????? > > I'm researching potential problems and best practices for password policies that allow non-Latin-1 Unicode characters. My searching of the unicode.org site showed me a general security considerations document (UTR #36) but nothing specific for password policies using Unicode. > > Can you recommend any documents to help me understand potential issues (if any) for password policies and validation methods that allow characters from more "exotic" portions of the Unicode space? > > Best regards, > John O'Conner > ? Yoriyuki Yamagata National Institute of Advanced Science and Technology (AIST), Senior Researcher http://staff.aist.go.jp/yoriyuki.yamagata/en/ From bortzmeyer at nic.fr Tue Oct 6 03:48:14 2015 From: bortzmeyer at nic.fr (Stephane Bortzmeyer) Date: Tue, 6 Oct 2015 10:48:14 +0200 Subject: Unicode in passwords In-Reply-To: <57223A23-037F-4925-B2B1-EA1F4930E3ED@aist.go.jp> References: <57223A23-037F-4925-B2B1-EA1F4930E3ED@aist.go.jp> Message-ID: <20151006084814.GA17135@laperouse.bortzmeyer.org> On Tue, Oct 06, 2015 at 12:57:51PM +0900, Yoriyuki Yamagata wrote a message of 33 lines which said: > FYI, IETF is working on this issue.
See Internet Draft > https://tools.ietf.org/html/draft-ietf-precis-saslprepbis-17 based > on PRECIS framework RFC 7564 https://tools.ietf.org/html/rfc7564 As already mentioned on that list, the draft is no longer a draft; it was published as an RFC, RFC 7613, two months ago. From mark at macchiato.com Tue Oct 6 04:21:42 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Tue, 6 Oct 2015 11:21:42 +0200 Subject: Unicode in passwords In-Reply-To: <20151006084814.GA17135@laperouse.bortzmeyer.org> References: <57223A23-037F-4925-B2B1-EA1F4930E3ED@aist.go.jp> <20151006084814.GA17135@laperouse.bortzmeyer.org> Message-ID: While I think that RFC is useful, it has been interesting just how many of the problems recounted on this list go far beyond it, often having to do with UI issues. It would be useful to have a paper somewhere that organizes all of the problems presented here, and maybe makes a stab at describing techniques for handling them. Mark *« Il meglio è l'inimico del bene »* On Tue, Oct 6, 2015 at 10:48 AM, Stephane Bortzmeyer wrote: > On Tue, Oct 06, 2015 at 12:57:51PM +0900, > Yoriyuki Yamagata wrote > a message of 33 lines which said: > > > FYI, IETF is working on this issue. See Internet Draft > > https://tools.ietf.org/html/draft-ietf-precis-saslprepbis-17 based > > on PRECIS framework RFC 7564 https://tools.ietf.org/html/rfc7564 > > As already mentioned on that list, the draft is no longer a draft; it > was published as an RFC, RFC 7613, two months ago. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jcb+unicode at inf.ed.ac.uk Tue Oct 6 05:25:40 2015 From: jcb+unicode at inf.ed.ac.uk (Julian Bradfield) Date: Tue, 6 Oct 2015 11:25:40 +0100 Subject: Unicode in passwords References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <20151005151200.GA7379@laperouse.bortzmeyer.org> Message-ID: On 2015-10-06, Philippe Verdy wrote: > Finally, note that passwords are not necessarily single identifiers > (whitespaces and word separators are accepted, but whitespaces should > require special handling with trimming (at both ends) and compression of > multiple occurrences). Why would you trim or compress whitespace? Using multiple spaces seems a perfectly legitimate way of making a password harder to guess. -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From lists+unicode at seantek.com Tue Oct 6 06:14:15 2015 From: lists+unicode at seantek.com (Sean Leonard) Date: Tue, 6 Oct 2015 04:14:15 -0700 Subject: Why Nothing Ever Goes Away In-Reply-To: <5612D05D.7000407@att.net> References: <20151005082457.665a7a7059d7ee80bb4d670165c8327d.c32c0ba233.wbe@email03.secureserver.net> <5612D05D.7000407@att.net> Message-ID: <5613AD07.9060401@seantek.com> On 10/5/2015 12:32 PM, Ken Whistler wrote: > > On 10/5/2015 8:24 AM, Doug Ewell wrote: >> I too am puzzled as to what DIS 10646 and C1 control pictures have to do >> with each other. >> > > What an *excellent* cue to start a riff on arcane Unicode history! > > First, let me explain what I think Sean Leonard's concern here is. > > 1. On 10/4/2015 5:30 AM, Sean wrote: "I proposed adding C1 Control > Pictures to > Unicode. ... The requirement is that all glyphs for U+0000 - U+00FF be > graphically distinct." > [...] > 2. The Unicode code charts are (deliberately) vague about U+0080, U+0081, > and U+0099. > [...] > 3.
Concerned about this gap, Sean did some due diligence research on the > web[...] Hence the need to track down a copy of DIS 10646 (meaning in > actuality, the appropriately numbered WG2 N666, "DIS 10646", dated > November 4, 1990). ??????????!!! ??????????!???????????????? ???????????? -Sean From lists+unicode at seantek.com Tue Oct 6 07:24:06 2015 From: lists+unicode at seantek.com (Sean Leonard) Date: Tue, 6 Oct 2015 05:24:06 -0700 Subject: Why Nothing Ever Goes Away In-Reply-To: References: <20151005082457.665a7a7059d7ee80bb4d670165c8327d.c32c0ba233.wbe@email03.secureserver.net> <5612D05D.7000407@att.net> Message-ID: <5613BD66.3080707@seantek.com> > 2. The Unicode code charts are (deliberately) vague about U+0080, U+0081, > and U+0099. All other C1 control codes have aliases to the ISO 6429 > set of control functions, but in ISO 6429, those three control codes > don't > have any assigned functions (or names). On 10/5/2015 3:57 PM, Philippe Verdy wrote: > Also the aliases for C1 controls were formally registered in 1983 only > for the two ranges U+0084..U+0097 and U+009B..U+009F for ISO 6429. If I may, I would appreciate another history lesson: In ISO 2022 / 6429 land, it is apparent that the C1 controls are mainly aliases for ESC 4/0 - 5/15. ( @ through _ ) This might vary depending on what is loaded into the C1 register, but overall, it just seems like saving one byte. Why was C1 invented in the first place? And, why did Unicode deem it necessary to replicate the C1 block at 0x80-0x9F, when all of the control characters (codes) were equally reachable via ESC 4/0 - 5/15? I understand why it is desirable to align U+0000 - U+007F with ASCII, and maybe even U+0000 - U+00FF with Latin-1 (ISO-8859-1). But maybe Windows-1252, MacRoman, and all the other non-ISO-standardized 8-bit encodings got this much right: duplicating control codes is basically a waste of very precious character code real estate. Sean PS I was not able to turn up ISO 6429:1983, but I did find ECMA-48, 4th Ed., December 1986, which has the following text: *** 5.4 Elements of the C1 Set These control functions are represented: - In a 7-bit code by 2-character escape sequences of the form ESC Fe, where ESC is represented by bit combination 01/11 and Fe is represented by a bit combination from 04/00 to 05/15. - In an 8-bit code by bit combinations from 08/00 to 09/15. *** This text is seemingly repeated in many analogous standards ca. ~1974 - ~1992. PPS I happen to have a copy of ANSI X3.41-1974 "American National Standard Code Extension Techniques for Use with the 7-Bit Coded Character Set of [ASCII]". The invention/existence of C1 goes back to this time, as does the use of ESC Fe to invoke C1 characters in a 7-bit code, and 0x80-0x9F to invoke C1 characters in an 8-bit code. (See, in particular, Clauses 5.3.3.1 and 5.3.6). In particular, Clause 7.3.1.2 says: "The use of ESC Fe sequence in an 8-bit environment is contrary to the intention of this standard but, should they occur, their meaning is the same as in the 7-bit environment." I can appreciate why it was desirable to "fold" C1 in an 8-bit environment into a 7-bit environment with ESC Fe. (If, in fact, that was the direction of standardization: invent a new thing and then devise a coding to express the new thing in the old thing.) It is less obvious why Unicode adopted C1, however, when the trend was to jettison the 94-character Tetris block assignments in favor of a wide-open field for character assignment. 
Except for the trend in Unicode to "avoid assigning characters when explicitly asked, unless someone implements them without asking, and the implementation catches on, and then just assign the whole lot of them, even when they overlap with existing assignments, and then invent composite characters, which further compound the possible overlapping combinations". ?? From verdy_p at wanadoo.fr Tue Oct 6 08:04:44 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 6 Oct 2015 15:04:44 +0200 Subject: Unicode in passwords In-Reply-To: <56134278.6010508@it.aoyama.ac.jp> References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <000601d0fbff$42881070$c7983150$@gmail.com> <56134278.6010508@it.aoyama.ac.jp> Message-ID: Note that Java strings DO allow the presence of lone surrogates, as well as non-characters, because Java strings are unrestricted vectors of 16-bit code units (non-BMP characters are handled as pairs of surrogates). Under those conditions, normalizing the Java string will leave those lone surrogates (and non-characters) as is, or will throw an exception, depending on the API used. Java strings do not have any implied encoding (their "char" members are also unrestricted 16-bit code units; they have some basic properties, but only in the BMP, defined in the builtin Character class API: properties for non-BMP characters require using a library to provide them, such as ICU4J). This is essentially the same kind of thing as C/C++ "wide" strings using 16-bit wchar_t, except that: - C/C++ wide strings do not allow the inclusion of U+0000, which is a terminator, unless you use a string class keeping the actual string length (and not just the allocated buffer length, which may be larger). - Java strings, including literals, are immutable, and optionally atomized into a global dictionary, which includes all string literals to share the storage space of multiple instances with equal contents, including across distinct classes from distinct packages. - This is also true for string literals, which are all immutable and atomized, and initialized from the compiled bytecode of classes using a modified version of UTF-8 that preserves all 16-bit code units (including lone surrogates and non-characters like U+FFFF) but also stores U+0000 as <0xC0,0x80>. This modified UTF-8 encoding is also what you get if you use the JNI interface version with 8-bit strings (this internally requires a conversion by JNI, using a temporary buffer); if you use the JNI interface version with 16-bit strings, you work directly with the internal 16-bit Java strings and there's no conversion: you'll also get the lone surrogates and all non-characters, and you are not restricted to only valid UTF-16. - Java strings are commonly used for fast initialization of large immutable binary arrays, because the conversion from Modified-UTF-8 to 16-bit strings does not require running any compiled bytecode (this is not true for other static arrays, which require large code for array literals and are not guaranteed to be immutable: the alternative to this large compiled code is to initialize those large static arrays by I/O from an external stream, such as a file beside the class in the same package, and possibly packed in the same JAR). Java passwords are "strings" but still allow the inclusion of arbitrary 16-bit code units, even if they violate UTF-16 restrictions. You will not get much difference if you use byte arrays, the only change being the difference of size of code units.
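Both behaviors are easy to observe with a small sketch (DataOutputStream.writeUTF is the JDK's documented modified-UTF-8 encoder; note that it prefixes its output with a two-byte length):

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class ModifiedUtf8Demo {
        public static void main(String[] args) throws IOException {
            String s = "A\uD805\u0000"; // a lone high surrogate and a NUL

            // Standard UTF-8: the encoder cannot represent the lone surrogate
            // and substitutes its default replacement byte; U+0000 is a single
            // 0x00 byte.
            System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));

            // Modified UTF-8: the surrogate is kept as the three-byte sequence
            // ED A0 85, and U+0000 becomes the two-byte pair C0 80.
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            new DataOutputStream(bytes).writeUTF(s);
            System.out.println(Arrays.toString(bytes.toByteArray()));
        }
    }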
Between those two representations you are free to convert them with ANY pair of encodings, and not just assuming UTF-8<>UTF-16. However, for security reasons, it's best to avoid string literals for passwords, because they can be enumerated from the global dictionary of atomized strings, or directly by reading the byte code of the compiled class, where they are stored in modified UTF-8 but loaded and used as arbitrary 16-bit strings (but the same is true if you use a byte array literal! you can just parse the initialization byte code to get the list of bytes). If passwords or authorization keys are stored somewhere (as strings or as byte arrays), they should be encrypted into safe storage and not placed in static string literals or byte array initializers (they will BOTH be clear text in the bytecode of the compiled class). In both cases, there is NO normalization applied implicitly or checked/enforced by the API (the only check that occurs is at class loading time for the Modified-UTF-8 encoding of string literals: if it is wrong, the class will not load at all and you'll get an invalid class exception; there's no such check at all for the encoding of byte array initializers, the only checks being the validity of the Java initializer byte code and the bounds of array indexes used by the initializer code). 2015-10-06 5:39 GMT+02:00 Martin J. Dürst : > On 2015/10/01 13:11, Jonathan Rosenne wrote: >> For languages such as Java, passwords should be handled as byte arrays >> rather than strings. This may make it difficult to apply normalization. >> > > Well, they should be received from the user interface as strings, then > normalized, then converted to byte arrays using a well-defined single > encoding. Somewhat tedious, but hopefully not difficult. > > Regards, Martin. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Tue Oct 6 08:13:25 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 6 Oct 2015 15:13:25 +0200 Subject: Unicode in passwords In-Reply-To: References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <20151005151200.GA7379@laperouse.bortzmeyer.org> Message-ID: I don't think it is a good idea for textual passwords to make differences based on the number of spaces. Being plain text, they are likely to be displayed in user interfaces in a way where the user will not see the spaces. Without trimming, users won't see the initial or final space, and the password input method may not display them either (e.g. in an HTML input form, or when using a button to generate passphrases that users must then copy-paste to their password manager or to some private text document). Some password storages will also implicitly trim and compress those strings (e.g. in a fixed-width column of a table in a database). There's also frequently no visual hint when entering or displaying those spaces, and compression occurs implicitly, or pass phrases may be line-wrapped in the middle, where you won't see the number of spaces. 2015-10-06 12:25 GMT+02:00 Julian Bradfield : > On 2015-10-06, Philippe Verdy wrote: > > Finally, note that passwords are not necessarily single identifiers > > (whitespaces and word separators are accepted, but whitespaces should > > require special handling with trimming (at both ends) and compression of > > multiple occurrences). > > Why would you trim or compress whitespace? Using multiple spaces seems a > perfectly legitimate way of making a password harder to guess.
> > -- > The University of Edinburgh is a charitable body, registered in > Scotland, with registration number SC005336. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Tue Oct 6 08:27:36 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 6 Oct 2015 15:27:36 +0200 Subject: Unicode in passwords In-Reply-To: <20151006084814.GA17135@laperouse.bortzmeyer.org> References: <57223A23-037F-4925-B2B1-EA1F4930E3ED@aist.go.jp> <20151006084814.GA17135@laperouse.bortzmeyer.org> Message-ID: And there are severe issues in this RFC for its case mapping profile: it requires converting "uppercase" characters to "lowercase", but these properties are not stable (see for example the history of Cherokee letters, changed from gc=Lo to gc=Lu when lowercase letters were added, with case pairs added at the same time; see also the addition of the capital sharp S for German). That RFC should have used the Unicode "Case Folding" algorithm, which is stable (case-folded strings are NOT necessarily all lowercase, they are just guaranteed to keep a single case variant, and case folding implies the use of compatibility normalization forms, i.e. NFKC or NFKD, to get the correct closure: the standard Unicode normalizations are also stable)! 2015-10-06 10:48 GMT+02:00 Stephane Bortzmeyer : > On Tue, Oct 06, 2015 at 12:57:51PM +0900, > Yoriyuki Yamagata wrote > a message of 33 lines which said: > > > FYI, IETF is working on this issue. See Internet Draft > > https://tools.ietf.org/html/draft-ietf-precis-saslprepbis-17 based > > on PRECIS framework RFC 7564 https://tools.ietf.org/html/rfc7564 > > As already mentioned on that list, the draft is no longer a draft, it > was published as an RFC, RFC 7613, two months ago > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Tue Oct 6 08:57:37 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 6 Oct 2015 15:57:37 +0200 Subject: Why Nothing Ever Goes Away In-Reply-To: <5613BD66.3080707@seantek.com> References: <20151005082457.665a7a7059d7ee80bb4d670165c8327d.c32c0ba233.wbe@email03.secureserver.net> <5612D05D.7000407@att.net> <5613BD66.3080707@seantek.com> Message-ID: 2015-10-06 14:24 GMT+02:00 Sean Leonard : > 2. The Unicode code charts are (deliberately) vague about U+0080, U+0081, >> and U+0099. All other C1 control codes have aliases to the ISO 6429 >> set of control functions, but in ISO 6429, those three control codes don't >> have any assigned functions (or names). >> > > On 10/5/2015 3:57 PM, Philippe Verdy wrote: > >> Also the aliases for C1 controls were formally registered in 1983 only >> for the two ranges U+0084..U+0097 and U+009B..U+009F for ISO 6429. >> > > If I may, I would appreciate another history lesson: > In ISO 2022 / 6429 land, it is apparent that the C1 controls are mainly > aliases for ESC 4/0 - 5/15. ( @ through _ ) This might vary depending on > what is loaded into the C1 register, but overall, it just seems like saving > one byte. > > Why was C1 invented in the first place? > Look for the history of EBCDIC and its adaptation/conversion with ASCII-compatible encodings: round-trip conversion was needed (using only a simple reordering of byte values, with no duplicates). EBCDIC has used many controls that were not part of C0 and were kept in the C1 set.
Ignore the 7-bit compatibility encoding using pairs; they were only needed for ISO 2022, but ISO 6429 defines a profile where those longer sequences are not needed and even forbidden in 8-bit contexts, or in contexts where aliases are undesirable and invalidated, such as security environments. With your thoughts, I would conclude that assigning characters in the G1 set was also a duplicate, because it is reachable with a C0 "shifting" control + a position of the G0 set. In that case ISO 8859-1 or Windows 1252 was also an unneeded duplication! And we would live today in a 7-bit only world. C1 controls have their own identity. The 7-bit encoding using ESC is just a hack to make them fit in 7 bits, and it only works where the ESC control is assumed to play this function according to ISO 2022, ISO 6429, or other similar old 7-bit protocols such as Videotext (which was widely used in France with the free "Minitel" terminal, long before the introduction of the Internet to the general public around 1992-1995). Today Videotext is definitely dead (the old call numbers for this slow service are now definitely defunct, the Minitels are recycled waste; they stopped being distributed and were replaced by applications on PCs connected to the Internet, but now all the old services are directly on the Internet and none of them use 7-bit encodings for their HTML pages, or their mobile applications). France has also definitely abandoned its old French version of ISO 646; there are no longer any printers supporting versions of ISO 646 other than ASCII, but they still support various 8-bit encodings. 7-bit encodings are things of the past (they were only justified at times when communication links were slow and generated lots of transmission errors, and the only implemented mechanism to check them was to use a single parity bit per character). Today we transmit long datagrams and prefer using check codes for the whole (such as CRCs, or error-correcting codes). 8-bit encodings are much easier and faster to process for transmitting not just text but also binary data. Let's forget the 7-bit world definitely. We have also abandoned the old UTF-7 in Unicode! I've not seen it used anywhere except in a few old emails sent at the end of the 90's, because many mail servers were still not 8-bit clean and silently transformed non-ASCII bytes in unpredictable ways or using unspecified encodings, or just silently dropped the high bit, assuming it was just a parity bit: at that time, emails were not sent with SMTP, but with the old UUCP protocol, and could take weeks to be delivered to the final recipient, as there was still no global routing infrastructure and many hops were necessary via non-permanent modem links. My opinion of UTF-7 is that it was just a temporary and experimental solution to help system admins and developers adopt the new UCS, including for their old 7-bit environments. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jcb+unicode at inf.ed.ac.uk Tue Oct 6 09:31:22 2015 From: jcb+unicode at inf.ed.ac.uk (Julian Bradfield) Date: Tue, 6 Oct 2015 15:31:22 +0100 (BST) Subject: Unicode in passwords References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <20151005151200.GA7379@laperouse.bortzmeyer.org> Message-ID: On 2015-10-06, Philippe Verdy wrote: > I don't think it is a good idea for textual passwords to make differences > based on the number of spaces.
Being plain text, they are likely to be > displayed in user interfaces in a way that the user will not see. Without This is true of all passwords. Passwords have to be typed by finger memory, not by looking at them (unless you're the type who puts them on sticky notes, in which case you type by looking at the text on the note). One doesn't normally see the characters, at best a count of characters. > trimming, users won't see an initial or final space, and the password > input method may not display them either (e.g. in an HTML input form, or All browsers I use display spaces in input boxes, and put blobs for hidden fields. Do you have evidence for broken input fields? > when using a button to generate passphrases that users must then copy-paste > to their password manager or to some private text document). Copy-paste works on all my systems, too - do you have evidence of broken copy-paste in this way? > Some password > stores will also implicitly trim and compress those strings (e.g. in a If it compresses it on setting, but doesn't compress it on testing, or vice versa, then that's a bug. If it does the same for setting and testing, it doesn't matter (except to compromise the crack-resistance of the password). > fixed-width column of a table in a database). There's also frequently no > visual hint when entering or displaying those spaces, and compression occurs Evidence? Maybe if you're typing a password into a Word document it's hard to count spaces, but why would you be doing that? -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From asmus-inc at ix.netcom.com Tue Oct 6 10:33:51 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Tue, 6 Oct 2015 08:33:51 -0700 Subject: Why Nothing Ever Goes Away In-Reply-To: <5613BD66.3080707@seantek.com> References: <20151005082457.665a7a7059d7ee80bb4d670165c8327d.c32c0ba233.wbe@email03.secureserver.net> <5612D05D.7000407@att.net> <5613BD66.3080707@seantek.com> Message-ID: <5613E9DF.8020406@ix.netcom.com> An HTML attachment was scrubbed... URL: From doug at ewellic.org Tue Oct 6 11:02:57 2015 From: doug at ewellic.org (Doug Ewell) Date: Tue, 06 Oct 2015 09:02:57 -0700 Subject: Why Nothing Ever Goes Away Message-ID: <20151006090257.665a7a7059d7ee80bb4d670165c8327d.f7c4b8601c.wbe@email03.secureserver.net> Asmus Freytag (t) wrote: > Nobody wanted to follow the IBM code page 437 (then still the most > widely used single byte vendor standard). Although to this day, the UN/LOCODE manual [1] still refers to 437 as "the standard United States character set" and claims that it "conforms to these ISO standards" (8859-1:1987 and 10646-1:1993). [1] http://www.unece.org/fileadmin/DAM/cefact/locode/2015-1_UNLOCODE_SecretariatNotes.pdf > Also, the overloading of 0x80-0xFF by Windows did not happen all at > once, earlier versions had left much of that space open, And it's still not completely filled, in any of the 125x code pages except for the quirky 1256. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From asmus-inc at ix.netcom.com Tue Oct 6 12:14:06 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Tue, 6 Oct 2015 10:14:06 -0700 Subject: Unicode in passwords In-Reply-To: References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <20151005151200.GA7379@laperouse.bortzmeyer.org> Message-ID: <5614015E.3010302@ix.netcom.com> An HTML attachment was scrubbed...
URL: From unicode at lindenbergsoftware.com Tue Oct 6 12:39:08 2015 From: unicode at lindenbergsoftware.com (Norbert Lindenberg) Date: Tue, 6 Oct 2015 10:39:08 -0700 Subject: Unicode in passwords In-Reply-To: References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <000601d0fbff$42881070$c7983150$@gmail.com> <56134278.6010508@it.aoyama.ac.jp> Message-ID: > On Oct 6, 2015, at 6:04 , Philippe Verdy wrote: > > In those conditions, normalizing the Java string will leave those lone surrogates (and non-characters) as is, or will throw an exception, depending on the API used. Java strings do not have any implied encoding (their "char" members are also unrestricted 16-bit code units; they have some basic properties, but only in the BMP, defined in the built-in Character class API: properties for non-BMP characters require using a library to provide them, such as ICU4J). The Java Character class was enhanced in J2SE 5.0 to support supplementary characters. The String class was specified to be based on UTF-16, and string processing throughout the platform was updated to support supplementary characters based on UTF-16. These changes have been available to the public since 2004. For a summary, see http://www.oracle.com/technetwork/articles/java/supplementary-142654.html Norbert From jcb+unicode at inf.ed.ac.uk Tue Oct 6 14:13:12 2015 From: jcb+unicode at inf.ed.ac.uk (Julian Bradfield) Date: Tue, 6 Oct 2015 20:13:12 +0100 (BST) Subject: Unicode in passwords References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <20151005151200.GA7379@laperouse.bortzmeyer.org> <5614015E.3010302@ix.netcom.com> Message-ID: On 2015-10-06, Asmus Freytag (t) wrote: > All browsers I use display spaces in input boxes, and put blobs for > hidden fields. Do you have evidence for broken input fields? > >
> Network keys. That interface seems to consistently give people a > choice to reveal the key.
? That's not broken in the way Philippe was discussing. > Copy-paste works on all my systems, too - do you have evidence of > broken copy-paste in this way? > >
> I've seen input fields where sites don't allow paste on the second > copy (the confirmation copy).
>
> Even for non-password things.
That's not relevantly broken, either - it's a design feature, to make sure you can type the password again (from finger memory!). -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From richard.wordingham at ntlworld.com Tue Oct 6 14:19:27 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 6 Oct 2015 20:19:27 +0100 Subject: Unicode in passwords In-Reply-To: References: <57223A23-037F-4925-B2B1-EA1F4930E3ED@aist.go.jp> <20151006084814.GA17135@laperouse.bortzmeyer.org> Message-ID: <20151006201927.603269d9@JRWUBU2> On Tue, 6 Oct 2015 11:21:42 +0200 Mark Davis ☕️ wrote: > While I think that RFC is useful, it has been interesting just how > many of the problems recounted on this list go far beyond it, often > having to do with UI issues. It would be useful to have a paper > somewhere that organizes all of the problems presented here, and > maybe makes a stab at describing techniques for handling them. Indeed, there are several different scenarios. The most prototypical are: 1) Initial access to a stand-alone computing device, the conventional logging on. In this case, it is usually risky to use anything but printable ASCII. 2) Internet passwords for use in privacy. Basically any non-trivial combination of characters should be acceptable, provided it will not be mangled in transmission. Under the rules of Unicode, this means that the text should be normalised before becoming a mere sequence of bytes. Note that in the second scenario, there is normally an 'administrator' who can put things right. Richard. From richard.wordingham at ntlworld.com Tue Oct 6 14:57:34 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 6 Oct 2015 20:57:34 +0100 Subject: Unicode in passwords In-Reply-To: References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <20151005151200.GA7379@laperouse.bortzmeyer.org> <5614015E.3010302@ix.netcom.com> Message-ID: <20151006205734.038a869f@JRWUBU2> On Tue, 6 Oct 2015 20:13:12 +0100 (BST) Julian Bradfield wrote: > On 2015-10-06, Asmus Freytag (t) wrote: > > All browsers I use display spaces in input boxes, and put blobs for > > hidden fields. Do you have evidence for broken input fields? > > > >
> > Network keys. That interface seems to consistently give people a > > choice to reveal the key.
> > ? That's not broken in the way Philippe was discussing. No, but if you make the password up as you type it, you might not then notice that you accidentally typed a double space. > > Copy-paste works on all my systems, too - do you have evidence of > > broken copy-paste in this way? > > > >
> > I've seen input fields where sites don't allow paste on the > > second copy (the confirmation copy).
> >
> > Even for non-password things.
> > That's not relevantly broken, either - it's a design feature, to make > sure you can type the password again (from finger memory!). It's an interesting issue for a password that one can't type. It's by no means a guarantee, either. I once specified a new password that changed case in the middle, not realising that I had started with caps lock on. Consequently, both copies had the wrong capitalisation. I was using a wireless keyboard, which to conserve battery power doesn't have a caps lock indicator. (In the old days, caps lock would have physically locked, but that's not how keyboard drivers work nowadays.) It took a little while before it occurred to me that I might have had a problem with caps lock. Richard. From richard.wordingham at ntlworld.com Tue Oct 6 15:14:59 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 6 Oct 2015 21:14:59 +0100 Subject: Why Nothing Ever Goes Away In-Reply-To: References: <20151005082457.665a7a7059d7ee80bb4d670165c8327d.c32c0ba233.wbe@email03.secureserver.net> <5612D05D.7000407@att.net> <5613BD66.3080707@seantek.com> Message-ID: <20151006211459.0c9a4399@JRWUBU2> On Tue, 6 Oct 2015 15:57:37 +0200 Philippe Verdy wrote: > My opinion of UTF-7 is that > it was just a temporary and experimental solution to help system > admins and developers adopt the new UCS, including for their old > 7-bit environments. If you have a human controlling the interpretation, UTF-7 was a good way of avoiding data being mangled by interfaces that insisted that unlabelled (indeed, sometimes, unlabellable) 8-bit text was UTF-8 or, conversely, Latin-1 or code page 1252. The old Yahoo groups web interface for senders was pretty much restricted to 8-bit ISO-2022 encodings without it. C1 characters would be converted to Latin-1 on the assumption that they were Windows 1252. Browsers dropping UTF-7 support was a major inconvenience. Richard. From verdy_p at wanadoo.fr Tue Oct 6 15:43:55 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 6 Oct 2015 22:43:55 +0200 Subject: Unicode in passwords In-Reply-To: References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <20151005151200.GA7379@laperouse.bortzmeyer.org> Message-ID: 2015-10-06 16:31 GMT+02:00 Julian Bradfield : > On 2015-10-06, Philippe Verdy wrote: > > I don't think it is a good idea for textual passwords to make differences > > based on the number of spaces. Being plain text, they are likely to be > > displayed in user interfaces in a way that the user will not see. > Without > > This is true of all passwords. Passwords have to be typed by finger > memory, not by looking at them (unless you're the type who puts them > on sticky notes, in which case you type by looking at the text on the > note). One doesn't normally see the characters, at best a count of > characters. > > > trimming, users won't see an initial or final space, and the password > > input method may not display them either (e.g. in an HTML input form, or > > All browsers I use display spaces in input boxes, and put blobs for > hidden fields. Do you have evidence for broken input fields? > I was speaking of OUTPUT fields: you want to display passwords that are stored somewhere (including in a text document stored in some safe place such as an external flash drive). People can't remember many passwords.
Hiding them on screen is fake security; what we need is complex passwords (difficult to memorize, so we need a wallet to store them, though people will also **print** them rather than store them in an electronic format), and many passwords (one for each site or application requiring one). But they also want to be able to type them correctly: long passwords hidden on screen will not help much (hidden passwords in input forms are just there to avoid spying eyes on your screen, but people can still spy on your keystrokes...). If people are concerned by eyes, they'll need to hide their keyboard input (notably on touch screens!) but also their screen, by first making sure there's nobody around to look at what they do. If there's a camera, hiding the password on screen will not help either; it will still be easy to see your keystrokes. Biometric identification is another form of fake security (because it is immutable, whereas passwords can be and should be changed regularly), and it is extremely easy to duplicate a biometric data record (to be more effective, the physical sensor device should be internally secured and its internal data instantly flushed in case of intrusion, and this device should be securely authenticated in addition to performing the biometric check; but the biometric data should not be transmitted, instead it should be used to compute a secure hash from the hidden biometric data and negotiated and checked unique randomized data from the source requesting the access; it should use public key encryption with a pair of public/private key pairs, not symmetric keys, or triple key pairs if using another independent third party: the private keys will never be exchanged or duplicated). But at some point you'll need to reset those keys, and the only tool you'll have will be to use cleartext pass phrases, even if there's physical device identification, encryption with key pairs, and the extremely private biometric data. Unfortunately biometric data is now shared with governmental third parties, and even exchanged internationally (it is present on passports, and biometric passports are now mandatory for anyone taking a plane to/from/via the United States and now in many European countries as well; DNA traces are also very easy to capture. Biometric data is no longer private property; it cannot be used as a secret for access authentication or signatures). There's still nothing to replace pass phrases, and those need to be user friendly for their legitimate owners. -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Tue Oct 6 15:53:00 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 6 Oct 2015 22:53:00 +0200 Subject: Unicode in passwords In-Reply-To: <20151006205734.038a869f@JRWUBU2> References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <20151005151200.GA7379@laperouse.bortzmeyer.org> <5614015E.3010302@ix.netcom.com> <20151006205734.038a869f@JRWUBU2> Message-ID: 2015-10-06 21:57 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > It's an interesting issue for a password that one can't type. It's by > no means a guarantee, either. I once specified a new password that > changed case in the middle, not realising that I had started with caps > lock on. Consequently, both copies had the wrong capitalisation. I > was using a wireless keyboard, which to conserve battery power doesn't > have a caps lock indicator.
(In the old days, caps lock would have > physically locked, but that's not how keyboard drivers work nowadays.) > It took a little while before it occurred to me that I might have had a > problem with caps lock. > This is a demonstration that using case differences to add more combinations in short passwords is a bad design. Likewise, hiding typed input is not a good idea: we need at least a pressable button to reveal/confirm what we are typing. Instead of lettercase combinations limited to ASCII, it is highly preferable to extend the character repertoire to Unicode and accept letters in NFKC form, unified by case folding (NOT conversion to lowercase or uppercase, as that is not stable across Unicode versions). So we should define here the usable set of characters (and define characters that should be ignored and discarded if present on input). This should be a profile in UAX #31 (and we should issue a strong warning against the recent RFC that forgot the issue: its case-insensitive profile based on NFC and conversion to lowercase is definitely broken!) -------------- next part -------------- An HTML attachment was scrubbed... URL: From naz at gassiep.com Tue Oct 6 22:28:54 2015 From: naz at gassiep.com (Naz Gassiep) Date: Wed, 7 Oct 2015 14:28:54 +1100 Subject: Proposals for Arabic honorifics In-Reply-To: <5612292B.8040208@gassiep.com> References: <5612292B.8040208@gassiep.com> Message-ID: <56149176.5060404@gassiep.com> If there are no comments on this specific issue, could someone care to comment on the idea of writing a proposal that extends an existing proposal? Is this considered bad form, or is it OK so long as it doesn't unnecessarily raise conflicting proposals? - Naz. On 5/10/2015 6:39 PM, Naz Gassiep wrote: > Hi all, > We are considering writing a proposal for Arabic honorifics which are > missing from Unicode. There are already a few in there, notably U+FDFA > and U+FDFB. > > There are two existing proposals, L2/14-147 and L2/14-152, which each > propose additions. L2/14-147 proposes seventeen new characters and > L2/14-152 proposes a further two. > > There are a few other characters that are not included in these > proposals, and I was considering preparing a proposal of my own. I > will work with a team of people who are willing to contribute time to > this work. We are considering two options: > > 1. Prepare an additional proposal for the characters that were missing > from the existing spec and also from the two proposals mentioned above. > 2. Prepare a collating proposal which rolls the two proposals as well > as the others that we feel are missing into a single proposal. > > Currently, we favour the second option. We would ensure that full > descriptions, names, character properties, and detailed examples are > provided for each character to substantiate its use in modern plain > text. We would also suggest code points in line with the existing > proposal L2/14-147. > > We don't want to step on the toes of the original submitters, Roozbeh > Pournader or Lateef Sagar Shaikh. We wish to be clear that we will > draw on their existing proposals to the maximum extent possible to > ensure that we do not submit a conflicting proposal, but a superset > proposal that incorporates their proposals as well as the additional > characters we have identified. We have evaluated these two, and a true > superset proposal is possible such that no conflicts between either > those two proposals or our own will materialize.
> > Are there any issues that we may face in preparing and submitting our > proposal? > Any guidance from this mailing list would be highly valued. > Many thanks, > - Naz. From lisam at us.ibm.com Tue Oct 6 23:02:32 2015 From: lisam at us.ibm.com (Lisa Moore) Date: Tue, 6 Oct 2015 21:02:32 -0700 Subject: Proposals for Arabic honorifics In-Reply-To: <56149176.5060404@gassiep.com> References: <5612292B.8040208@gassiep.com> <56149176.5060404@gassiep.com> Message-ID: <201510070402.t9742crx026410@d01av03.pok.ibm.com> Hello Naz, Thank you for discussing your proposal on the unicode list. Not all experts monitor that list. That said, feel free to submit a proposal to "docsubmit at unicode.org". We look forward to seeing your proposal. Lisa From: Naz Gassiep To: unicode at unicode.org Date: 10/06/2015 08:50 PM Subject: Re: Proposals for Arabic honorifics Sent by: "Unicode" If there are no comments on this specific issue, could someone care to comment on the idea of writing a proposal that extends an existing proposal? Is this considered bad form, or is it OK so long as it doesn't unnecessarily raise conflicting proposals? - Naz. On 5/10/2015 6:39 PM, Naz Gassiep wrote: > Hi all, > We are considering writing a proposal for Arabic honorifics which are > missing from Unicode. There are already a few in there, notably U+FDFA > and U+FDFB. > > There are two existing proposals, L2/14-147 and L2/14-152, which each > propose additions. L2/14-147 proposes seventeen new characters and > L2/14-152 proposes a further two. > > There are a few other characters that are not included in these > proposals, and I was considering preparing a proposal of my own. I > will work with a team of people who are willing to contribute time to > this work. We are considering two options: > > 1. Prepare an additional proposal for the characters that were missing > from the existing spec and also from the two proposals mentioned above. > 2. Prepare a collating proposal which rolls the two proposals as well > as the others that we feel are missing into a single proposal. > > Currently, we favour the second option. We would ensure that full > descriptions, names, character properties, and detailed examples are > provided for each character to substantiate its use in modern plain > text. We would also suggest code points in line with the existing > proposal L2/14-147. > > We don't want to step on the toes of the original submitters, Roozbeh > Pournader or Lateef Sagar Shaikh. We wish to be clear that we will > draw on their existing proposals to the maximum extent possible to > ensure that we do not submit a conflicting proposal, but a superset > proposal that incorporates their proposals as well as the additional > characters we have identified. We have evaluated these two, and a true > superset proposal is possible such that no conflicts between either > those two proposals or our own will materialize. > > Are there any issues that we may face in preparing and submitting our > proposal? > Any guidance from this mailing list would be highly valued. > Many thanks, > - Naz. -------------- next part -------------- An HTML attachment was scrubbed...
URL: From jcb+unicode at inf.ed.ac.uk Wed Oct 7 04:59:57 2015 From: jcb+unicode at inf.ed.ac.uk (Julian Bradfield) Date: Wed, 7 Oct 2015 10:59:57 +0100 Subject: Unicode in passwords References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <20151005151200.GA7379@laperouse.bortzmeyer.org> Message-ID: On 2015-10-06, Philippe Verdy wrote: > I was speaking of OUTPUT fields : you want to display passwords that are > stored somewhere (including in a text document stored in some safe place > such as an external flash drive). People can't remember many passwords. Again, output fields (such as in the Firefox password manager), in my experience, display the text that is in them, not a stripped and compressed version. If they don't, it's a bug. If you start using passwords including NBSP and EM-DASH, then it's going to get a bit awkward - but you should know you're doing that, and take measures accordingly. > Hiding them on screen is a fake security, what we need is complex passwords > (difficult to memoize so we need a wallet to store them but people will > also **printing** them and not store them in a electronic format), and many It's questionable whether there is ever a need to print a password, except in the case of an automatically generated hard-copy password reset. My digital will (if I'd produced one) would need about half a dozen passwords, mainly the master password for the password manager, plus some sensitive finance and system admin ones. That's few enough to write down by hand (or type by hand into a text file), with appropriate notes. > passwords (one for each site or application requiring one). But they also > want to be able to type them correctly: long passwords hidden on screen Most of our students seem (when I see them logging in to give presentations) to have long passwords - 20-30 characters - and they don't seem to have a problem. This also illustrates why defaulting to hidden passwords is useful. > Biometric identification is also another fake security (because it is Not sure what this has to do with Unicode in passwords. > immutable, when passwords can be and should be changed regularly) and it is Bruce Schneier is one of the best known and most respected security researchers around today, and here's his advice: So in general: you don't need to regularly change the password to your computer or online financial accounts (including the accounts at retail sites); definitely not for low-security accounts. You should change your corporate login password occasionally, and you need to take a good hard look at your friends, relatives, and paparazzi before deciding how often to change your Facebook password. But if you break up with someone you've shared a computer with, change them all. ( https://www.schneier.com/blog/archives/2010/11/changing_passwo.html ) -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. 
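Across these exchanges, the one concrete recipe that keeps recurring is the one Martin Dürst stated earlier in this digest: receive the password as a string, normalize it, then convert it to bytes with a single well-defined encoding before hashing. A minimal sketch in Java (the class and method names are illustrative; NFC and SHA-256 are example choices, and a production system would use a salted, deliberately slow password-hashing function such as PBKDF2 or bcrypt rather than a bare digest):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.text.Normalizer;

    public class PasswordDigest {
        static byte[] digest(String password) throws NoSuchAlgorithmException {
            // Canonically equivalent inputs (precomposed vs. decomposed accents)
            // normalize to the same string, so they hash identically.
            String normalized = Normalizer.normalize(password, Normalizer.Form.NFC);
            byte[] utf8 = normalized.getBytes(StandardCharsets.UTF_8);
            return MessageDigest.getInstance("SHA-256").digest(utf8);
        }
    }

A case-insensitive variant would additionally apply Unicode case folding before hashing (e.g. ICU4J's UCharacter.foldCase), which is precisely the stability point argued over in the RFC 7613 exchange that follows.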
From bortzmeyer at nic.fr Wed Oct 7 06:16:16 2015 From: bortzmeyer at nic.fr (Stephane Bortzmeyer) Date: Wed, 7 Oct 2015 13:16:16 +0200 Subject: Unicode in passwords In-Reply-To: References: <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <20151005151200.GA7379@laperouse.bortzmeyer.org> <5614015E.3010302@ix.netcom.com> <20151006205734.038a869f@JRWUBU2> Message-ID: <20151007111615.GA25230@nic.fr> On Tue, Oct 06, 2015 at 10:53:00PM +0200, Philippe Verdy wrote a message of 72 lines which said: > it is highly preferable to extend the character repertoire to > Unicode and accept letters in NFKC form and unified by case folding As I said before, "the ship has sailed". RFC 7613 has been published, and uses NFC and case preservation. It is IMHO useless to reopen this discussion. > the recent RFC that forgot the issue : its case-insensitive profile > based on NFC and conversion to lowercase is definitely broken !) What is broken is your analysis. RFC 7613 does not convert passwords to lowercase. Indeed, it says exactly the opposite, which seems to indicate that you did not read it before calling it broken: Case-Mapping Rule: Uppercase and titlecase characters MUST NOT be mapped to their lowercase equivalents. From verdy_p at wanadoo.fr Wed Oct 7 06:46:06 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 7 Oct 2015 13:46:06 +0200 Subject: Unicode in passwords In-Reply-To: <20151007111615.GA25230@nic.fr> References: <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <20151005151200.GA7379@laperouse.bortzmeyer.org> <5614015E.3010302@ix.netcom.com> <20151006205734.038a869f@JRWUBU2> <20151007111615.GA25230@nic.fr> Message-ID: 2015-10-07 13:16 GMT+02:00 Stephane Bortzmeyer : > On Tue, Oct 06, 2015 at 10:53:00PM +0200, > Philippe Verdy wrote > a message of 72 lines which said: > > > it is highly preferable to extend the character repertoire to > > Unicode and accept letters in NFKC form and unified by case folding > > As I said before, "the ship has sailed". RFC 7613 has been published, > and uses NFC and case preservation. It is IMHO useless to reopen this > discussion. > Reread the RFC: it discusses the case-insensitive profile using NFC and conversion to lowercase; this is the bug. > > > the recent RFC that forgot the issue : its case-insensitive profile > > based on NFC and conversion to lowercase is definitely broken !) > > What is broken is your analysis. RFC 7613 does not convert passwords > to lowercase. Indeed, it says exactly the opposite, which seems to > indicate that you did not read it before calling it broken: > > Case-Mapping Rule: Uppercase and titlecase characters MUST NOT be > mapped to their lowercase equivalents. > You are reading the other section, for the case-sensitive profile (in SASLprep, section 6.1), which is absolutely not forbidden for user names, and already an established practice for many decades (email addresses, local user names in Windows...), and this very new RFC will not change this practice any time soon. -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Wed Oct 7 11:10:14 2015 From: doug at ewellic.org (Doug Ewell) Date: Wed, 07 Oct 2015 09:10:14 -0700 Subject: Unicode in passwords Message-ID: <20151007091014.665a7a7059d7ee80bb4d670165c8327d.79ab53c881.wbe@email03.secureserver.net> Philippe Verdy wrote: > This is a demonstration that using case differences to add more > combinations in short passwords is a bad design.
But more and more organizations and banks and supermarket rewards programs are demanding it, along with "at least one digit" and "at least one 'special' character" and "at least N characters in length" and "must change every N days" -- regardless of what Bruce Schneier or anyone else says. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From doug at ewellic.org Wed Oct 7 11:11:52 2015 From: doug at ewellic.org (Doug Ewell) Date: Wed, 07 Oct 2015 09:11:52 -0700 Subject: Why Nothing Ever Goes Away Message-ID: <20151007091152.665a7a7059d7ee80bb4d670165c8327d.4dce1a05ee.wbe@email03.secureserver.net> Richard Wordingham wrote: > Browsers dropping UTF-7 support was a major inconvenience. Especially when the real problem with cross-site scripting was *auto-detection* of UTF-7. Requiring users to override the encoding and select UTF-7 manually would have solved most problems. Dropping UTF-7 entirely was not necessary. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From verdy_p at wanadoo.fr Thu Oct 8 11:14:38 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 8 Oct 2015 18:14:38 +0200 Subject: Unicode in passwords In-Reply-To: <20151007091014.665a7a7059d7ee80bb4d670165c8327d.79ab53c881.wbe@email03.secureserver.net> References: <20151007091014.665a7a7059d7ee80bb4d670165c8327d.79ab53c881.wbe@email03.secureserver.net> Message-ID: They demand such passwords only for their web services, which are accessed by web browsers, not for booting devices. Being on the web, the protocols are based on HTML and web browsers. As the web is now Unicode, with UTF-8 in a vast majority of contents, those web services are already UTF-8 ready (it is also a requirement on those web interfaces used by banks to have JavaScript support). So restricting those web passwords to ASCII only is a bad choice: to extend the usable charset, forcing the inclusion of ASCII capitals and punctuation is not sufficient. There is certainly a better way to extend the set to include as well all characters supported by browser input methods for the targeted languages, and that are still easy to type on most devices (this means not adding characters not supported by old versions of Windows or by basic smartphones). With an extended repertoire (not restricted to ASCII, thanks to UTF-8 on the web), password lengths could remain relatively short and easy to type and remember (the alternative using passphrases also requires being able to type words in the local language in its basic orthography; some compatibility normalization, as well as case folding, will be helpful to provide good interoperability across client devices, where typing letters with mixed case is frequently very inconvenient on touch devices, as well as for people with disabilities who type with only one finger). Still, there are many banks whose passwords are limited to only basic decimal digits, and limited to at most 8 of them. As this is not enough, the input forms will also request other numbers that people frequently cannot easily remember. Others will use two-factor authentication using mobile phones and confirmation codes sent by SMS, or will send an additional code in physical letters; they will take footprints of the browser or IMEI code of the smartphone used, with preapproval required before trusting devices, or verify the number of a physical credit card by processing a €0.00 online payment with it, and some pseudo "secret" questions (social security number, identity card/passport/driver licence number...)
but some are very weak and ask for something that is rarely secret, such as the birth date (Facebook initially published it by default to anyone without asking when you created the account; now it is private by default, except for the birthday application enabled by default and notifying all "friends". But too late for those who had created their account years ago, it is now public for eternity even if it can be hidden on the current version of profiles... similar fake secrets are names of family members and pets, as all the info is) 2015-10-07 18:10 GMT+02:00 Doug Ewell : > Philippe Verdy wrote: > > > This is a demonstration that using case differences to add more > > combinations in short passwords is a bad design. > > But more and more organizations and banks and supermarket rewards > programs are demanding it, along with "at least one digit" and "at least > one 'special' character" and "at least N characters in length" and "must > change every N days" -- regardless of what Bruce Schneier or anyone else > says. > > -- > Doug Ewell | http://ewellic.org | Thornton, CO ???? > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Thu Oct 8 15:09:07 2015 From: doug at ewellic.org (Doug Ewell) Date: Thu, 08 Oct 2015 13:09:07 -0700 Subject: Unicode in passwords Message-ID: <20151008130907.665a7a7059d7ee80bb4d670165c8327d.df792f40da.wbe@email03.secureserver.net> Philippe Verdy wrote: > They demand such passwords only for their web services, which are > accessed by web browsers. Not for booting devices. My company enforces all of the password restrictions I listed, as well as "ASCII only," for access both to individual PCs and to the company network. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From jsoconner at gmail.com Fri Oct 9 12:26:02 2015 From: jsoconner at gmail.com (John O'Conner) Date: Fri, 09 Oct 2015 17:26:02 +0000 Subject: Unicode Digest, Vol 22, Issue 9 In-Reply-To: References: Message-ID: As a response to all the issues I've learned from everyone, my immediate recommendation for my company's current policy is to constrain our passwords to printable ASCII now, in order to buy time to learn more about all the issues that you and others have mentioned. I appreciate all the feedback on the topic. Clearly there are issues to consider, and I'll make an effort to gather up all the issues everyone mentioned into a single, consolidated list. On Fri, Oct 9, 2015 at 10:00 AM wrote: > > From: Doug Ewell > To: verdy_p at wanadoo.fr > Cc: Unicode Mailing List > Date: Thu, 08 Oct 2015 13:09:07 -0700 > Subject: RE: Unicode in passwords > Philippe Verdy wrote: > > > They demand such passwords only for their web services, which are > > accessed by web browsers. Not for booting devices. > > My company enforces all of the password restrictions I listed, as well > as "ASCII only," for access both to individual PCs and to the company > network. > > -- > Doug Ewell | http://ewellic.org | Thornton, CO ???? > > > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From lists+unicode at seantek.com Fri Oct 9 13:05:54 2015 From: lists+unicode at seantek.com (Sean Leonard) Date: Fri, 9 Oct 2015 11:05:54 -0700 Subject: Why Nothing Ever Goes Away In-Reply-To: <5613E9DF.8020406@ix.netcom.com> References: <20151005082457.665a7a7059d7ee80bb4d670165c8327d.c32c0ba233.wbe@email03.secureserver.net> <5612D05D.7000407@att.net> <5613BD66.3080707@seantek.com> <5613E9DF.8020406@ix.netcom.com> Message-ID: <56180202.4010905@seantek.com> Satisfactory answers, thank you very much. Going back to doing more research. (Silence does not imply abandoning the C1 Control Pictures project; just a lot to synthesize.) Regarding the three points U+0080, U+0081, and U+0099: the fact that Unicode defers mostly to ISO 6429 and other standards before its time (e.g., ANSI X3.32 / ISO 2047 / ECMA-17) means that it is not particularly urgent that those code points get Unicode names. I also do not find that their lack of definition precludes pictorial representations. In the current U+2400 block, the Standard says: "The diagonal lettering glyphs are only exemplary; alternate representations may be, and often are used in the visible display of control codes" (Section 22.7). I am now in possession of a copy of ANSI X3.32-1973 and ECMA-17:1968 (the latter is available on ECMA's website). I find it worthwhile to point out that the Transmission Controls and Format Effectors were not standardized by the time of ECMA-17:1968, but the symbols are the same nonetheless. ANSI X3.32-1973 has the standardized control names for those characters. Sean On 10/6/2015 6:57 AM, Philippe Verdy wrote: > > 2015-10-06 14:24 GMT+02:00 Sean Leonard >: > > 2. The Unicode code charts are (deliberately) vague about > U+0080, U+0081, > and U+0099. All other C1 control codes have aliases to the ISO > 6429 > set of control functions, but in ISO 6429, those three control > codes don't > have any assigned functions (or names). > > > On 10/5/2015 3:57 PM, Philippe Verdy wrote: > >> Also the aliases for C1 controls were formally registered in > 1983 only for the two ranges U+0084..U+0097 and U+009B..U+009F > for ISO 6429. > > > If I may, I would appreciate another history lesson: > In ISO 2022 / 6429 land, it is apparent that the C1 controls are > mainly aliases for ESC 4/0 - 5/15. ( @ through _ ) This might vary > depending on what is loaded into the C1 register, but overall, it > just seems like saving one byte. > > Why was C1 invented in the first place? > > > Look for the history of EBCDIC and its adaptation/conversion with > ASCII-compatible encodings: round-trip conversion was needed (using > only a simple reordering of byte values, with no duplicates). EBCDIC > has used many controls that were not part of C0 and were kept in the > C1 set. Ignore the 7-bit compatibility encoding using pairs, they were > only needed for ISO 2022, but ISO 6429 defines a profile where those > longer sequences are not needed and even forbidden in 8-bit contexts > or in contexts where aliases are undesirable and invalidated, such as > security environments. > > With your thoughts, I would conclude that assigning characters in the > G1 set was also a duplicate, because it is reachable with a C0 > "shifting" control + a position of the G0 set. In that case ISO 8859-1 > or Windows 1252 was also an unneeded duplication! And we would live > today in a 7-bit only world. >
The 7-bit encoding using ESC is > just a hack to make them fit in 7-bit and it only works where the ESC > control is assumed to play this function according to ISO 2022, ISO > 6429, or other similar old 7-bit protocols such as Videotext (which > was widely used in France with the free "Minitel" terminal, long > before the introduction of the Internet to the general public around > 1992-1995). > > Today Videotext is definitely dead (the old call numbers for this slow > service are now definitely defunct, the Minitels are recycled wastes, > they stopped being distributed and replaced by applications on PC > connected to the Internet, but now all the old services are directly > on the internet and none of them use 7-bit encodings for their HTML > pages, or their mobile applications). France has also definitely > abandoned its old French version of ISO 646, there are no longer any > printer supporting versions of ISO 646 other than ASCII, but they > still support various 8-bit encodings. > > 7-bit encodings are things of the past (they were only justified at > times where communication links were slow and generated lots of > transmission errors, and the only implemented mecanism to check them > was to use a single parity bit per character. Today we transmit long > datagrams and prefer using checks codes for the whole (such as CRC, or > autocorrecting codes). 8-bit encodings are much easier and faster to > process for transmitting not just text but also binary data. > > Let's forget the 7-bit world definitely. We have also abandonned the > old UTF-7 in Unicode ! I've not seen it used anywhere except in a few > old emails sent at end of the 90's, because many mail servers were > still not 8-bit clean and silently transformed non-ASCII bytes in > unpredictable ways or using unspecified encodings, or just siltently > dropped the high bit, assuming it was just a parity bit : at that > time, emails were not sent with SMTP, but with the old UUCP protocol > and could take weeks to be delivered to the final recipient, as there > was still no global routing infrastructure and many hops were > necessary via non-permanent modem links. My opinion of UTF-7 is that > it was just a temporary and experimental solution to help system > admins and developers adopt the new UCS, including for their old 7-bit > environments. On 10/6/2015 8:33 AM, Asmus Freytag (t) wrote: > On 10/6/2015 5:24 AM, Sean Leonard wrote: >> And, why did Unicode deem it necessary to replicate the C1 block at >> 0x80-0x9F, when all of the control characters (codes) were equally >> reachable via ESC 4/0 - 5/15? I understand why it is desirable to >> align U+0000 - U+007F with ASCII, and maybe even U+0000 - U+00FF with >> Latin-1 (ISO-8859-1). But maybe Windows-1252, MacRoman, and all the >> other non-ISO-standardized 8-bit encodings got this much right: >> duplicating control codes is basically a waste of very precious >> character code real estate > > Because Unicode aligns with ISO 8859-1, so that transcoding from that > was a simple zero-fill to 16 bits. > > 8859-1 was the most widely used single byte (full 8-bit) ISO standard > at the time, and making that transition easy was beneficial, both > practically and politically. > > Vendor standards all disagreed on the upper range, and it would not > have been feasible to single out any of them. Nobody wanted to follow > the IBM code page 437 (then still the most widely used single byte > vendor standard). 
> > > Note, that by "then" I refer to dates earlier than the dates of the > final drafts, because may of those decisions date back to earlier > periods where the drafts were first developed.Also, the overloading of > 0x80-0xFF by Windows did not happen all at once, earlier versions had > left much of that space open, but then people realized that as long as > you were still limited to 8 bits, throwing away 32 codes was an issue. > > Now, for Unicode, 32 out of 64K values (initially) or 1114112 (now), > don't matter, so being "clean" didn't cost much. (Note that even for > UTF-8, there's no special benefit of a value being inside that second > range of 128 codes. > > Finally, even if the range had not been dedicated to C1, the 32 codes > would have had to be given space, because the translation into ESC > sequences is not universal, so, in transcoding data you needed to have > a way to retain the difference between the raw code and the ESC > sequence, or your round-trip would not be lossless. > > A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From lists+unicode at seantek.com Fri Oct 9 13:32:38 2015 From: lists+unicode at seantek.com (Sean Leonard) Date: Fri, 9 Oct 2015 11:32:38 -0700 Subject: Pictorial Representations of BS and DEL Message-ID: <56180846.3030504@seantek.com> Hello: As we continue to riff on the history of character encodings, I am searching for the most accurate standards-based pictorial representations of BS (U+0008) and DEL (U+007F) in Unicode. ECMA-17:1968 and ANSI X3.32-1973 depict U+0008 as an arrow pointing from the bottom-right to the top-left, slightly arced upwards. They depict U+007F as a filled box symbol comprised of five diagonal slashes oriented from bottom-left to top-right, with no border. All of the other control pictures (from those standards) have specific code point assignments in Unicode. Whether those glyphs are used for U+2400 et. seq. is, of course, up to the font designer. But it's nice to know they are there as fallbacks. What are the most accurate pictorial representations in the existing Unicode Standard for BS and DEL? Frankly there are so many arrows that it's hard to make heads or tails (pun intended) out of which one is the best. However, in all the arrows I looked for, I did not see one that was a sufficiently close match. There is another standard governing these sorts of things, namely ISO 9995. I would not be surprised if it has something to say about Backspace, as the Backspace keytop is standardized to look like: Backspace <--- Note that there are still many left-pointing arrows in the Unicode standard, so which Unicode left-pointing arrow is the closest one to the one typically printed on a keytop? Regarding DEL: ? U+25A8 SQUARE WITH UPPER RIGHT TO LOWER LEFT FILL is close, but it has a black box border. ? U+2425 SYMBOL FOR DELETE FORM TWO is depicted as three slashes in the middle, not five slashes, and is from ISO 9995-7. It is a symbol for "undoable delete". I presume that the omission of the fourth and fifth slashes is intentional. ? U+2302 HOUSE is the corresponding grapheme in Code Page 437, and so many people would probably be familiar with using this to depict U+007F. But we are trying to bury Code Page 437. (Note: this is relevant to the C1 control character CCH U+0094, which is intended to eliminate ambiguity about the meaning of BS. Arguably ? 
U+232B ERASE TO THE LEFT is the most appropriate for CCH, but it could also be used for BS, and that is the problem, because BS is more nebulous but far, far more ubiquitous than CCH.) Sean From bugraaydin1999 at gmail.com Sat Oct 10 07:59:14 2015 From: bugraaydin1999 at gmail.com (patapatachakapon .) Date: Sat, 10 Oct 2015 15:59:14 +0300 Subject: Rights to the Emoji Message-ID: Hello, I work for a small company in Turkey. We would like to import/sell products that have pictures of Emoji on them (such as keychains, cups, etc.) here in Turkey. The Emoji we would like to use on our products are the ones that are titled Native on the chart that I've attached to this email. I would like to know whether or not it's required to buy the rights to these Emoji. Are Emoji copyrighted, or can they be used by anyone for design purposes? Thanks so much in advance! -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: emoji.jpg Type: image/jpeg Size: 84967 bytes Desc: not available URL: From magnus at bodin.org Sat Oct 10 12:25:59 2015 From: magnus at bodin.org (=?UTF-8?Q?Magnus_Bodin_=E2=98=80?=) Date: Sat, 10 Oct 2015 19:25:59 +0200 Subject: Rights to the Emoji In-Reply-To: References: Message-ID: This might shed some light: http://words.steveklabnik.com/emoji-licensing On Sat, Oct 10, 2015 at 2:59 PM, patapatachakapon . < bugraaydin1999 at gmail.com> wrote: > Hello, > > I work for a small company in Turkey. We would like to import/sell > products that have pictures of Emoji on them (such as keychains, cups, etc.) > here in Turkey. The Emoji we would like to use on our products are the > ones that are titled Native on the chart that I've attached to this email. > I would like to know whether or not it's required to buy the rights to these > Emoji. Are Emoji copyrighted, or can they be used by anyone for design > purposes? > > Thanks so much in advance! > -------------- next part -------------- An HTML attachment was scrubbed... URL: From magnus at bodin.org Sat Oct 10 12:28:55 2015 From: magnus at bodin.org (=?UTF-8?Q?Magnus_Bodin_=E2=98=80?=) Date: Sat, 10 Oct 2015 19:28:55 +0200 Subject: Rights to the Emoji In-Reply-To: References: Message-ID: Here is an emoji that is CC-licensed. https://signalvnoise.com/posts/3395-neckbeard Let us know when you sell neckbeard pillows. On Sat, Oct 10, 2015 at 7:25 PM, Magnus Bodin ☀ wrote: > This might shed some light: > > http://words.steveklabnik.com/emoji-licensing > > On Sat, Oct 10, 2015 at 2:59 PM, patapatachakapon . < > bugraaydin1999 at gmail.com> wrote: > >> Hello, >> >> I work for a small company in Turkey. We would like to import/sell >> products that have pictures of Emoji on them (such as keychains, cups, etc.) >> here in Turkey. The Emoji we would like to use on our products are the >> ones that are titled Native on the chart that I've attached to this email. >> I would like to know whether or not it's required to buy the rights to these >> Emoji. Are Emoji copyrighted, or can they be used by anyone for design >> purposes? >> >> Thanks so much in advance! >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Sat Oct 10 05:14:50 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Sat, 10 Oct 2015 11:14:50 +0100 (BST) Subject: How can my research become implemented in a standardized manner?
In-Reply-To: <33365178.33329.1444395933996.JavaMail.defaultUser@defaultHost> References: <33365178.33329.1444395933996.JavaMail.defaultUser@defaultHost> Message-ID: <6041047.15822.1444472090118.JavaMail.defaultUser@defaultHost> Please note that I am on moderated post, so if this post does get sent to the Unicode mailing list it will be because the moderator has kindly agreed to it being circulated. I have recently made significant progress with my research in communication through the language barrier. The capabilities are greatly improved. On 7 October 2015 I submitted a document, hoping that it would become included in the Unicode Document Register. I have been informed that a group of people have examined the document and determined that it is out of scope for UTC. I am not seeking to question that decision. As an independent researcher, not representing an organization, nor in fact employed by any organization at all, I am trying to get the system standardized as an international standard. I feel that trying to produce first a widely-used system using a Private Use Area encoding is not a realistic practical goal, and even if it were practical, the result would be lots of legacy data. I feel that to become successful the system needs standardization and implementation to go forward together. So what to do? More generally, how are the format and the encoding of tagspaces to be carried out in the future? The document is available on the web at the present time in two places. There is a file available for download as an attachment in a forum post of 8 October 2015 in the High-Logic Gallery forum. Adding a direct link to the post is not at present possible using the particular email system that I am using. There is direct access in my family webspace. www.users.globalnet.co.uk/~ngo/two_tagspaces.pdf In addition I have deposited the document at the British Library. William Overington 10 October 2015 -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Sun Oct 11 16:20:34 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 11 Oct 2015 22:20:34 +0100 Subject: Counting Codepoints Message-ID: <20151011222034.2a1348ae@JRWUBU2> Is the number of codepoints in a UTF-16 string well defined? For example, which of the following two statements are true? (a) The ill-formed three code-unit Unicode 16-bit string <0xDC00, 0xD800, 0xDC20> contains two codepoints, U+DC00 and U+10020. (b) The ill-formed three code-unit Unicode 16-bit string <0xDC00, 0xD800, 0xDC20> contains three codepoints, U+DC00, U+D800 and U+DC20. Statement (a) is probably more useful, but I couldn't find anything to rule that statement (b) is false. Richard. From c933103 at gmail.com Sun Oct 11 18:03:18 2015 From: c933103 at gmail.com (gfb hjjhjh) Date: Mon, 12 Oct 2015 07:03:18 +0800 Subject: How can my research become implemented in a standardized manner? In-Reply-To: <6041047.15822.1444472090118.JavaMail.defaultUser@defaultHost> References: <33365178.33329.1444395933996.JavaMail.defaultUser@defaultHost> <6041047.15822.1444472090118.JavaMail.defaultUser@defaultHost> Message-ID: I believe using markup languages would be a better approach than getting some new characters. 2015/10/12 4:27 "William_J_G Overington" : > Please note that I am on moderated post, so if this post does get sent to > the Unicode mailing list it will be because the moderator has kindly agreed > to it being circulated.
> > I have recently made significant progress with my research in > communication through the language barrier. The capabilities are > greatly improved. > > On 7 October 2015 I submitted a document, hoping that it would become > included in the Unicode Document Register. > > I have been informed that a group of people have examined the document and > determined that it is out of scope for UTC. > > I am not seeking to question that decision. > > As an independent researcher, not representing an organization, nor in > fact employed by any organization at all, I am trying to get the system > standardized as an international standard. > > I feel that trying to produce first a widely-used system using a Private > Use Area encoding is not a realistic practical goal and even if it were > practical the result would be lots of legacy data. I feel that to become > successful the system needs standardization and implementation to go > forward together. > > So what to do? > > More generally, how are the format and the encoding of tagspaces to be > carried out in the future? > > The document is available on the web at the present time in two places. > > There is a file available for download as an attachment in a forum post of > 8 October 2015 in the High-Logic Gallery forum. > > Adding a direct link to the post is not at present possible using the > particular email system that I am using. > > There is direct access in my family webspace. > > www.users.globalnet.co.uk/~ngo/two_tagspaces.pdf > > In addition I have deposited the document at the British Library. > > William Overington > > 10 October 2015 > > > -------------- next part -------------- An HTML attachment was scrubbed... URL:

From verdy_p at wanadoo.fr Sun Oct 11 18:08:23 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 12 Oct 2015 01:08:23 +0200
Subject: Counting Codepoints
In-Reply-To: <20151011222034.2a1348ae@JRWUBU2>
References: <20151011222034.2a1348ae@JRWUBU2>
Message-ID:

Both statements are false. The ill-formed sequence <0xDC00, 0xD800, 0xDC20> is invalid as UTF-16, because it contains one code unit that is invalid in UTF-16 (the unpaired surrogate 0xDC00), followed by a single code point (U+10020). The three surrogate code points U+DC00, U+D800 and U+DC20 are NOT encoded (as they are not representable in valid UTF-16).

The number of codepoints in a **valid** UTF-16 string is perfectly well defined. If the encoded string is not valid UTF-16, then the number of codepoints in it is NOT defined (whether the invalid code units are dropped or replaced, and the number of replacement codepoints, can vary depending on the implementation; an implementation can also consider the whole string invalid and return no code points at all, or stop returning code points after the first error encountered and drop all the rest, or substitute all the rest with a single replacement character). Only the number of 16-bit code units is defined (this number does not depend on UTF-16 validity).

2015-10-11 23:20 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > Is the number of codepoints in a UTF-16 string well defined? > > For example, which of the following two statements are true? > > (a) The ill-formed three code-unit Unicode 16-bit string <0xDC00, > 0xD800, 0xDC20> contains two codepoints, U+DC00 and U+10020.
> > Statement (a) is probably more useful, but I couldn't find anything to > rule that statement (b) is false. > > Richard. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Sun Oct 11 18:32:28 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 12 Oct 2015 01:32:28 +0200 Subject: How can my research become implemented in a standardized manner? In-Reply-To: References: <33365178.33329.1444395933996.JavaMail.defaultUser@defaultHost> <6041047.15822.1444472090118.JavaMail.defaultUser@defaultHost> Message-ID: In fact this is not just inventing new characters, all this personal research is about inventing a new human language as well ! This cannot be done alone without people interested in communicatiung in that language. This is also more than new characters for the orthography, it is also about creating a grammar, and defining usages. All this work will not succeed without first developping a glossary (later a dictionnary, not necessarily bilingual) and educational supports, and opening it to discussions and evolutions. Consider the hard work that was done for creating Esperanto (even if it did not require inventing a new script, some characters were invented using uncommon combinations of existing Latin letters and diacritics), this is a very long way before the communciation can become really useful and starts dissimenating and being used to create real text with it. What is strange with that language for now is that it is only meant to be read, but it has no phonology at all. This is then very far from a human language (whose primary support has always been oral first before being written). Without the oral form, the language will not succeed * Consider Esperanto, it also has an oral form, more or less based on Polish and German phonologies, even if there are variable accents, but this is more or less stabilized by using a formal phonology, simplifying the actual phonetics for minimal mutual understanding by educated differenciation of groups of related phonems). * Consider Emojis: they basically represent basic nouns in Japanese or English, and they are more or less translatable. They also include a few some "adjectives" (e.g. skin color), and a basic syntax for them. There are some forms of compound nouns (e.g. FAMILY or COUPLE) linked by ZWJ rather than hyphens but their complete meaning is based on their component nouns (MAN, WOMAN, BOY, GIRL). They are successful because they adequately represent common nouns or expressions in many languages with more or less equal meaning, so they are easily read orally. 2015-10-12 1:03 GMT+02:00 gfb hjjhjh : > I believe using markup languages would be a better approach than getting > some new character. > 2015/10/12 4:27 "William_J_G Overington" : > > Please note that I am on moderated post, so if this post does get sent to >> the Unicode mailing list it will be because the moderator has kindly agreed >> to it being circulated. >> >> I have recently made significant progress with my research in >> communication through the language barrier. The capabilities are >> greatly improved. >> >> On 7 October 2015 I submitted a document, hoping that it would become >> included in the Unicode Document Register. >> >> I have been informed that a group of people have examined the document >> and determined that it is out of scope for UTC. >> >> I am not seeking to question that decision. 
>> >> As an independent researcher, not representing an organization, nor in >> fact employed by any organization at all, I am trying to get the system >> standardized as an international standard. >> >> I feel that trying to produce first a widely-used system using a Private >> Use Area encoding is not a realistic practical goal and even if it were >> practical the result would be lots of legacy data. I feel that to become >> successful the system needs standardization and implementation to go >> forward together. >> >> So what to do? >> >> More generally, how are the format and the encoding of tagspaces to be >> carried out in the future? >> >> The document is available on the web at the present time in two places. >> >> There is a file available for download as an attachment in a forum post >> of 8 October 2015 in the High-Logic Gallery forum. >> >> Adding a direct link to the post is not at present possible using the >> particular email system that I am using. >> >> There is direct access in my family webspace. >> >> www.users.globalnet.co.uk/~ngo/two_tagspaces.pdf >> >> In addition I have deposited the document at the British Library. >> >> William Overington >> >> 10 October 2015 >> >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From shervinafshar at gmail.com Sun Oct 11 19:51:05 2015 From: shervinafshar at gmail.com (Shervin Afshar) Date: Sun, 11 Oct 2015 17:51:05 -0700 Subject: Rights to the Emoji In-Reply-To: References: Message-ID: Those listed in the column titled "Native" come from the operating system (in your case, Mac OS X) and/or browser you are viewing that page on. One can assume that the right to those belong to the entity who develops those software. A safer approach for you would be to use symbols from Emoji One[1]; if you can attribute that project on your products, you can use them for free; if you can not do that, they require that you contact them for a custom paid license [2]. Also, with the paid license you are helping a project publishing content under Creative Common license. [1]: http://emojione.com/ [2]: http://emojione.com/faq#faq5 ? Shervin On Sat, Oct 10, 2015 at 5:59 AM, patapatachakapon . < bugraaydin1999 at gmail.com> wrote: > Hello, > > I work for a small company in Turkey. We would like to import/sell > products that have pictures of Emoji on them (such as keychains, cups etc.) > , here in Turkey. The Emoji we would like to use on our products are the > ones that are titled Native on the chart that I've attached to this email. > I would like to know whether or not it's required to buy the rights these > Emoji. Are Emoji copyrighted, or can they be used by anyone for design > purposes? > > Thanks so much in advance! > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenwhistler at att.net Sun Oct 11 23:36:49 2015 From: kenwhistler at att.net (Ken Whistler) Date: Sun, 11 Oct 2015 21:36:49 -0700 Subject: Counting Codepoints In-Reply-To: <20151011222034.2a1348ae@JRWUBU2> References: <20151011222034.2a1348ae@JRWUBU2> Message-ID: <561B38E1.5070007@att.net> On 10/11/2015 2:20 PM, Richard Wordingham wrote: > Is the number of codepoints in a UTF-16 string well defined? > > For example, which of the following two statements are true? > > (a) The ill-formed three code-unit Unicode 16-bit string <0xDC00, > 0xD800, 0xDC20> contains two codepoints, U+DC00 and U+10020. 
> > (b) The ill-formed three code-unit Unicode 16-bit string <0xDC00, > 0xD800, 0xDC20> contains three codepoints, U+DC00, U+D800 and U+DC20. > > Statement (a) is probably more useful, but I couldn't find anything to > rule that statement (b) is false. I think the correct answer is probably: (c) The ill-formed three code unit Unicode 16-bit string <0xDC00, 0xD800, 0xDC20> contains one code point, U+10020 and one uninterpreted (and uninterpretable) high surrogate code unit 0xDC00. In other words, I don't think it is useful or helpful to map isolated, uninterpretable surrogate code units *to* surrogate code points. Surrogate code points are an artifact of the code architecture. They are code points in the code space which *cannot* be represented in UTF-16, by definition. Any discussion about properties for surrogate code points is a matter of designing graceful API fallback for instances which have to deal with ill-formed strings and do *something*. I don't think that should extend to treating isolated surrogate code units as having interpretable status, *as if* they were valid code points represented in the string. It might be easier to get a handle on this if folks were to ask, instead how many code points are in the ill-formed Unicode 8-bit string <0x61, 0xF4, 0x90, 0x90, 0x90, 0x61>. 6 code units, yes, but how many code points? I'd say two code points and 4 uninterpretable, ill-formed UTF-8 code units, rather than any other possible answer. Basically, you get the same kind of answer if the ill-formed string were, instead <0x61, 0xED, 0xA0, 0x80, 0x61>. Two code points and 3 uninterpretable, ill-formed UTF-8 code units. That is a better answer than trying to map 0xED 0xA0 0x80 to U+D800 and then saying, oh, that is a surrogate code *point*. --Ken From duerst at it.aoyama.ac.jp Sun Oct 11 23:58:35 2015 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Mon, 12 Oct 2015 13:58:35 +0900 Subject: Rights to the Emoji In-Reply-To: References: Message-ID: <561B3DFB.2070605@it.aoyama.ac.jp> You can also design your own version of the emoji you want to use. [I'm not a lawyer, but as far as I understand,] what's protected is the individual design, not the idea of a "donut" or "frowning face" emoji as such. Regards, Martin. On 2015/10/12 09:51, Shervin Afshar wrote: > Those listed in the column titled "Native" come from the operating system > (in your case, Mac OS X) and/or browser you are viewing that page on. One > can assume that the right to those belong to the entity who develops those > software. > > A safer approach for you would be to use symbols from Emoji One[1]; if you > can attribute that project on your products, you can use them for free; if > you can not do that, they require that you contact them for a custom paid > license [2]. > > Also, with the paid license you are helping a project publishing content > under Creative Common license. > > [1]: http://emojione.com/ > [2]: http://emojione.com/faq#faq5 > > ? Shervin > > On Sat, Oct 10, 2015 at 5:59 AM, patapatachakapon . < > bugraaydin1999 at gmail.com> wrote: > >> Hello, >> >> I work for a small company in Turkey. We would like to import/sell >> products that have pictures of Emoji on them (such as keychains, cups etc.) >> , here in Turkey. The Emoji we would like to use on our products are the >> ones that are titled Native on the chart that I've attached to this email. >> I would like to know whether or not it's required to buy the rights these >> Emoji. 
Are Emoji copyrighted, or can they be used by anyone for design >> purposes? >> >> Thanks so much in advance! >> > From mark at macchiato.com Mon Oct 12 00:46:51 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Mon, 12 Oct 2015 07:46:51 +0200 Subject: Rights to the Emoji In-Reply-To: References: Message-ID: The twitter images are open sourced, I believe. {phone} On Oct 12, 2015 02:56, "Shervin Afshar" wrote: > Those listed in the column titled "Native" come from the operating system > (in your case, Mac OS X) and/or browser you are viewing that page on. One > can assume that the right to those belong to the entity who develops those > software. > > A safer approach for you would be to use symbols from Emoji One[1]; if you > can attribute that project on your products, you can use them for free; if > you can not do that, they require that you contact them for a custom paid > license [2]. > > Also, with the paid license you are helping a project publishing content > under Creative Common license. > > [1]: http://emojione.com/ > [2]: http://emojione.com/faq#faq5 > > ? Shervin > > On Sat, Oct 10, 2015 at 5:59 AM, patapatachakapon . < > bugraaydin1999 at gmail.com> wrote: > >> Hello, >> >> I work for a small company in Turkey. We would like to import/sell >> products that have pictures of Emoji on them (such as keychains, cups etc.) >> , here in Turkey. The Emoji we would like to use on our products are the >> ones that are titled Native on the chart that I've attached to this email. >> I would like to know whether or not it's required to buy the rights these >> Emoji. Are Emoji copyrighted, or can they be used by anyone for design >> purposes? >> >> Thanks so much in advance! >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From c933103 at gmail.com Mon Oct 12 06:45:37 2015 From: c933103 at gmail.com (gfb hjjhjh) Date: Mon, 12 Oct 2015 19:45:37 +0800 Subject: How can my research become implemented in a standardized manner? In-Reply-To: <10696749.12128.1444639121696.JavaMail.defaultUser@defaultHost> References: <33365178.33329.1444395933996.JavaMail.defaultUser@defaultHost> <6041047.15822.1444472090118.JavaMail.defaultUser@defaultHost> <10696749.12128.1444639121696.JavaMail.defaultUser@defaultHost> Message-ID: This proposal is, in my opinion, similar to another discussion about giving unicode character to food allergy symbol that happened few months ago on this mailing list, which both idea want to use unicode characters to overcome language barrier, just that that proposal were about those icon while this one is about written text.If my memory is now working perfectly(or you can read through the archive for the list), back then people mentioned that to get something new into Unicode you need to first make it into an international standard and make it being used. Therefore to get a character into unicode traction is needed, and if you don't want PUA then using markup languages is something that pop out of my mind that can help develop those tractions. International standard does not necessarily mean ISO standard. And for encoding that also depend on other sources, the current system for flag would be dependent on update of operating system or font file in operating system to reflect the change in software, and that is if those system/font developers really update their files as soon as source files being changed. 
2015/10/12 16:38 "William_J_G Overington" : > > I believe using markup languages would be a better approach than getting > some new character. > > Thank you for posting. > > That would make an interesting discussion, yet is off-topic for this > thread. > > The topic for this thread is about the encoding process, not about the > merits or otherwise of the particular encoding proposal. > > The flags tagspace was encoded by reference to an existing ISO standard. > > http://www.unicode.org/L2/L2015/15145r-add-regional-ind.pdf > > Yet if the tagspace for a new proposal needs to be defined and there is no > existing ISO standard to which reference can be made, how is that tagspace > to become defined, by what process, by which committee, already existing or > new? > > Also, if the complete encoding depends on both of the encoding of a base > character into Unicode and of the encoding of a tagspace, so that both > items can be applied together by an end user, what is the infrastructure > mechanism to be so the complete encoding can take place? > > William Overington > > 12 October 2015 > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Mon Oct 12 07:42:57 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Mon, 12 Oct 2015 14:42:57 +0200 Subject: Counting Codepoints In-Reply-To: <561B38E1.5070007@att.net> References: <20151011222034.2a1348ae@JRWUBU2> <561B38E1.5070007@att.net> Message-ID: I agree with Ken on "Any discussion about properties for surrogate code points is a matter of designing graceful API fallback for instances which have to deal with ill-formed strings and do *something*.", and here's be my advice based on that. You want the code point count to reflect the same count that you would get if you were to "sanitize" the string by fixing the isolated surrogates when converting to valid UTF-16 from a a 16-bit Unicode String. Sanitizing *never* should involve deletion (for security reasons). The best practice is to replace them by FFFD, according to the guidelines in TUS Chapter 3. Constraints on Conversion Processes And you want it to reflect the same code point count that you would get in common APIs that traverse 16-bit Unicode String. And I don't know of any code point iterators that just *skip* the isolates; they are typically returned as single code points. If these are not all aligned, then all heck breaks loose: you are letting yourself in for code breakage and/or security problems. So the corresponding code point count would just return a count of 1 for an isolated surrogate. UTF-8 is gummier. I'd return according to whatever the standard practice in the programming environment for "sanitizing" output is. That could be the "maximal subpart" approach in TUS Ch. 3, or it could be an alternative approach: consistency with the approach in use is the most important feature. Mark *? Il meglio ? l?inimico del bene ?* On Mon, Oct 12, 2015 at 6:36 AM, Ken Whistler wrote: > > > On 10/11/2015 2:20 PM, Richard Wordingham wrote: > >> Is the number of codepoints in a UTF-16 string well defined? >> >> For example, which of the following two statements are true? >> >> (a) The ill-formed three code-unit Unicode 16-bit string <0xDC00, >> 0xD800, 0xDC20> contains two codepoints, U+DC00 and U+10020. >> >> (b) The ill-formed three code-unit Unicode 16-bit string <0xDC00, >> 0xD800, 0xDC20> contains three codepoints, U+DC00, U+D800 and U+DC20. 
>> >> Statement (a) is probably more useful, but I couldn't find anything to >> rule that statement (b) is false. >> > > I think the correct answer is probably: > > (c) The ill-formed three code unit Unicode 16-bit string > <0xDC00, 0xD800, 0xDC20> contains one code point, U+10020 and > one uninterpreted (and uninterpretable) high surrogate > code unit 0xDC00. > > In other words, I don't think it is useful or helpful to map isolated, > uninterpretable surrogate code units *to* surrogate code points. > Surrogate code points are an artifact of the code architecture. They > are code points in the code space which *cannot* be represented > in UTF-16, by definition. > > Any discussion about properties for surrogate code points is a > matter of designing graceful API fallback for instances which > have to deal with ill-formed strings and do *something*. I don't > think that should extend to treating isolated surrogate code > units as having interpretable status, *as if* they were valid > code points represented in the string. > > It might be easier to get a handle on this if folks were to ask, instead > how many code points are in the ill-formed Unicode 8-bit > string <0x61, 0xF4, 0x90, 0x90, 0x90, 0x61>. 6 code units, yes, > but how many code points? I'd say two code points and > 4 uninterpretable, ill-formed UTF-8 code units, rather than > any other possible answer. > > Basically, you get the same kind of answer if the ill-formed string > were, instead <0x61, 0xED, 0xA0, 0x80, 0x61>. Two code points > and 3 uninterpretable, ill-formed UTF-8 code units. That is a > better answer than trying to map 0xED 0xA0 0x80 to U+D800 > and then saying, oh, that is a surrogate code *point*. > > --Ken > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shervinafshar at gmail.com Mon Oct 12 09:07:13 2015 From: shervinafshar at gmail.com (Shervin Afshar) Date: Mon, 12 Oct 2015 07:07:13 -0700 Subject: Rights to the Emoji In-Reply-To: References: Message-ID: Twemoji are Open Source, but published under CC-BY and that license requires attribution which might be challenging in this specific use case. On Oct 11, 2015 10:46 PM, "Mark Davis ??" wrote: > The twitter images are open sourced, I believe. > > {phone} > On Oct 12, 2015 02:56, "Shervin Afshar" wrote: > >> Those listed in the column titled "Native" come from the operating system >> (in your case, Mac OS X) and/or browser you are viewing that page on. One >> can assume that the right to those belong to the entity who develops those >> software. >> >> A safer approach for you would be to use symbols from Emoji One[1]; if >> you can attribute that project on your products, you can use them for free; >> if you can not do that, they require that you contact them for a custom >> paid license [2]. >> >> Also, with the paid license you are helping a project publishing content >> under Creative Common license. >> >> [1]: http://emojione.com/ >> [2]: http://emojione.com/faq#faq5 >> >> ? Shervin >> >> On Sat, Oct 10, 2015 at 5:59 AM, patapatachakapon . < >> bugraaydin1999 at gmail.com> wrote: >> >>> Hello, >>> >>> I work for a small company in Turkey. We would like to import/sell >>> products that have pictures of Emoji on them (such as keychains, cups etc.) >>> , here in Turkey. The Emoji we would like to use on our products are the >>> ones that are titled Native on the chart that I've attached to this email. >>> I would like to know whether or not it's required to buy the rights these >>> Emoji. 
Are Emoji copyrighted, or can they be used by anyone for design >>> purposes? >>> >>> Thanks so much in advance! >>> >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From nikiselken at gmail.com Mon Oct 12 09:17:55 2015 From: nikiselken at gmail.com (Nicole Selken) Date: Mon, 12 Oct 2015 10:17:55 -0400 Subject: Rights to the Emoji In-Reply-To: References: Message-ID: I would contact Apple about it. Many Ads on TV etc... are using this Emoji set. So there must be a way to get access, or they do not care. Thanks, Niki Selken Working on: www.nikiselken.com On Mon, Oct 12, 2015 at 10:07 AM, Shervin Afshar wrote: > Twemoji are Open Source, but published under CC-BY and that license > requires attribution which might be challenging in this specific use case. > On Oct 11, 2015 10:46 PM, "Mark Davis ??" wrote: > >> The twitter images are open sourced, I believe. >> >> {phone} >> On Oct 12, 2015 02:56, "Shervin Afshar" wrote: >> >>> Those listed in the column titled "Native" come from the operating >>> system (in your case, Mac OS X) and/or browser you are viewing that page >>> on. One can assume that the right to those belong to the entity who >>> develops those software. >>> >>> A safer approach for you would be to use symbols from Emoji One[1]; if >>> you can attribute that project on your products, you can use them for free; >>> if you can not do that, they require that you contact them for a custom >>> paid license [2]. >>> >>> Also, with the paid license you are helping a project publishing content >>> under Creative Common license. >>> >>> [1]: http://emojione.com/ >>> [2]: http://emojione.com/faq#faq5 >>> >>> ? Shervin >>> >>> On Sat, Oct 10, 2015 at 5:59 AM, patapatachakapon . < >>> bugraaydin1999 at gmail.com> wrote: >>> >>>> Hello, >>>> >>>> I work for a small company in Turkey. We would like to import/sell >>>> products that have pictures of Emoji on them (such as keychains, cups etc.) >>>> , here in Turkey. The Emoji we would like to use on our products are the >>>> ones that are titled Native on the chart that I've attached to this email. >>>> I would like to know whether or not it's required to buy the rights these >>>> Emoji. Are Emoji copyrighted, or can they be used by anyone for design >>>> purposes? >>>> >>>> Thanks so much in advance! >>>> >>> >>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Mon Oct 12 03:38:41 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Mon, 12 Oct 2015 09:38:41 +0100 (BST) Subject: How can my research become implemented in a standardized manner? In-Reply-To: References: <33365178.33329.1444395933996.JavaMail.defaultUser@defaultHost> <6041047.15822.1444472090118.JavaMail.defaultUser@defaultHost> Message-ID: <10696749.12128.1444639121696.JavaMail.defaultUser@defaultHost> > I believe using markup languages would be a better approach than getting some new character. Thank you for posting. That would make an interesting discussion, yet is off-topic for this thread. The topic for this thread is about the encoding process, not about the merits or otherwise of the particular encoding proposal. The flags tagspace was encoded by reference to an existing ISO standard. 
http://www.unicode.org/L2/L2015/15145r-add-regional-ind.pdf Yet if the tagspace for a new proposal needs to be defined and there is no existing ISO standard to which reference can be made, how is that tagspace to become defined, by what process, by which committee, already existing or new? Also, if the complete encoding depends on both of the encoding of a base character into Unicode and of the encoding of a tagspace, so that both items can be applied together by an end user, what is the infrastructure mechanism to be so the complete encoding can take place? William Overington 12 October 2015 -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Mon Oct 12 03:48:31 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Mon, 12 Oct 2015 09:48:31 +0100 (BST) Subject: How can my research become implemented in a standardized manner? In-Reply-To: References: <33365178.33329.1444395933996.JavaMail.defaultUser@defaultHost> <6041047.15822.1444472090118.JavaMail.defaultUser@defaultHost> Message-ID: <19969314.13377.1444639711548.JavaMail.defaultUser@defaultHost> Bonjour Philippe Thank you for posting. > In fact this is not just inventing new characters, all this personal research is about inventing a new human language as well ! Actually it is not. An end user would only need to use his or her own language using cascading menus. Everything else would be automated by software. However, whilst that would make an interesting discussion and I have answered, it is off-topic for this thread. The topic for this thread is about the encoding process, not about the merits or otherwise of the particular encoding proposal. The flags tagspace was encoded by reference to an existing ISO standard. http://www.unicode.org/L2/L2015/15145r-add-regional-ind.pdf Yet if the tagspace for a new proposal needs to be defined and there is no existing ISO standard to which reference can be made, how is that tagspace to become defined, by what process, by which committee, already existing or new? Also, if the complete encoding depends on both of the encoding of a base character into Unicode and of the encoding of a tagspace, so that both items can be applied together by an end user, what is the infrastructure mechanism to be so the complete encoding can take place? William Overington 12 October 2015 -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Mon Oct 12 10:29:13 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 12 Oct 2015 17:29:13 +0200 Subject: Counting Codepoints In-Reply-To: References: <20151011222034.2a1348ae@JRWUBU2> <561B38E1.5070007@att.net> Message-ID: 2015-10-12 14:42 GMT+02:00 Mark Davis ?? : > If these are not all aligned, then all heck breaks loose: you are letting > yourself in for code breakage and/or security problems. > > So the corresponding code point count would just return a count of 1 for > an isolated surrogate. > But the behavior in this case is absolutely not defined, and applications are free to do what they want when they encounter them. There's not even any warranty that any further (correctly encoded) code point will be returned, even if a replacement character like U+FFFE is returned, it could replace all the rest. So the count of 1 is possible for the first isolated surrogate but all the rest count count as 0 as well, or all the further characters could be replaced by U+FFFE independantly of what they initially represented. 
This would also be a "sanitized" result.

TUS gives freedom of choice to applications. There is absolutely no guarantee that all possible "sanitized" results will be the same for all applications, and TUS does not even mandate which replacement character to use (not necessarily U+FFFE; it could just as well be an ASCII '?' or a C0 control, when the result is further processed by an application converting it to some legacy 7-bit or 8-bit charset).

My opinion is that the only really safe result is to not return any count of code points but instead throw an error. Counting code points with a function returning an integer is only valid if the UTF-16 input is actually a valid representation of code points: an application using that integer could allocate a processing buffer of that size and expect to read exactly that number of code points into it, leaving some positions in that buffer uninitialized otherwise; or it could assume that the input was left untouched and then get an unexpected mismatch of a digital signature.

If your code-point-counting function returns an integer and counts each lone surrogate as 1, it assumes that exactly one code point will be returned for each lone surrogate, and it should document that clearly, meaning that the result is only valid if it matches the behaviour of the actual input scanner. In that case the function will never fail and throw an exception. But between two implementations the result of the scanner could still be different, because the replacement character is not specified. If that "sanitized" result string is then used to generate a URI, the URI is also unpredictable and will vary between implementations, as will its effective length. If it is used to generate an identifier granting some new access, such as a user name, several new user names could be generated from the same input.

So in all cases using replacements will also create security problems. This will not happen if you don't return any result but throw an exception (the counting function should document this exception so that it is not unexpectedly thrown and left unhandled, causing the program to abort prematurely in an unsafe state, including losing other data or leaving a transaction elsewhere in an incoherent state).

For all programs taking some standard UTF input, the input scanner or processing functions MUST be prepared to handle the encoding error exception, which is a result to be expected just as much as the return of a value or the execution of some code! Sanitization is possible, but it is not described in the standard, and there are several conflicting ways of doing it; it should be a separate subprocess, documented separately.

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From verdy_p at wanadoo.fr Mon Oct 12 10:33:07 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 12 Oct 2015 17:33:07 +0200
Subject: Counting Codepoints
In-Reply-To:
References: <20151011222034.2a1348ae@JRWUBU2> <561B38E1.5070007@att.net>
Message-ID:

Replace U+FFFE by U+FFFD in my message (but there are applications that also prefer using non-characters for those replacements; this is an additional alternative, as U+FFFE has a valid representation as well in UTF-16). U+FFFD is not the only possible replacement, even if it is the recommended one (by a "best practice", which is not a "requirement" for conformance purposes).
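To make the two competing policies in this thread concrete — counting each lone surrogate as one (replaceable) code point, as Mark suggests, versus refusing to return a count at all, as argued above — here is a minimal Java sketch. Both method names are illustrative assumptions, not from any standard API:

static int countReplacing(String s) {
    int count = 0;
    for (int i = 0; i < s.length(); ) {
        int cp = s.codePointAt(i);     // a lone surrogate is returned as itself
        i += Character.charCount(cp);  // advances by 1 or 2 code units
        count++;                       // each lone surrogate counts as one replacement
    }
    return count;
}

static int countStrict(String s) {
    int count = 0;
    for (int i = 0; i < s.length(); ) {
        int cp = s.codePointAt(i);
        if (cp >= 0xD800 && cp <= 0xDFFF)  // unpaired surrogate: not valid UTF-16
            throw new IllegalArgumentException("ill-formed UTF-16 at index " + i);
        i += Character.charCount(cp);
        count++;
    }
    return count;
}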
2015-10-12 17:29 GMT+02:00 Philippe Verdy : > 2015-10-12 14:42 GMT+02:00 Mark Davis ?? : > >> If these are not all aligned, then all heck breaks loose: you are letting >> yourself in for code breakage and/or security problems. >> >> So the corresponding code point count would just return a count of 1 for >> an isolated surrogate. >> > > But the behavior in this case is absolutely not defined, and applications > are free to do what they want when they encounter them. There's not even > any warranty that any further (correctly encoded) code point will be > returned, even if a replacement character like U+FFFE is returned, it could > replace all the rest. > > So the count of 1 is possible for the first isolated surrogate but all the > rest count count as 0 as well, or all the further characters could be > replaced by U+FFFE independantly of what they initially represented. This > would also be a "sanitized" result. > > TUS gives freedom of choice in application. There's absolutely no warranty > that all possible "sanitized" results will be the same for all > applications, and TUS does not even mandate which replacement character to > use (not necessarily U+FFFE, it could as well be an ASCII '?' character or > a C0 or control, when further processed to an application > converting the result to some legacy 7-bit or 8-bit charset). > > My opinion is that the only really safe result is to not return any count > of code points but instead throw an error (counting code points and with a > function returning an integer is only valid if the UTF-16 input is actually > a valid representation of code points, you cannot return a single integer > as the application using that integer could expect to allocate some > processing buffer, and then get this exact number of code points when > reading the data into some processing buffer, and could leave initialized > some positions in that buffer, or the application could assume that the > input was left untouched and could then get an unexpected mismatch of > digital signature). > > If your function counting codepoints and returning an integer counts those > lone surrogates as 1, it assumes that exactly one codepoint will be > returned for each lone surrogate, and it should document that clearly, > meaning that the result is only valid if this matches the results of the > actual input scanner. In that case that function will never fail and throw > an exception. But between two implementations the result of the scanner > could still be different because the replacement character is not > specified. If that result "sanitized" string is then used to generate an > URI, the URI is also unpredictable and will vary between implementations, > as well as its effective length. If it is used to generate an identifier > granting some new access, such as a user name, several new user names > could be generated from the same input. > > So in all cases using replacements will also create security problems. > This will not happen if you don't return any result but throw an exception > (that counting function should document this exception so that it is not > unexpectedly thrown and left unhandled, causing the program to abort > prematurely in an unsafe state including loosing other data or transaction > elsewhere in an incoherent state). 
> > For all programs taking some standard UTF input, the input scanner or > processing functions MUST be prepared to handle the encoding error > exception, which is an result expected equally to the return of a value or > the execution of some code ! Sanitization is possible, but not described in > the standard, and there are several conflict ways of doing it, it should be > a separate subprocess documented separately. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Mon Oct 12 14:38:18 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 12 Oct 2015 20:38:18 +0100 Subject: Counting Codepoints In-Reply-To: <561B38E1.5070007@att.net> References: <20151011222034.2a1348ae@JRWUBU2> <561B38E1.5070007@att.net> Message-ID: <20151012203818.7fe468d3@JRWUBU2> On Sun, 11 Oct 2015 21:36:49 -0700 Ken Whistler wrote: > I think the correct answer is probably: > > (c) The ill-formed three code unit Unicode 16-bit string > <0xDC00, 0xD800, 0xDC20> contains one code point, U+10020 and > one uninterpreted (and uninterpretable) high surrogate > code unit 0xDC00. > > In other words, I don't think it is useful or helpful to map isolated, > uninterpretable surrogate code units *to* surrogate code points. > Surrogate code points are an artifact of the code architecture. They > are code points in the code space which *cannot* be represented > in UTF-16, by definition. > > Any discussion about properties for surrogate code points is a > matter of designing graceful API fallback for instances which > have to deal with ill-formed strings and do *something*. I don't > think that should extend to treating isolated surrogate code > units as having interpretable status, *as if* they were valid > code points represented in the string. Graceful fallback is exactly where the issue arises. Throwing an exception is not a useful answer to the question of how many code points a 'Unicode string' (not a 'UTF-16 string') contains. The question can arise when one is following an instruction to advance x codepoints; the usual presumption is that the preferred response is to advance exactly x scalar values and not advance over anything else. > It might be easier to get a handle on this if folks were to ask, > instead how many code points are in the ill-formed Unicode 8-bit > string <0x61, 0xF4, 0x90, 0x90, 0x90, 0x61>. 6 code units, yes, > but how many code points? I'd say two code points and > 4 uninterpretable, ill-formed UTF-8 code units, rather than > any other possible answer. In this case I'd say three 'somethings', and define 'something' accordingly. There are different ideas as to what a 'something' should be. Having a clear definition matters when moving backwards and forwards through a Unicode 8-bit string. > Basically, you get the same kind of answer if the ill-formed string > were, instead <0x61, 0xED, 0xA0, 0x80, 0x61>. Two code points > and 3 uninterpretable, ill-formed UTF-8 code units. That is a > better answer than trying to map 0xED 0xA0 0x80 to U+D800 > and then saying, oh, that is a surrogate code *point*. A simple scenario is a filter that takes in a single byte (or EOF) at a time and returns a scalar value, 'no character yet', 'corrupt' or 'end of text'. It is a significant complication for it to have to emit sequences of values indicating uninterpretable bytes. I've found it much easier to treat bad sequences of UTF-8 code units that are bad by reason of their length and indicated scalar value as a single entity. 
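As a concreteness check, here is a minimal Java sketch of the byte-at-a-time filter described above, under that 'single entity' policy. The class, the constants, and the convention that a byte terminating a bad sequence must be re-fed by the caller are all illustrative assumptions, not part of any standard API:

final class Utf8Filter {
    static final int NEED_MORE = -1, CORRUPT = -2, END = -3;
    private int pending;    // scalar value accumulated so far
    private int remaining;  // continuation bytes still expected
    private int min;        // smallest scalar value legal for this sequence length

    // Feed the next byte (0..255), or -1 for EOF. Returns a scalar value,
    // NEED_MORE ('no character yet'), CORRUPT, or END ('end of text').
    int feed(int b) {
        if (b < 0) {                       // EOF
            int r = (remaining == 0) ? END : CORRUPT;
            remaining = 0;
            return r;
        }
        if (remaining > 0) {               // inside a multi-byte sequence
            if ((b & 0xC0) != 0x80) {      // not a continuation byte: the whole
                remaining = 0;             // truncated sequence is one entity;
                return CORRUPT;            // the caller must re-feed b afterwards
            }
            pending = (pending << 6) | (b & 0x3F);
            if (--remaining > 0) return NEED_MORE;
            // Reject non-shortest forms, surrogates, and values above U+10FFFF,
            // reporting the complete bad sequence as a single entity.
            if (pending < min || (0xD800 <= pending && pending <= 0xDFFF)
                    || pending > 0x10FFFF) return CORRUPT;
            return pending;
        }
        if (b < 0x80) return b;            // ASCII
        if (b <= 0xBF) return CORRUPT;     // lone continuation byte
        if (b <= 0xDF) { pending = b & 0x1F; remaining = 1; min = 0x80;    return NEED_MORE; }
        if (b <= 0xEF) { pending = b & 0x0F; remaining = 2; min = 0x800;   return NEED_MORE; }
        if (b <= 0xF7) { pending = b & 0x07; remaining = 3; min = 0x10000; return NEED_MORE; }
        return CORRUPT;                    // F8..FF can start nothing valid
    }
}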
Treating bad sequences as a single entity simplifies moving forwards and backwards through strings to just detecting non-continuation bytes and limiting traversal through runs of continuation bytes. Otherwise, one must also check the following continuation byte for a valid range. For example, if one starts at position 5 in your first example, just before the second 0x61, one faces the following logic when moving back one codepoint:

1) Provisionally back up to position 1, just before 0xF4.
2) Confirm that one has skipped no more than 3 continuation bytes.
3) Confirm that at least 3 continuation bytes follow the 0xF4.
4) Examine the first continuation byte, 0x90, and realise that it is not a legal value there.
5) Change to moving back one byte, arriving at position 4, just before the last 0x90.

It gets even more complicated if one follows the "maximal subpart" approach of TUS Ch. 3.

By contrast, one can even report the bad sequences in a 21-bit extension of Unicode. For example, one could use bits 20:16 to encode the problem, e.g.:

0-16 => Valid scalar value (excludes 0xD800 to 0xDFFF)

1) Numbers that look like scalar values:

1.1) Value not a scalar value:
17 => 11xxxx (start F4 9y)
18 => 12xxxx (start F4 Ay)
19 => 13xxxx (start F4 By)
20 => Surrogate codepoint (start ED Ay or ED By) (2^11 seqq.)

1.2) Non-shortest form:
21 => 4 bytes long (start F0 8y) (image of BMP)
22 => 3 bytes long (start E0 8y or E0 9y) (2^11 seqq.)
23 => 2 bytes long (start C0 or C1) (image of ASCII)*

2) Uninterpretable sequences:
24 => Declared length 4 but actually 3 long (5 * 2^12 seqq.)
25 => Declared length 4 but actually 2 long (5 * 2^6 seqq.)
26 => Declared length 3 but actually 2 long (2^10 seqq.)
27 => Non-ASCII lone bytes (2^7 seqq.)*

* Not necessarily composed of UTF-8 code units.

In this scheme, <0x61, 0xF4, 0x90, 0x90, 0x90, 0x61> would be analysed as <U+0061, V+110410, U+0061>, and the application could decide what to do with V+110410. It'd probably just be replaced by U+FFFD.

Richard.

From richard.wordingham at ntlworld.com Mon Oct 12 15:23:09 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 12 Oct 2015 21:23:09 +0100
Subject: Counting Codepoints
In-Reply-To:
References: <20151011222034.2a1348ae@JRWUBU2> <561B38E1.5070007@att.net>
Message-ID: <20151012212309.42dd707c@JRWUBU2>

On Mon, 12 Oct 2015 17:29:13 +0200 Philippe Verdy wrote: > But between two implementations > the result of the scanner could still be different because the > replacement character is not specified. If that result "sanitized" > string is then used to generate an URI, the URI is also unpredictable > and will vary between implementations, as well as its effective > length. If it is used to generate an identifier granting some new > access, such as a user name, several new user names could be > generated from the same input.

TUS 8.0 Section 3 Requirement C10 has the following wise words in its final paragraph:

"However, such repair of mangled data is a special case, and it must not be used in circumstances where it would cause security problems."

Richard.
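As a concrete footnote to that warning: once replacement is applied, two distinct ill-formed inputs can collapse to the same repaired string, which is exactly what makes replacement unsafe for identifier generation. This small illustration relies only on the JDK String constructor's standard replacement behaviour:

byte[] a = {0x61, (byte) 0xC0, 0x61};   // 'a', one ill-formed byte, 'a'
byte[] b = {0x61, (byte) 0xC1, 0x61};   // a different ill-formed input
String sa = new String(a, java.nio.charset.StandardCharsets.UTF_8);   // "a\uFFFDa"
String sb = new String(b, java.nio.charset.StandardCharsets.UTF_8);   // "a\uFFFDa"
System.out.println(sa.equals(sb));      // true: one user name from two inputs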
From verdy_p at wanadoo.fr Mon Oct 12 17:49:29 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 13 Oct 2015 00:49:29 +0200 Subject: Counting Codepoints In-Reply-To: <20151012203818.7fe468d3@JRWUBU2> References: <20151011222034.2a1348ae@JRWUBU2> <561B38E1.5070007@att.net> <20151012203818.7fe468d3@JRWUBU2> Message-ID: 2015-10-12 21:38 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > On Sun, 11 Oct 2015 21:36:49 -0700 > Ken Whistler wrote: > > > I think the correct answer is probably: > > > > (c) The ill-formed three code unit Unicode 16-bit string > > <0xDC00, 0xD800, 0xDC20> contains one code point, U+10020 and > > one uninterpreted (and uninterpretable) high surrogate > > code unit 0xDC00. > > > > In other words, I don't think it is useful or helpful to map isolated, > > uninterpretable surrogate code units *to* surrogate code points. > > Surrogate code points are an artifact of the code architecture. They > > are code points in the code space which *cannot* be represented > > in UTF-16, by definition. > > > > Any discussion about properties for surrogate code points is a > > matter of designing graceful API fallback for instances which > > have to deal with ill-formed strings and do *something*. I don't > > think that should extend to treating isolated surrogate code > > units as having interpretable status, *as if* they were valid > > code points represented in the string. > > Graceful fallback is exactly where the issue arises. Throwing an > exception is not a useful answer to the question of how many code > points a 'Unicode string' (not a 'UTF-16 string') contains. > It really is a **useful** answer because there's actually no correct answer, unless you assume some (not clearly defined) sanitization process (removal or part of the text means you give an answer about a different text, substitution is also not clearly defined, you could remove everything after the first error encountered). If you get an invalid UTF-16 string, and caught an exception, this is a sign that it is not UTF-16, and very frequently something else. The application may want to retry with another encoding, possibly using heuristic guessers, but the heuristic will only give a *probable answer*. If this probable answer is still UTF-16, the application may or may not want to alter the input text and instruct the function to perform a specific "sanitization", but this process is NOT defined in the UTF-16 specification itself, the result will be a local-only decision, which may not match what other systems will do (other systemls may fallback to an encoding that produces no error at all such as ISO8859-1 or a default encoding of the system such as CP437. But as this wil frequently produce "mojibake", it is best to notice it, log that for later manual processing (if needed) and discard that text completely as invalid (the standard behavior for UTF-16 for conforming applications). Any sanitization will be errorprone as it will always be an heuristic, users should have some visible notification that the input was invalid, and the "correction" should not be automated unless the users really ask for it and the application offers a choice of options. The minimum being that the application should offer a visual inspection to the user for each option. But we are then completely out of scope of the UTF-16 standard itself. -------------- next part -------------- An HTML attachment was scrubbed... 
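Philippe's retry-with-another-encoding pattern can be expressed directly with the JDK's charset API: decode strictly first, and only on failure fall back to a guessed legacy charset. A hedged sketch — the choice of ISO-8859-1 as the fallback is an arbitrary assumption, as the message above notes:

import java.nio.ByteBuffer;
import java.nio.charset.*;

static String decodeWithFallback(byte[] bytes) {
    CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
    try {
        return strict.decode(ByteBuffer.wrap(bytes)).toString();
    } catch (CharacterCodingException e) {
        // Not valid UTF-8: log it for later manual inspection, then
        // retry with a legacy charset that cannot fail to decode.
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}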
URL: From petercon at microsoft.com Mon Oct 12 18:00:38 2015 From: petercon at microsoft.com (Peter Constable) Date: Mon, 12 Oct 2015 23:00:38 +0000 Subject: Rights to the Emoji In-Reply-To: <561B3DFB.2070605@it.aoyama.ac.jp> References: <561B3DFB.2070605@it.aoyama.ac.jp> Message-ID: Exactly: specific designs are subject to license terms determined by the original designer, which are liberal in some cases and not in others. But the concept of a such-and-such emoji and it's encoded representation are not an issue. Peter -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Martin J. D?rst Sent: Sunday, October 11, 2015 9:59 PM To: patapatachakapon . Cc: Shervin Afshar ; unicode at unicode.org Subject: Re: Rights to the Emoji You can also design your own version of the emoji you want to use. [I'm not a lawyer, but as far as I understand,] what's protected is the individual design, not the idea of a "donut" or "frowning face" emoji as such. Regards, Martin. On 2015/10/12 09:51, Shervin Afshar wrote: > Those listed in the column titled "Native" come from the operating > system (in your case, Mac OS X) and/or browser you are viewing that > page on. One can assume that the right to those belong to the entity > who develops those software. > > A safer approach for you would be to use symbols from Emoji One[1]; if > you can attribute that project on your products, you can use them for > free; if you can not do that, they require that you contact them for a > custom paid license [2]. > > Also, with the paid license you are helping a project publishing > content under Creative Common license. > > [1]: http://emojione.com/ > [2]: http://emojione.com/faq#faq5 > > ? Shervin > > On Sat, Oct 10, 2015 at 5:59 AM, patapatachakapon . < > bugraaydin1999 at gmail.com> wrote: > >> Hello, >> >> I work for a small company in Turkey. We would like to import/sell >> products that have pictures of Emoji on them (such as keychains, cups >> etc.) , here in Turkey. The Emoji we would like to use on our >> products are the ones that are titled Native on the chart that I've attached to this email. >> I would like to know whether or not it's required to buy the rights >> these Emoji. Are Emoji copyrighted, or can they be used by anyone for >> design purposes? >> >> Thanks so much in advance! >> > From prosfilaes at gmail.com Mon Oct 12 18:35:32 2015 From: prosfilaes at gmail.com (David Starner) Date: Mon, 12 Oct 2015 23:35:32 +0000 Subject: Counting Codepoints In-Reply-To: <20151012212309.42dd707c@JRWUBU2> References: <20151011222034.2a1348ae@JRWUBU2> <561B38E1.5070007@att.net> <20151012212309.42dd707c@JRWUBU2> Message-ID: Any system that exposes Unicode strings (not UTF-16 strings) cannot have two surrogates merge when two strings are appended. There's nothing in the Unicode standard that says that should happen for a string in an arbitrary format, and it's unreasonable behavior for a string. Thus a Unicode string simply can't be in UTF-16 format internally with unpaired surrogates; a Unicode string in a programmer opaque format must do something with broken data on input. On 1:27pm, Mon, Oct 12, 2015 Richard Wordingham < richard.wordingham at ntlworld.com> wrote: > On Mon, 12 Oct 2015 17:29:13 +0200 > Philippe Verdy wrote: > > > But between two implementations > > the result of the scanner could still be different because the > > replacement character is not specified. 
If that result "sanitized" > > string is then used to generate an URI, the URI is also unpredictable > > and will vary between implementations, as well as its effective > > length. If it is used to generate an identifier granting some new > > access, such as a user name, several new user names could be > > generated from the same input. > > TUS 8.0 Section 3 Requirement C10 has the following, wise words in its > final paragraph: > > "However, such repair of mangled data is a special case, and it must > not be used in circumstances where it would cause security problems." > > Richard. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Tue Oct 13 01:36:30 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 13 Oct 2015 07:36:30 +0100 Subject: Counting Codepoints In-Reply-To: References: <20151011222034.2a1348ae@JRWUBU2> <561B38E1.5070007@att.net> <20151012203818.7fe468d3@JRWUBU2> Message-ID: <20151013073630.7af12df6@JRWUBU2> On Tue, 13 Oct 2015 00:49:29 +0200 Philippe Verdy wrote: > 2015-10-12 21:38 GMT+02:00 Richard Wordingham < > richard.wordingham at ntlworld.com>: > > Graceful fallback is exactly where the issue arises. Throwing an > > exception is not a useful answer to the question of how many code > > points a 'Unicode string' (not a 'UTF-16 string') contains. > If you get an invalid UTF-16 string, and caught an exception, this is > a sign that it is not UTF-16, and very frequently something else. The > application may want to retry with another encoding, possibly using > heuristic guessers, but the heuristic will only give a *probable > answer*. On Mon, 12 Oct 2015 23:35:32 +0000 David Starner wrote: > Thus a Unicode string simply can't be in UTF-16 format > internally with unpaired surrogates; a Unicode string in a programmer > opaque format must do something with broken data on input. You're assuming that the source of the non-conformance is external to the program. In the case that has caused me to ask about lone surrogates, they were actually caused by a faulty character deletion function within the program itself. Despite this fault, the program remains usable - it's little worse than a word processor that insists on autocorrupting 'GHz' and 'MHz' to 'Ghz' and 'Mhz'. I presume you are expecting input of fractional characters to be buffered until there is a whole character to add to a string. For example, a MSKLC keyboard will deliver a supplementary character in two WM_CHAR messages, one for the high surrogate and one for the low surrogate. Returning to the original questions, it would seem that there is not a unique answer to the question of how many codepoints a Unicode 16-bit string contains. Rather the question must be the unwieldy one of how many scalar values and lone surrogates it contains in total. Richard. From verdy_p at wanadoo.fr Tue Oct 13 05:17:43 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 13 Oct 2015 12:17:43 +0200 Subject: Counting Codepoints In-Reply-To: <20151013073630.7af12df6@JRWUBU2> References: <20151011222034.2a1348ae@JRWUBU2> <561B38E1.5070007@att.net> <20151012203818.7fe468d3@JRWUBU2> <20151013073630.7af12df6@JRWUBU2> Message-ID: 2015-10-13 8:36 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > For > example, a MSKLC keyboard will deliver a supplementary character in > two WM_CHAR messages, one for the high surrogate and one for the low > surrogate. 
> I have not tested the actual behavior in 64-bit versions of Windows : is the message field of the WM_CHAR returned by the 64-bit version of the API still requires returning two messages and not a single one if that field has been extended to 64-bit ? In that case, no surrogates would be returned, but directly the supplementary character. But may be this has not changed so that the predefined Windows type for wide characters remains 16-bit (otherwise even in the 32-bit version of the API, a single message would have been enough with a 32-bit message data field): the "Unicode" version of the API's assume everywhere a 16-bit encoding of strings and the event message most probably uses the same size of code units. The actual behavior is also tricky as the basic layouts built with MSKLC will have its character data translated "transparently" to other "OEM" encodings according to the current input code page of the console (using one of the codepage mapping tables installed separately): the transcoder will also need to translate the 16-bit Unicode input from WM_CHAR messages into the 8-bit input stream used by the console, and this translation will need to read both surrogates at once before sending any output. Also I don't think this is specific to MSKLC drivers. A driver (not just keyboard layouts that actually contain no code but just a data structure, but also input methods using their own message loop to process and filter input events and delivering their own translated messages) built with any other tool will use the same message format. Any way, those Windows drivers cannot actually know how the editing application will finally process the two surrogates : if the application does not detect surrogates properly and chose to discard one but not the other, the driver is not at fault and it is a bug of the application. Those MSKLC drivers actually have no view on the input buffer, they process the input on the flow (but may be the a more advanced input driver with its own message processing loop could send its own messages to query the application about what is in its buffer, or to instruct it to perform some custom substring replacements/editing and update its caret position or selection). So in my view, this is not a bug of the layout drivers themselves and not even a bug of the Windows core API. The editing application (or the common interface component) has to be prepared to process both surrogates as one character, or discard lone surrogates it could see (after alerting the user with some beep message), or submit some custom replacement. It is this application or component that will need to manage its input buffer correctly. If that buffer uses 16-bit code units, deleting one position in the buffer (for example when pressing Backspace or delete) without looking at what is deleted, or performing text selection in the middle of a surrogates pair (and then blindly replacing that selection) will generate those lone surrogates in the input buffer. 
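The buffer-side fix implied by that last paragraph is small: handle Backspace by deleting a whole code point rather than a single 16-bit code unit. A minimal Java sketch (the method name is illustrative):

static void backspace(StringBuilder buf) {
    int len = buf.length();
    if (len == 0) return;
    int start = len - 1;
    if (len >= 2 && Character.isLowSurrogate(buf.charAt(len - 1))
            && Character.isHighSurrogate(buf.charAt(len - 2))) {
        start = len - 2;   // a paired surrogate: remove both halves together
    }
    buf.delete(start, len);
}

The same check, run in the forward direction, protects Delete and the endpoints of a selection.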
The same considerations also apply to Linux input drivers and GUI components, which use 8-bit encodings including UTF-8 (this is more difficult because the Linux kernel is blind to the encoding, which is defined only in the user's input locale environment): the same havoc can happen if the editing application breaks in the middle of a multibyte UTF-8 sequence, and applications must also be ready to accept arbitrary byte sequences, including ones that are not valid UTF-8 (how an application actually handles the offending bytes remains application-dependent). The same question then arises: how many code points are in the 8-bit string if it is not valid UTF-8? There will not be a unique answer, because how applications filter those errors will vary.

You'd also have the same problem with console apps using the 8-bit BIOS/DOS input emulation API, or with terminal applications listening for input from a network socket sending 8-bit data streams: the emulation protocol also needs to filter that input and detect errors when the input does not validate in the expected encoding, but how the protocol recovers after the error remains protocol-dependent, and it is not certain that the terminal emulator notifies the user of input errors; the protocol may as well interrupt the communication with an EOF event and close the channel.

In other words: as soon as there is a single UTF validation error in some input, you cannot assert anything about the content of the input as a whole.

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From mark at macchiato.com Tue Oct 13 07:08:28 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Tue, 13 Oct 2015 14:08:28 +0200 Subject: Counting Codepoints In-Reply-To: <20151013073630.7af12df6@JRWUBU2> References: <20151011222034.2a1348ae@JRWUBU2> <561B38E1.5070007@att.net> <20151012203818.7fe468d3@JRWUBU2> <20151013073630.7af12df6@JRWUBU2> Message-ID:

On Tue, Oct 13, 2015 at 8:36 AM, Richard Wordingham < richard.wordingham at ntlworld.com> wrote:

> Rather the question must be the unwieldy one of how
> many scalar values and lone surrogates it contains in total.

That may be the question in theory; in practice no programming language is going to support APIs like that. So the question is whether your original question was purely theoretical, or was about some particular language/environment.

If the latter, then looking at the behavior of related functions in that environment, like traversing a string, and counting in a way that is most consistent with their behavior, is the least likely to cause problems.

For example, Java is pretty consistent; each of the following returns 2 as the count.

    String test = "\uDC00\uD800\uDC20";
    int count = test.codePointCount(0, test.length());
    System.out.println("codePointCount:\t" + count);

    count = 0;
    int cp;
    for (int i = 0; i < test.length(); i += Character.charCount(cp)) {
        cp = test.codePointAt(i);
        count++;
    }
    System.out.println("Java 7 iteration:\t" + count);

    count = 0;
    for (int cp2 : test.codePoints().toArray()) {
        count++;
    }
    System.out.println("Java 8 iteration:\t" + count);

    // for the last, could just call:
    // count = (int) test.codePoints().count();

The isolated surrogate code unit is consistently treated as the corresponding surrogate code point, which is what anyone would reasonably expect.

Mark

-------------- next part -------------- An HTML attachment was scrubbed...
URL:

From verdy_p at wanadoo.fr Tue Oct 13 09:16:47 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 13 Oct 2015 16:16:47 +0200 Subject: Counting Codepoints In-Reply-To: References: <20151011222034.2a1348ae@JRWUBU2> <561B38E1.5070007@att.net> <20151012203818.7fe468d3@JRWUBU2> <20151013073630.7af12df6@JRWUBU2> Message-ID:

This works in Java because Java also treats surrogates as characters, even though it has additional APIs to test strings for their actual encoded length in Unicode. But outside strings, characters are just integers matching their code point value, and are not restricted to valid Unicode characters (strings, likewise, are not restricted by UTF-16 validation). Java strings are not UTF-16 strings; they are just streams of unsigned 16-bit code units, with arbitrary values in arbitrary order (so strings that are ill-formed for Unicode are still valid Java strings). When UTF-16 validity is required, your examples with loops would have to test for the presence of lone surrogates in the returned code points. Such detection is needed for implementing some protocols, e.g. to parse HTML pages and check the encoding (or guess it), after which the input stream would be parsed with another encoding, counting code points differently. For I/O, the 16-bit "char" type is actually not used; I/O is performed with signed "byte"s, which are decoded using a specific encoding that can return errors or exceptions when decoding into strings (the reverse operation can also fail).

2015-10-13 14:08 GMT+02:00 Mark Davis ☕️ :

> On Tue, Oct 13, 2015 at 8:36 AM, Richard Wordingham <
> richard.wordingham at ntlworld.com> wrote:
>
>> Rather the question must be the unwieldy one of how
>> many scalar values and lone surrogates it contains in total.
>
> That may be the question in theory; in practice no programming language
> is going to support APIs like that. So the question is whether your
> original question was purely theoretical, or was about some particular
> language/environment.
>
> If the latter, then looking at the behavior of related functions in that
> environment, like traversing a string, and counting in a way that is most
> consistent with their behavior, is the least likely to cause problems.
>
> For example, Java is pretty consistent; each of the following returns 2 as
> the count.
>
>     String test = "\uDC00\uD800\uDC20";
>     int count = test.codePointCount(0, test.length());
>     System.out.println("codePointCount:\t" + count);
>
>     count = 0;
>     int cp;
>     for (int i = 0; i < test.length(); i += Character.charCount(cp)) {
>         cp = test.codePointAt(i);
>         count++;
>     }
>     System.out.println("Java 7 iteration:\t" + count);
>
>     count = 0;
>     for (int cp2 : test.codePoints().toArray()) {
>         count++;
>     }
>     System.out.println("Java 8 iteration:\t" + count);
>
>     // for the last, could just call: count = (int) test.codePoints().count();
>
> The isolated surrogate code unit is consistently treated as the
> corresponding surrogate code point, which is what anyone would
> reasonably expect.
>
> Mark

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From doug at ewellic.org Tue Oct 13 09:46:10 2015 From: doug at ewellic.org (Doug Ewell) Date: Tue, 13 Oct 2015 07:46:10 -0700 Subject: Counting Codepoints Message-ID: <20151013074610.665a7a7059d7ee80bb4d670165c8327d.89f65761b7.wbe@email03.secureserver.net>

Richard Wordingham wrote:

> You're assuming that the source of the non-conformance is external to
> the program.
> In the case that has caused me to ask about lone
> surrogates, they were actually caused by a faulty character deletion
> function within the program itself.

I've been bemused by all this discussion about how unpaired surrogates are supposed to behave, and this comment just cleared everything up for me. We're talking about a bug.

Very well, then, the answer is that the bug should be fixed.

-- Doug Ewell | http://ewellic.org | Thornton, CO ????

From prosfilaes at gmail.com Tue Oct 13 10:23:36 2015 From: prosfilaes at gmail.com (David Starner) Date: Tue, 13 Oct 2015 15:23:36 +0000 Subject: Counting Codepoints In-Reply-To: <20151013073630.7af12df6@JRWUBU2> References: <20151011222034.2a1348ae@JRWUBU2> <561B38E1.5070007@att.net> <20151012203818.7fe468d3@JRWUBU2> <20151013073630.7af12df6@JRWUBU2> Message-ID:

On Mon, Oct 12, 2015 at 11:42 PM Richard Wordingham < richard.wordingham at ntlworld.com> wrote:

> On Mon, 12 Oct 2015 23:35:32 +0000
> David Starner wrote:
>
> > Thus a Unicode string simply can't be in UTF-16 format
> > internally with unpaired surrogates; a Unicode string in a programmer
> > opaque format must do something with broken data on input.
>
> You're assuming that the source of the non-conformance is external to
> the program. In the case that has caused me to ask about lone
> surrogates, they were actually caused by a faulty character deletion
> function within the program itself. Despite this fault, the program
> remains usable - it's little worse than a word processor that insists on
> autocorrupting 'GHz' and 'MHz' to 'Ghz' and 'Mhz'.
>
> I presume you are expecting input of fractional characters to be
> buffered until there is a whole character to add to a string. For
> example, a MSKLC keyboard will deliver a supplementary character in
> two WM_CHAR messages, one for the high surrogate and one for the low
> surrogate.

A UTF-16 string could delete one surrogate, or add a fractional character. A Unicode string (not a "UTF-16 string"), which could be stored internally in, say, a Python-like format which is Latin-1, UCS-2, or UTF-32, with conversions made as needed and the differences hidden from the user, can't. If you let the code delete one surrogate or add one surrogate, if you interpret surrogates at all, it's a UTF-16 string; as often in computing, that gives the programmer more power and control at the cost of being harder to use and easier to break.

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From daniel.buenzli at erratique.ch Tue Oct 13 10:09:16 2015 From: daniel.buenzli at erratique.ch (=?utf-8?Q?Daniel_B=C3=BCnzli?=) Date: Tue, 13 Oct 2015 16:09:16 +0100 Subject: Counting Codepoints In-Reply-To: <20151013074610.665a7a7059d7ee80bb4d670165c8327d.89f65761b7.wbe@email03.secureserver.net> References: <20151013074610.665a7a7059d7ee80bb4d670165c8327d.89f65761b7.wbe@email03.secureserver.net> Message-ID:

On Tuesday, 13 October 2015 at 15:46, Doug Ewell wrote:

> I've been bemused by all this discussion about how unpaired surrogates
> are supposed to behave

I don't understand why people still insist on programming with Unicode at the encoding level rather than at the scalar value level. Deal with encoding errors and sanitize your inputs at the IO boundary of your program and then simply work with scalar values internally.
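A minimal sketch of that discipline, in Java to match the examples elsewhere in this thread (the replace-with-U+FFFD policy shown is one possible choice; rejecting the input outright is the other):

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    final class Boundary {
        // Decode raw bytes once, at the IO boundary. Everything past this
        // point sees only scalar values: a UTF-8 decoder can never produce
        // lone surrogates, and malformed bytes become U+FFFD here.
        static String sanitize(byte[] raw) {
            CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPLACE)
                    .onUnmappableCharacter(CodingErrorAction.REPLACE);
            try {
                return dec.decode(ByteBuffer.wrap(raw)).toString();
            } catch (CharacterCodingException e) {
                throw new AssertionError(e); // unreachable with REPLACE
            }
        }
    }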
Daniel

From richard.wordingham at ntlworld.com Tue Oct 13 13:44:39 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 13 Oct 2015 19:44:39 +0100 Subject: Counting Codepoints In-Reply-To: References: <20151011222034.2a1348ae@JRWUBU2> <561B38E1.5070007@att.net> <20151012203818.7fe468d3@JRWUBU2> <20151013073630.7af12df6@JRWUBU2> Message-ID: <20151013194439.35417d5b@JRWUBU2>

On Tue, 13 Oct 2015 14:08:28 +0200 Mark Davis ☕️ wrote:

> On Tue, Oct 13, 2015 at 8:36 AM, Richard Wordingham <
> richard.wordingham at ntlworld.com> wrote:
> > Rather the question must be the unwieldy one of how
> > many scalar values and lone surrogates it contains in total.

> That may be the question in theory; in practice no programming
> language is going to support APIs like that.

And then exhibits such an API in Java!

> // for the last, could just call: count = (int) test.codePoints().count();

The challenge is rather one of expressing the task. Perhaps: "What is the sum of the number of scalar values and the number of lone surrogates in this Unicode 16-bit string?" Maybe even: "What is the sum of the numbers of non-surrogate codepoints, surrogate pairs and lone surrogates in this Unicode 16-bit string?"

It's slightly less unwieldy in the context where I actually want the expression - "Go back for a grand total of x non-surrogate codepoints, surrogate pairs or lone surrogates."

Richard.

From richard.wordingham at ntlworld.com Tue Oct 13 13:53:29 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 13 Oct 2015 19:53:29 +0100 Subject: Counting Codepoints In-Reply-To: References: <20151011222034.2a1348ae@JRWUBU2> <561B38E1.5070007@att.net> <20151012203818.7fe468d3@JRWUBU2> <20151013073630.7af12df6@JRWUBU2> Message-ID: <20151013195329.1647a3e8@JRWUBU2>

On Tue, 13 Oct 2015 15:23:36 +0000 David Starner wrote:

> A UTF-16 string could delete one surrogate, or add a fractional
> character. A Unicode string (not a "UTF-16 string"), which could be
> stored internally in, say, a Python-like format which is Latin-1,
> UCS-2, or UTF-32, conversions made as needed and differences hidden
> from the user, can't.

Confusingly, the Unicode definitions are the other way round. A UTF-16 string is a string of UTF-16 code units in which all surrogate code units are paired. Any string of 16-bit code units is a Unicode 16-bit string.

Richard.

From richard.wordingham at ntlworld.com Tue Oct 13 14:04:49 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 13 Oct 2015 20:04:49 +0100 Subject: Counting Codepoints In-Reply-To: References: <20151011222034.2a1348ae@JRWUBU2> <561B38E1.5070007@att.net> <20151012203818.7fe468d3@JRWUBU2> <20151013073630.7af12df6@JRWUBU2> Message-ID: <20151013200449.1cc419eb@JRWUBU2>

On Tue, 13 Oct 2015 12:17:43 +0200 Philippe Verdy wrote:

> 2015-10-13 8:36 GMT+02:00 Richard Wordingham <
> richard.wordingham at ntlworld.com>:
> > For
> > example, a MSKLC keyboard will deliver a supplementary character in
> > two WM_CHAR messages, one for the high surrogate and one for the low
> > surrogate.

> I have not tested the actual behavior in 64-bit versions of Windows:
> does the message field of WM_CHAR in the 64-bit version of the API
> still require two messages rather than a single one, if that field
> has been extended to 64 bits?

In Unicode applications, WM_CHAR still delivers one UTF-16 code unit. I suspect it delivers just one byte in multibyte 'ANSI' encodings.
There is a WM_UNICHAR message that delivers whole Unicode characters, but reportedly Microsoft does not use it. > The actual behavior is also tricky as the basic layouts built with > MSKLC will have its character data translated "transparently" to > other "OEM" encodings according to the current input code page of the > console (using one of the codepage mapping tables installed > separately): the transcoder will also need to translate the 16-bit > Unicode input from WM_CHAR messages into the 8-bit input stream used > by the console, and this translation will need to read both > surrogates at once before sending any output. This only applies to 'ANSI' applications. I am not aware of any ANSI codepages that contain supplementary characters. For a Unicode application, no translation from Unicode occurs. Richard. From richard.wordingham at ntlworld.com Tue Oct 13 17:37:06 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 13 Oct 2015 23:37:06 +0100 Subject: Why Work at Encoding Level? In-Reply-To: References: <20151013074610.665a7a7059d7ee80bb4d670165c8327d.89f65761b7.wbe@email03.secureserver.net> Message-ID: <20151013233706.63771fc4@JRWUBU2> On Tue, 13 Oct 2015 16:09:16 +0100 Daniel B?nzli wrote (under topic heading 'Counting Codepoints') > I don't understand why people still insist on programming with > Unicode at the encoding level rather than at the scalar value level. > Deal with encoding errors and sanitize your inputs at the IO boundary > of your program and then simply work with scalar values internally. If you are referring to indexing, I suspect the issue is performance. UTF-32 feels wasteful, and if the underlying character text is UTF-8 or UTF-16 we need an auxiliary array to convert character number to byte offset if we are to have O(1) time for access. This auxiliary array can be compressed chunk by chunk, but the larger the chunk, the greater the maximum access time. The way it could work is a bit strange, because this auxiliary array is redundant. For example, you could use it to record the location of every 4th or every 5th codepoint so as to store UTF-8 offset variation in 4 bits, or every 15th codepoint for UTF-16. Access could proceed by looking up the index for the relevant chunk, then adding up nibbles to find the relevant recorded location within the chunk, and then use the basic character storage itself to finally reach the intermediate points. (I doubt this is an original idea, but I couldn't find it expressed anywhere. It probably performs horribly for short strings.) Perhaps you are merely suggesting that people work with a character iterator, or in C refrain from doing integer arithmetic on pointers into strings. Richard. From daniel.buenzli at erratique.ch Tue Oct 13 18:28:26 2015 From: daniel.buenzli at erratique.ch (=?utf-8?Q?Daniel_B=C3=BCnzli?=) Date: Wed, 14 Oct 2015 00:28:26 +0100 Subject: Why Work at Encoding Level? In-Reply-To: <20151013233706.63771fc4@JRWUBU2> References: <20151013074610.665a7a7059d7ee80bb4d670165c8327d.89f65761b7.wbe@email03.secureserver.net> <20151013233706.63771fc4@JRWUBU2> Message-ID: <84D764A7A83B4D73A9C09C7D619E2922@erratique.ch> Le mardi, 13 octobre 2015 ? 23:37, Richard Wordingham a ?crit : > If you are referring to indexing, I suspect the issue is performance. > UTF-32 feels wasteful, and if the underlying character text is UTF-8 or > UTF-16 we need an auxiliary array to convert character number to byte > offset if we are to have O(1) time for access. 
If UTF-32 feels wasteful, there are various smart ways of providing direct indexing at a reasonable cost, if you are in a language that has minimal support for datatype definition and abstraction. Also, I personally find indexing to be rarely useful in string processing, so it may not be the operation you want to optimize for. Having iterator-like functions as you suggest, and a datatype to represent substrings, often seems a better fit than doing indexing arithmetic.

Note that the Swift programming language seems to have gone even further than I would have: their notion of character is a grapheme cluster tested for equality using canonical equivalence, and that's what they index in their strings, see [1]. I don't know how well that works in practice, as I have personally never used it; but it feels like the ultimate Unicode string model you want to provide to the zero-knowledge Unicode programmer (at least for alphabetic scripts).

Best,

Daniel

[1] https://developer.apple.com/library/ios/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html

From verdy_p at wanadoo.fr Tue Oct 13 18:41:36 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 14 Oct 2015 01:41:36 +0200 Subject: Why Work at Encoding Level? In-Reply-To: <20151013233706.63771fc4@JRWUBU2> References: <20151013074610.665a7a7059d7ee80bb4d670165c8327d.89f65761b7.wbe@email03.secureserver.net> <20151013233706.63771fc4@JRWUBU2> Message-ID:

Speed is not much linked to in-memory buffer sizes (memory is cheap and comfortable now), and parsing in-memory encodings is extremely fast. The actual limitation is in I/O (network, or storage on disk), and at this level you work with network datagrams/packets, or disk buffers, or memory pages for paging, which use buffers of static size (so the memory allocation cost can be avoided, as the buffer is reusable). Given that, you can easily create default buffers as small as about 4 KB and convert from any encoding to another with a static auxiliary buffer that is also small (16 KB for the worst cases), managing at little cost the transitions that may occur in the middle of an encoding sequence. Working with buffers considerably reduces the number of I/O operations performed, and you can still compress by chunk (just make sure your auxiliary buffer has enough spare bytes at the end for the worst case, to avoid performing two I/O operations or compressing two chunks, including a degenerate one). Even data compression is fast now and helps reduce the I/O: the cost of compression in memory is small compared to the cost of I/O, so much so that the Windows kernel can now use generic data compression for memory page paging, to improve the global performance of the system when the global memory page pool is full, or for disk virtualization purposes.

The UTF-8 encoding is extremely simple and very fast to implement, and in most cases it saves a lot compared to storing UTF-32 (including for large collections of text elements in memory). So using iterators is the way to go: it is simple to program, easy to optimize, and you completely forget that UTF-8 is used in the backing store.

2015-10-14 0:37 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>:

> On Tue, 13 Oct 2015 16:09:16 +0100
> Daniel Bünzli wrote (under topic heading
> 'Counting Codepoints')
>
> > I don't understand why people still insist on programming with
> > Unicode at the encoding level rather than at the scalar value level.
> > Deal with encoding errors and sanitize your inputs at the IO boundary
> > of your program and then simply work with scalar values internally.
>
> If you are referring to indexing, I suspect the issue is performance.
> UTF-32 feels wasteful, and if the underlying character text is UTF-8 or
> UTF-16 we need an auxiliary array to convert character number to byte
> offset if we are to have O(1) time for access.
>
> This auxiliary array can be compressed chunk by chunk, but the larger
> the chunk, the greater the maximum access time. The way it could work
> is a bit strange, because this auxiliary array is redundant. For
> example, you could use it to record the location of every 4th or every
> 5th codepoint so as to store UTF-8 offset variation in 4 bits, or every
> 15th codepoint for UTF-16. Access could proceed by looking up the
> index for the relevant chunk, then adding up nibbles to find the
> relevant recorded location within the chunk, and then use the basic
> character storage itself to finally reach the intermediate points.
>
> (I doubt this is an original idea, but I couldn't find it expressed
> anywhere. It probably performs horribly for short strings.)
>
> Perhaps you are merely suggesting that people work with a character
> iterator, or in C refrain from doing integer arithmetic on pointers
> into strings.
>
> Richard.

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From moyogo at gmail.com Wed Oct 14 11:04:20 2015 From: moyogo at gmail.com (Denis Jacquerye) Date: Wed, 14 Oct 2015 16:04:20 +0000 Subject: Latin glottal stop in ID in NWT, Canada Message-ID:

This October the CBC has an article about having a Dene character in ID in Canada. At the moment the NWT does not allow special characters in names, but this might change after a report by the NWT languages commissioner. The article uses the unicase ʔ U+0294 LATIN LETTER GLOTTAL STOP in the name Sahaiʔa.

"N.W.T. ID should allow Dene symbols, says languages commissioner" http://www.cbc.ca/news/canada/north/n-w-t-id-should-allow-dene-symbols-says-languages-commissioner-1.3269222

Here is what N.W.T.'s language commissioner, Shannon Gullberg, is quoted as saying: "By not allowing for names that contain Dene fonts, diacritical marks and symbols, she says the Vital Statistics Act is violating the spirit and intent of the Official Languages Act."

This is a follow-up article on what was reported in March. The CBC's March article and a Maclean's article were using the unicase ʔ U+0294 as well.

"Chipewyan baby name not allowed on N.W.T. birth certificate" http://www.cbc.ca/news/canada/north/chipewyan-baby-name-not-allowed-on-n-w-t-birth-certificate-1.2984173

Where Dene languages expert Brent Kaulback is quoted as saying: "Dene fonts are now unicode fonts. They can be loaded onto any computer, and if they're typed into any computer, any other computer can read those fonts as well."

"What's in a name? A Chipewyan's battle over her native tongue" http://www.macleans.ca/society/life/all-in-the-family-name/

The Toronto Star and Metro News Toronto had articles using the uppercase Ɂ U+0241 LATIN CAPITAL LETTER GLOTTAL STOP in the name SahaiɁa. This probably should have been the unicase ʔ U+0294 or the lowercase ɂ U+0242 LATIN SMALL LETTER GLOTTAL STOP (Chipewyan sources I found use one or the other character for the lowercase letters).

"Aboriginal mom fights officialdom over spelling of daughter's name: SahaiɁa"
https://www.thestar.com/news/canada/2015/03/06/nwt-wont-recognize-infants-aboriginal-name.html

"Fighting for SahaiɁa: Canada's first peoples deserve the right to use their own names" http://www.metronews.ca/views/2015/03/10/fighting-for-sahai%25c9%2582a-canadas-first-peoples-deserve-the-right-to-use-their-own-names.html

Searching on the web, only a couple of pages (that are now offline) use the lowercase ɂ U+0242 LATIN SMALL LETTER GLOTTAL STOP in Sahaiɂa.

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From charupdate at orange.fr Thu Oct 15 03:06:45 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Thu, 15 Oct 2015 10:06:45 +0200 (CEST) Subject: Latin glottal stop in ID in NWT, Canada In-Reply-To: References: Message-ID: <693838558.2868.1444896405426.JavaMail.www@wwinf1n33>

On Wed, 14 Oct 2015 16:04:20 +0000, Denis Jacquerye wrote:

> The article uses the unicase ʔ U+0294 LATIN LETTER GLOTTAL STOP in the name Sahaiʔa.
> [...]
> The CBC's March article and a Maclean's article were using the unicase ʔ U+0294 as well
> [...]
> The Toronto Star and Metro News Toronto had articles using the uppercase Ɂ U+0241 LATIN CAPITAL LETTER GLOTTAL STOP in the name SahaiɁa. This probably should have been the unicase ʔ U+0294 or the lowercase ɂ U+0242 LATIN SMALL LETTER GLOTTAL STOP ([...]).
> [...]
> Searching on the web, only a couple of pages (that are now offline) use the lowercase ɂ U+0242 LATIN SMALL LETTER GLOTTAL STOP in Sahaiɂa.

This raises the problem of yet another ambiguity, this one due to originally diverging usages of a casing vs. a non-casing glottal stop. Latin being a casing script, the glottal stop should arguably be the casing one only. Since this is available, making an effort to unify the usage may be desirable.
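For instance, the cased pair is already wired into the standard case mappings; a small Java check (assuming a JDK whose character data is Unicode 5.0 or later) makes the situation concrete:

    public class GlottalCasing {
        public static void main(String[] args) {
            // U+0241 and U+0242 have been a standard case pair since Unicode 5.0.
            System.out.println(Character.toLowerCase('\u0241') == '\u0242'); // true
            System.out.println(Character.toUpperCase('\u0242') == '\u0241'); // true
            // The unicase U+0294 has no case mappings: it uppercases to itself.
            System.out.println(Character.toUpperCase('\u0294') == '\u0294'); // true
        }
    }

So the casing machinery in software already supports the bicameral convention; the remaining divergence is purely one of community usage.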
> > Here, this results in ensuring that Sahai?a and all other people with a > glottal stop in their name will escape trouble with even more officialdom. > > Best hopes, > > Marcel From dzo at bisharat.net Thu Oct 15 19:22:08 2015 From: dzo at bisharat.net (Don Osborn) Date: Thu, 15 Oct 2015 20:22:08 -0400 Subject: Non-standard 8-bit fonts still in use Message-ID: <56204330.6010106@bisharat.net> I was surprised to learn of continued reference to and presumably use of 8-bit fonts modified two decades ago for the extended Latin alphabets of Malian languages, and wondered if anyone has similar observations in other countries. Or if there have been any recent studies of adoption of Unicode fonts in the place of local 8-bit fonts for extended Latin (or non-Latin) in local language computing. At various times in the past I have encountered the idea that local languages with extended alphabets in Africa require special fonts (that region being my main geographic area of experience with multilingual computing), but assumed that this notion was fading away. See my recent blog post for a quick and by no means complete discussion about this topic, which of course has to do with more than just the fonts themselves: http://niamey.blogspot.com/2015/10/the-secret-life-of-bambara-arial.html TIA for any feedback. Don Osborn From moyogo at gmail.com Fri Oct 16 00:47:40 2015 From: moyogo at gmail.com (Denis Jacquerye) Date: Fri, 16 Oct 2015 05:47:40 +0000 Subject: Latin glottal stop in ID in NWT, Canada In-Reply-To: References: <693838558.2868.1444896405426.JavaMail.www@wwinf1n33> Message-ID: On Thu, 15 Oct 2015 at 23:55 Leo Broukhis wrote: > Along the same lines, should I be able to change my last name > officially to ?pyx?c? (NB all letters are codepoints with names > starting with "LATIN"). > If these are characters used in an official language of your territorial authority, that would make sense. But even if they are not, it is a good question. -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Fri Oct 16 02:18:06 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Fri, 16 Oct 2015 09:18:06 +0200 (CEST) Subject: Latin glottal stop in ID in NWT, Canada In-Reply-To: References: <693838558.2868.1444896405426.JavaMail.www@wwinf1n33> Message-ID: <306808615.2290.1444979886727.JavaMail.www@wwinf1h15> On Thu, 15 Oct 2015 15:46:46 -0700, Leo Broukhis wrote: > Along the same lines, should I be able to change my last name > officially to ?pyx?c? (NB all letters are codepoints with names > starting with "LATIN"). Your question is hard for me to answer, but I believe that basically you are allowed to submit a name change request for any last name you would like to bear, including any orthography. Ultimately the decision making belongs to your government. Best wishes, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Fri Oct 16 02:32:12 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Fri, 16 Oct 2015 09:32:12 +0200 (CEST) Subject: Latin glottal stop in ID in NWT, Canada In-Reply-To: <693838558.2868.1444896405426.JavaMail.www@wwinf1n33> References: <693838558.2868.1444896405426.JavaMail.www@wwinf1n33> Message-ID: <1572692223.2568.1444980732177.JavaMail.www@wwinf1h15> Amidst the (wise) silence on the precise subject of this thread, I?m good to point out that the use of uppercase glottal stop in home country newspapers is certainly for spectacularity?s and legibility?s sake. 
Would it be a good idea to contact the editors, pointing to the Unicode Mailing List, and forward their advice to the List? For a more accurate bit of glottal stop encoding history than in my yesterday?s mail: While the uncased original 0294 LATIN LETTER GLOTTAL STOP is a part of Unicode since the dawn (1.1), uppercase 0241 LATIN CAPITAL LETTER GLOTTAL STOP joined up for 4.1 in support of NWT communities and made U+0294 its lowercase, but this fortunately regained autonomy one year and version later when lowercase 0242 LATIN SMALL LETTER GLOTTAL STOP was born, thanks to Canada (SCC) and Ireland (NSAI) Standards bodies.[1] This was still right before the deadline of the reference subset at the creation of a widely used font shipped with Windows (so there should be no problem on font side). To date, about half of Canadian Aboriginal languages (e.g. in NWT) use cased glottal stop, while the other (e.g. in SK) use monocameral. One of the latter uses digit seven instead. ?7? for ??? is no problem on road signs, while I?m not sure whether the same applies in text processing. I don?t believe neither that in the other languages, this translation to ASCII would be less offence than the actually enforced replacement of glottal stop in ID with a hyphen-minus. I?wonder whether NWT officialdom didn?t propose to put an apostrophe for the glottal stop till they get the missing software updates :) After all, these IPA and then Latin Extended letters are thought to be basically an enlarged curly apostrophe. The curl isn?t even required, as the same sound looks like a styled ASCII apostrophe when it occurs in a number of warmer countries (A78B LATIN CAPITAL LETTER SALTILLO, A78C LATIN SMALL LETTER SALTILLO). The thread about how to call non-ASCII characters on the whole, was a very good idea. Would anybody please send a link to NWT authorities? I think it would be fine to support the courageous mum in the lawsuit! In case this link will be useful, here it is: http://www.unicode.org/mail-arch/unicode-ml/y2015-m09/0125.html In case of trouble typing glottal stops, the best solution is probably to change for a fully Unicode supporting keyboard layout. This typically has a Compose key, which can be implemented on [AltGr]+[Space]. (Put the no-break spaces into the Shift and Shift+AltGr shift states.) Here are sequences for glottal stop: {Compose}{'}{7} ? ? {Compose}{'}{T} ? ? {Compose}{'}{t} ? ? ([T] is used because it is not far from [7] and ?T with acute? doesn?t exist. To remember, read ?gloTTal sTop?, and note a slight resemblance between '7' and 'T'.) We hope that the full range of first names will be successfully implemented, so that any person bearing a name with glottal stop and his/her relatives will never encounter any trouble again. Best wishes, Marcel [1] http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2962.pdf -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmus-inc at ix.netcom.com Fri Oct 16 02:38:45 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Fri, 16 Oct 2015 00:38:45 -0700 Subject: Latin glottal stop in ID in NWT, Canada In-Reply-To: <306808615.2290.1444979886727.JavaMail.www@wwinf1h15> References: <693838558.2868.1444896405426.JavaMail.www@wwinf1n33> <306808615.2290.1444979886727.JavaMail.www@wwinf1h15> Message-ID: <5620A985.8080902@ix.netcom.com> An HTML attachment was scrubbed... 
URL: From wjgo_10009 at btinternet.com Fri Oct 16 12:10:17 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Fri, 16 Oct 2015 18:10:17 +0100 (BST) Subject: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?) In-Reply-To: <6041047.15822.1444472090118.JavaMail.defaultUser@defaultHost> References: <33365178.33329.1444395933996.JavaMail.defaultUser@defaultHost> <6041047.15822.1444472090118.JavaMail.defaultUser@defaultHost> Message-ID: <26364438.57814.1445015417239.JavaMail.defaultUser@defaultHost> What is the scope of Unicode please? Can it ever change? If it can change, who makes the decision? For example, does it need an ISO decision at a level higher than the WG2 committee or can the WG2 committee do it if it so pleases? How can a person apply for the scope of Unicode to become changed please? I have been considering how to make progress with trying for my research to become implemented in a standardized manner. I have been informed that a group of people have examined the document that I submitted and determined that it is out of scope for UTC. As implementation of the research in a standardized manner, if it ever takes place, will need of necessity for two base characters to become encoded into Unicode, then I feel that I need to submit a new document that is either itself in scope for UTC; or requests changing the scope of Unicode, though maybe such a document would itself not be regarded as being in scope. The thing is, I was not informed as to why my document was determined to be out of scope. If I knew why, then maybe I could write a document that is in scope. I am not expecting the Unicode Technical Committee to encode the two characters straight away. It would be good just to get a document into the Unicode Document Register. The best that I could achieve would be for the UTC to agree to keep the matter in escrow so that if I can persuade ISO to encode a plain paper list of words and code numbers and a plain paper list of preset sentences and code numbers, then UTC would encode the two base characters into plane 14 at that time so that the two plain paper lists could each be applied to produce a tagspace accessed by a plane 14 character. If UTC were to decide that, or something approaching that, I would then be able to approach ISO with first draft proposals for the two plain paper lists, not comprehensive, more placeholder applications so as to try to get things started, saying to ISO that encoding into Unicode would become possible. The two placeholder plain paper first drafts could then be gradually altered, maybe completely altered, and augmented so as to be able to get things started, just like Unicode started small and has been extended over time. So at the moment my first task is to try to produce a document that will be determined to be in scope so that it goes into the Unicode Document Register and each member of the Unicode Technical Committee has a chance to express an opinion at the meeting. So, what is the scope of Unicode please? William Overington 16 October 2015 ----Original message---- >From : wjgo_10009 at btinternet.com Date : 10/10/2015 - 11:14 (GMTST) To : unicode at unicode.org Subject : How can my research become implemented in a standardized manner? Please note that I am on moderated post, so if this post does get sent to the Unicode mailing list it will be because the moderator has kindly agreed to it being circulated. 
I have recently made significant progress with my research in communication through the language barrier. The capabilities are greatly improved. On 7 October 2015 I submitted a document, hoping that it would become included in the Unicode Document Register. I have been informed that a group of people have examined the document and determined that it is out of scope for UTC. I am not seeking to question that decision. As an independent researcher, not representing an organization, nor in fact employed by any organization at all, I am trying to get the system standardized as an international standard. I feel that trying to produce first a widely-used system using a Private Use Area encoding is not a realistic practical goal and even if it were practical the result would be lots of legacy data. I feel that to become successful the system needs standardization and implementation to go forward together. So what to do? More generally, how are the format and the encoding of tagspaces to be carried out in the future? The document is available on the web at the present time in two places. There is a file available for download as an attachment in a forum post of 8 October 2015 in the High-Logic Gallery forum. Adding a direct link to the post is not at present possible using the particular email system that I am using. There is direct access in my family webspace. www.users.globalnet.co.uk/~ngo/two_tagspaces.pdf In addition I have deposited the document at the British Library. William Overington 10 October 2015 -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Sat Oct 17 03:20:13 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 17 Oct 2015 09:20:13 +0100 Subject: Non-standard 8-bit fonts still in use In-Reply-To: <56204330.6010106@bisharat.net> References: <56204330.6010106@bisharat.net> Message-ID: <20151017092013.7614f94a@JRWUBU2> On Thu, 15 Oct 2015 20:22:08 -0400 Don Osborn wrote: > I was surprised to learn of continued reference to and presumably use > of 8-bit fonts modified two decades ago for the extended Latin > alphabets of Malian languages, and wondered if anyone has similar > observations in other countries. Or if there have been any recent > studies of adoption of Unicode fonts in the place of local 8-bit > fonts for extended Latin (or non-Latin) in local language computing. Non-Unicode fonts have been particularly resilient in Indic scripts, though I'm not sure what the current state of play is. I'm not sure that they are particularly '8-bit', but rather, they re-use the more accessible codes. Although these font schemes generally have the disadvantage that plain text is not supported, in the Indic world they do have advantages over Unicode: 1) What you type is what you get. Indic rearrangement irritates a lot of people. Several Tai scripts have successfully resisted it, but Indians have been suppressed by the influence of ISCII. 2) They avoid the dependence on a language-specific shaping engine. Microsoft's USE may now eliminate this advantage. 3) Text is accessible for editing. Windows provides no cursor positioning within grapheme clusters, and the one response has been to prevent editing of grapheme clusters. As a slight compensation, the idea that backward deletion should delete the preceding encoded character has a lot of implementation support. I understand that in Cambodia, Unicode was established by government edict. Richard. 
From charupdate at orange.fr Sat Oct 17 05:28:16 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Sat, 17 Oct 2015 12:28:16 +0200 (CEST) Subject: Latin glottal stop in ID in NWT, Canada In-Reply-To: References: Message-ID: <1770734131.3869.1445077696319.JavaMail.www@wwinf1d37> On Wed, 14 Oct 2015 16:04:20 +0000, Denis Jacquerye wrote: > The Toronto Star, Metro News Toronto had articles using the uppercase ? U+0241 LATIN CAPITAL LETTER GLOTTAL STOP in the name Sahai?a. This probably should have been the unicase ? U+0294 or the lowercase ? U+0241 LATIN SMALL LETTER GLOTTAL STOP (Chipewyan sources I found use one or the other character for the lowercase letters). I?believe it?s not too much to insist that here is no inconsistency of usage inside a given community. Based on the lowercase glottal stop encoding proposal,[1] we know that at least with respect to glottal stop casing, Chipewyan is the language of two distinct communities which are geographically separate. In North-West Territories, Chipewyan uses the bicameral glottal stop, and in Saskatchewan, Chipewyan uses the unicameral glottal stop. Based upon this, I?suppose that the cited Chipewyan sources originate from Saskatchewan, and that they happen to contain only instances of glottal stop in lowercase positions. By this occasion I?apologize for having written about unification of usage. Best regards, Marcel [1] http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2962.pdf -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Sat Oct 17 05:46:37 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Sat, 17 Oct 2015 12:46:37 +0200 (CEST) Subject: Latin glottal stop in ID in NWT, Canada In-Reply-To: <1770734131.3869.1445077696319.JavaMail.www@wwinf1d37> References: <1770734131.3869.1445077696319.JavaMail.www@wwinf1d37> Message-ID: <717059844.8332.1445078797606.JavaMail.www@wwinf1p23> Please disregard my previous faulty e-mail. I don't have much time to spend on issues that I'm not directly concerned with, so sadly I'm very stressed. Here is the accurate one: On Wed, 14 Oct 2015 16:04:20 +0000, Denis Jacquerye wrote: > The Toronto Star, Metro News Toronto had articles using the uppercase ? U+0241 LATIN CAPITAL LETTER GLOTTAL STOP in the name Sahai?a. This probably should have been the unicase ? U+0294 or the lowercase ? U+0241 LATIN SMALL LETTER GLOTTAL STOP (Chipewyan sources I found use one or the other character for the lowercase letters). I? believe it?s not too much to insist that here is no inconsistency of usage inside a given community. Based on the lowercase glottal stop encoding proposal,[1] we know that at least with respect to glottal stop casing, Chipewyan is the language of two distinct communities which are geographically separate. In North-West Territories, Chipewyan uses the bicameral glottal stop, and in Saskatchewan, Chipewyan uses the unicameral glottal stop. Based upon this, I ?suppose that those among the cited Chipewyan sources which use unicase glottal stop, originate from Saskatchewan, and that those using lowercase glottal stop, originate from NWT, and that both happen to contain only instances of glottal stop in lowercase positions. By this occasion I? apologize for having written about unification of usage. Best regards, Marcel [1] http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2962.pdf -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From charupdate at orange.fr Sat Oct 17 06:28:21 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Sat, 17 Oct 2015 13:28:21 +0200 (CEST) Subject: Tirhuta (linked to: Re: Non-standard 8-bit fonts still in use) In-Reply-To: <20151017092013.7614f94a@JRWUBU2> References: <56204330.6010106@bisharat.net> <20151017092013.7614f94a@JRWUBU2> Message-ID: <1935235321.6153.1445081301207.JavaMail.www@wwinf1k18> On Sat, 17 Oct 2015 09:20:13 +0100, Richard Wordingham wrote: > On Thu, 15 Oct 2015 20:22:08 -0400 > Don Osborn wrote: > > > I was surprised to learn of continued reference to and presumably use > > of 8-bit fonts modified two decades ago for the extended Latin > > alphabets of Malian languages, and wondered if anyone has similar > > observations in other countries. Or if there have been any recent > > studies of adoption of Unicode fonts in the place of local 8-bit > > fonts for extended Latin (or non-Latin) in local language computing. > > Non-Unicode fonts have been particularly resilient in Indic scripts, > though I'm not sure what the current state of play is. I'm not sure > that they are particularly '8-bit', but rather, they re-use the more > accessible codes. > > Although these font schemes generally have the disadvantage that plain > text is not supported, in the Indic world they do have advantages over > Unicode: > > 1) What you type is what you get. Indic rearrangement irritates a lot > of people. Several Tai scripts have successfully resisted it, but > Indians have been suppressed by the influence of ISCII. Does this mean that OpenType fonts are "overscripted" and that glyph reordering and glyph substitution are not appreciated? If so, the best seems to me to convert legacy fonts to Unicode conformant fonts without scripting them. Or to provide kind of a *stable input* option that disables the advanced behaviour. Marcel ? [First in thread: http://www.unicode.org/mail-arch/unicode-ml/y2015-m09/0155.html] [Previous in thread: http://www.unicode.org/mail-arch/unicode-ml/y2015-m09/0156.html] ? -------------- next part -------------- An HTML attachment was scrubbed... URL: From eik at iki.fi Sat Oct 17 08:18:27 2015 From: eik at iki.fi (Erkki I Kolehmainen) Date: Sat, 17 Oct 2015 16:18:27 +0300 Subject: Latin glottal stop in ID in NWT, Canada In-Reply-To: <717059844.8332.1445078797606.JavaMail.www@wwinf1p23> References: <1770734131.3869.1445077696319.JavaMail.www@wwinf1d37> <717059844.8332.1445078797606.JavaMail.www@wwinf1p23> Message-ID: <000001d108de$4f1bd100$ed537300$@fi> Dear Mr. Schneider, Nobody forces you to spend any time on issues that you are not directly concerned with or even those that you are. Thus, please, spare us at least from contributions that even by your own admittance have been prepared in a haste without much thought. Sincerely, Erkki I. Kolehmainen L?hett?j?: Unicode [mailto:unicode-bounces at unicode.org] Puolesta Marcel Schneider L?hetetty: 17. lokakuuta 2015 13:47 Vastaanottaja: Denis Jacquerye Kopio: Unicode Discussion Aihe: Re: Latin glottal stop in ID in NWT, Canada Please disregard my previous faulty e-mail. I don't have much time to spend on issues that I'm not directly concerned with, so sadly I'm very stressed. Here is the accurate one: On Wed, 14 Oct 2015 16:04:20 +0000, Denis Jacquerye < moyogo at gmail.com> wrote: > The Toronto Star, Metro News Toronto had articles using the uppercase ? U+0241 LATIN CAPITAL LETTER GLOTTAL STOP in the name Sahai?a. This probably should have been the unicase ? U+0294 or the lowercase ? 
U+0241 LATIN SMALL LETTER GLOTTAL STOP (Chipewyan sources I found use one or the other character for the lowercase letters). I? believe it?s not too much to insist that here is no inconsistency of usage inside a given community. Based on the lowercase glottal stop encoding proposal,[1] we know that at least with respect to glottal stop casing, Chipewyan is the language of two distinct communities which are geographically separate. In North-West Territories, Chipewyan uses the bicameral glottal stop, and in Saskatchewan, Chipewyan uses the unicameral glottal stop. Based upon this, I ?suppose that those among the cited Chipewyan sources which use unicase glottal stop, originate from Saskatchewan, and that those using lowercase glottal stop, originate from NWT, and that both happen to contain only instances of glottal stop in lowercase positions. By this occasion I? apologize for having written about unification of usage. Best regards, Marcel [1] http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2962.pdf -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Sun Oct 18 19:45:14 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 19 Oct 2015 01:45:14 +0100 Subject: Why Work at Encoding Level? In-Reply-To: <84D764A7A83B4D73A9C09C7D619E2922@erratique.ch> References: <20151013074610.665a7a7059d7ee80bb4d670165c8327d.89f65761b7.wbe@email03.secureserver.net> <20151013233706.63771fc4@JRWUBU2> <84D764A7A83B4D73A9C09C7D619E2922@erratique.ch> Message-ID: <20151019014514.0c392f7c@JRWUBU2> On Wed, 14 Oct 2015 00:28:26 +0100 Daniel B?nzli wrote: > If UTF-32 feels wasteful there are various smart ways of providing > direct indexing at a reasonable cost if you are in a language that > has minimal support for datatype definition and abstraction. I can't find a good one that's been published. The Elias-Fano encoding for UTF-8 indexing works out at 3 to 5 bits per character even without extending to achieve 'constant time' access, the limiting extremes being English and Ugaritic. (Most SMP scripts use a lot of ASCII.) For genuine UTF-8 text I can happily get the memory requirement down to 1.031 bits per character. I exploit the fact that one can easily advance character by character through a UTF-8 string, but limit myself to 5 advances. The 0.031 part of the factor comes in for strings longer than a thousand characters, and could be reduced to 0.002 with some extra processing. There's a lot of redundancy in the positions. > Note that the Swift programming language seems to have gone even > further than I would have: their notion of character is a grapheme > cluster tested for equality using canonical equivalence and that's > what they index in their strings, see [1]. Don't know how well that > works in practice as I personally never used it; but it feels like > the ultimate Unicode string model you want to provide to the > zero-knowledge Unicode programmer (at least for alphabetic scripts). It doesn't quite work. For Thai at least, deleting backwards should delete just a combining mark rather than the whole grapheme cluster. I couldn't find any provision for this in Swift. There is also the question (irrelevant for Thai) of whether this deletion should be done in NFC or NFD. 
Deleting backwards deleting only a combining mark also makes sense for the International Phonetic Alphabet, as well as for the Thai script used alphabetically (as often done for Pali) and for the Lao script - the modern Lao writing system is formally an alphabet. Richard. From doug at ewellic.org Mon Oct 19 12:07:31 2015 From: doug at ewellic.org (Doug Ewell) Date: Mon, 19 Oct 2015 10:07:31 -0700 Subject: Why Work at Encoding =?UTF-8?Q?Level=3F?= Message-ID: <20151019100731.665a7a7059d7ee80bb4d670165c8327d.1842de28ba.wbe@email03.secureserver.net> This discussion was originally about how to handle unpaired surrogates, as if that were a normal use case. Regardless of what encoding model is used to handle characters under the hood, and regardless of how the Delete key should work with actual characters or clusters, there is never any excuse for software to create unpaired surrogates, or any other sort of invalid code unit sequences. That is like having an image editor that deletes every 128th byte from a JPEG file, and then worrying about how to display the file. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From richard.wordingham at ntlworld.com Mon Oct 19 13:53:03 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 19 Oct 2015 19:53:03 +0100 Subject: Why Work at Encoding Level? In-Reply-To: <20151019100731.665a7a7059d7ee80bb4d670165c8327d.1842de28ba.wbe@email03.secureserver.net> References: <20151019100731.665a7a7059d7ee80bb4d670165c8327d.1842de28ba.wbe@email03.secureserver.net> Message-ID: <20151019195303.53e8ee83@JRWUBU2> On Mon, 19 Oct 2015 10:07:31 -0700 "Doug Ewell" wrote: > This discussion was originally about how to handle unpaired > surrogates, as if that were a normal use case. And the subject line was changed when the topic changed to traversing strings. > Regardless of what encoding model is used to handle characters under > the hood, and regardless of how the Delete key should work with actual > characters or clusters, there is never any excuse for software to > create unpaired surrogates, or any other sort of invalid code unit > sequences. How about, 'The specification says that one must pass the number of _characters_ in the string.'? Even worse, some specifications talk of 'Unicode characters' when they mean UTF-16 code units. The word 'codepoint' is even worse, as a supplementary plane codepoint is represented by two BMP codepoints. ICU (but perhaps it's actually Java) seems to have a culture of tolerating lone surrogates, and rules for handling lone surrogates are strewn across the Unicode standards and annexes. It was the once the case that basic Unicode support in regular expressions required a regular expression engine to be able to search for specified lone surrogates - a real show-stopper for an engine working in UTF-8. The Unicode collation algorithm conformance test once tested that implementations of collation collated lone surrogates correctly. Raising an exception was an automatic test failure! By contrast, no-one's proposed collation rules for broken bits of UTF-8 characters or non-minimal length forms. > That is like having an image editor that deletes every > 128th byte from a JPEG file, and then worrying about how to display > the file. 1. Of course, telemetry streams may very well contain damaged JPEG images! 2. The problem bad handling of supplementary characters seems to be associated with UTF-16 is that the damage is rarely as obvious as every 128th code unit. 
By contrast, bad UTF-8 handling usually comes to light as soon as the text processing moves beyond ASCII. Richard. From verdy_p at wanadoo.fr Mon Oct 19 14:35:16 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 19 Oct 2015 21:35:16 +0200 Subject: Why Work at Encoding Level? In-Reply-To: <20151019195303.53e8ee83@JRWUBU2> References: <20151019100731.665a7a7059d7ee80bb4d670165c8327d.1842de28ba.wbe@email03.secureserver.net> <20151019195303.53e8ee83@JRWUBU2> Message-ID: 2015-10-19 20:53 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > On Mon, 19 Oct 2015 10:07:31 -0700 > "Doug Ewell" wrote: > > > This discussion was originally about how to handle unpaired > > surrogates, as if that were a normal use case. > > And the subject line was changed when the topic changed to traversing > strings. > > > Regardless of what encoding model is used to handle characters under > > the hood, and regardless of how the Delete key should work with actual > > characters or clusters, there is never any excuse for software to > > create unpaired surrogates, or any other sort of invalid code unit > > sequences. > > The word > 'codepoint' is even worse, as a supplementary plane codepoint is > represented by two BMP codepoints. > No ! The "supplementary code points" (or "supplementary characters" when they are assigned to characters) are represented in UTF-16 as two **code units**, NOT as two "code points" (even if their binary value are related). The code points in range U+D800..U+DF00 are NEVER characters they are juste permanently reserved in order to unassign them to any character, so these code points are assigned, but not to characters (otherwise these characters would not be representable as valid UTF-16). These code points also do not have any scalar value, and there are not valid scalar values in range 0xD800..0xDFFF (the valid scalar values are in two ranges of integers, separated by this hole). So please don't mix "code points" and "code units" ! -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Mon Oct 19 15:32:07 2015 From: doug at ewellic.org (Doug Ewell) Date: Mon, 19 Oct 2015 13:32:07 -0700 Subject: Unpaired surrogates (was: Re: Why Work at Encoding =?UTF-8?Q?Level=3F=29?= Message-ID: <20151019133207.665a7a7059d7ee80bb4d670165c8327d.aec337c813.wbe@email03.secureserver.net> Richard Wordingham wrote: >> This discussion was originally about how to handle unpaired >> surrogates, as if that were a normal use case. > > And the subject line was changed when the topic changed to > traversing strings. Granted. I've changed it again to reflect this specific issue. > How about, 'The specification says that one must pass the number of > _characters_ in the string.'? Even worse, some specifications talk of > 'Unicode characters' when they mean UTF-16 code units. The word > 'codepoint' is even worse, as a supplementary plane codepoint is > represented by two BMP codepoints. None of this lets any implementer or implementation off the hook. TUS is very clear that an unpaired surrogate is not to be interpreted in any way, and particularly not to be treated as an abstract character. See, for example, C1 and D75. > ICU (but perhaps it's actually Java) seems to have a culture of > tolerating lone surrogates, and rules for handling lone surrogates are > strewn across the Unicode standards and annexes. I suspect you have an example. 
I'd be curious what any of them has to say that does not equate to "this is an anomalous situation and represents broken and ill-formed text." Applications that treat unpaired surrogates as well-formed text do not change the rules; they are in violation of the rules.

> It was once the case that basic Unicode support in regular expressions required a regular expression engine to be able to search for specified lone surrogates - a real show-stopper for an engine working in UTF-8. The Unicode collation algorithm conformance test once tested that implementations of collation collated lone surrogates correctly. Raising an exception was an automatic test failure! By contrast, no-one has proposed collation rules for broken bits of UTF-8 characters or non-minimal length forms.

Are these tests still included, or did someone notice that they were in conflict with the standard and remove them?

> > That is like having an image editor that deletes every 128th byte from a JPEG file, and then worrying about how to display the file.
>
> 1. Of course, telemetry streams may very well contain damaged JPEG images!

Of course. But are they conformant to the JPEG standard? Is there a standard way to repair and display them?

> 2. The problem with bad handling of supplementary characters, which seems to be associated with UTF-16, is that the damage is rarely as obvious as every 128th code unit. By contrast, bad UTF-8 handling usually comes to light as soon as the text processing moves beyond ASCII.

Of course. I could have said "deletes random bytes from a JPEG file." An unpaired surrogate can be detected either immediately, or immediately after the next code unit. In neither case is it to be interpreted as anything other than invalid text.

Philippe Verdy wrote:

> No! The "supplementary code points" (or "supplementary characters" when they are assigned to characters) are represented in UTF-16 as two **code units**, NOT as two "code points" (even if their binary values are related).

Surrogate values are not abstract characters, but they are code points (D10). Note that Surrogate is one of the seven types of code points (D10a).

--
Doug Ewell | http://ewellic.org | Thornton, CO ????

From richard.wordingham at ntlworld.com  Mon Oct 19 15:34:01 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 19 Oct 2015 21:34:01 +0100
Subject: Why Work at Encoding Level?
In-Reply-To:
References: <20151019100731.665a7a7059d7ee80bb4d670165c8327d.1842de28ba.wbe@email03.secureserver.net> <20151019195303.53e8ee83@JRWUBU2>
Message-ID: <20151019213401.5246bdcb@JRWUBU2>

On Mon, 19 Oct 2015 21:35:16 +0200 Philippe Verdy wrote:

> 2015-10-19 20:53 GMT+02:00 Richard Wordingham <richard.wordingham at ntlworld.com>:
>
> > The word 'codepoint' is even worse, as a supplementary plane codepoint is represented by two BMP codepoints.
>
> No! The "supplementary code points" (or "supplementary characters" when they are assigned to characters) are represented in UTF-16 as two **code units**, NOT as two "code points" (even if their binary values are related).

A code point is 'any value in the Unicode codespace' (TUS Section 3.4 D10). The 'Unicode codespace' is a range of integers from 0 to 0x10FFFF (TUS Section 3.4 D9). This works fine so long as one thinks of a 'code point' as just a number. The problem is that people rarely use the term 'scalar value'.

Richard.
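Richard's distinction is easy to demonstrate in Java, the language this thread keeps returning to. A minimal sketch, using only standard java.lang APIs (the class name is invented for illustration):

    public class CodePointsVsUnits {
        public static void main(String[] args) {
            // U+1F600 is one code point (and one scalar value), but two UTF-16 code units.
            String s = "a\uD83D\uDE00";
            System.out.println(s.length());                        // 3 code units
            System.out.println(s.codePointCount(0, s.length()));   // 2 code points
            System.out.println(Character.charCount(0x1F600));      // 2 units for one code point

            // U+D800 is a code point of type Surrogate (gc=Cs), but not a scalar value,
            // so no well-formed UTF can represent it in isolation.
            System.out.println(Character.isValidCodePoint(0xD800));               // true
            System.out.println(Character.getType(0xD800) == Character.SURROGATE); // true
        }
    }

Note that a Java String, being a Unicode 16-bit string rather than guaranteed UTF-16, can still contain a lone surrogate; it is the encoding forms, not the string type, that exclude it.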
From verdy_p at wanadoo.fr  Mon Oct 19 16:17:46 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 19 Oct 2015 23:17:46 +0200
Subject: Unpaired surrogates (was: Re: Why Work at Encoding Level?)
In-Reply-To: <20151019133207.665a7a7059d7ee80bb4d670165c8327d.aec337c813.wbe@email03.secureserver.net>
References: <20151019133207.665a7a7059d7ee80bb4d670165c8327d.aec337c813.wbe@email03.secureserver.net>
Message-ID:

2015-10-19 22:32 GMT+02:00 Doug Ewell:

> Philippe Verdy wrote:
>
> > No! The "supplementary code points" (or "supplementary characters" when they are assigned to characters) are represented in UTF-16 as two **code units**, NOT as two "code points" (even if their binary values are related).
>
> Surrogate values are not abstract characters,

I did NOT write that.

> but they are code points

That's what I wrote; you are merely restating it.

> (D10). Note that Surrogate is one of the seven types of code points (D10a).

I have not denied this. I denied Richard's affirmation that a single (supplementary) code point could be represented as two (surrogate) code points; it was wrong only in its last word ("points" where it should have been "units").

From markus.icu at gmail.com  Mon Oct 19 16:29:29 2015
From: markus.icu at gmail.com (Markus Scherer)
Date: Mon, 19 Oct 2015 14:29:29 -0700
Subject: Unpaired surrogates (was: Re: Why Work at Encoding Level?)
In-Reply-To: <20151019133207.665a7a7059d7ee80bb4d670165c8327d.aec337c813.wbe@email03.secureserver.net>
References: <20151019133207.665a7a7059d7ee80bb4d670165c8327d.aec337c813.wbe@email03.secureserver.net>
Message-ID:

On Mon, Oct 19, 2015 at 1:32 PM, Doug Ewell wrote:

> > ICU (but perhaps it's actually Java) seems to have a culture of tolerating lone surrogates, and rules for handling lone surrogates are strewn across the Unicode standards and annexes.
>
> I suspect you have an example.

I have examples from ICU processing of 16-bit Unicode strings (which are not usually required to be well-formed UTF-16 strings):

- "Count code points" counts an unpaired surrogate as 1.
- "Move forward/backward by n code points" counts an unpaired surrogate as 1.
- "Lower-/title-/upper-case the string" passes through an unpaired surrogate as-is, like any code point that does not have case mappings.
- "Get property x of code point y" returns the property value according to the UCD; for example, gc(surrogate)=Cs.
- Collating a string that contains an unpaired surrogate: ICU currently uses the second approach from UCA section 7.1.1.

See http://userguide.icu-project.org/strings#TOC-ICU:-16-bit-Unicode-strings

However, "convert from UTF-16 to UTF-8" and such treats an unpaired surrogate as an error.

> > The Unicode collation algorithm conformance test once tested that implementations of collation collated lone surrogates correctly. Raising an exception was an automatic test failure! By contrast, no-one has proposed collation rules for broken bits of UTF-8 characters or non-minimal length forms.
>
> Are these tests still included, or did someone notice that they were in conflict with the standard and remove them?

We updated http://www.unicode.org/Public/UCA/latest/CollationTest.html to say: "These files contain test cases that include ill-formed strings, with surrogate code points. Implementations that do not weight surrogate code points the same way as reserved code points may filter out such lines in the test cases, before testing for conformance."

Best regards,
markus
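Markus's split between lenient string operations and strict conversion can be reproduced without ICU. A sketch against the standard java.nio.charset API (the class name is invented; the replacement behavior shown is what the JDK's UTF-8 charset does by default):

    import java.nio.CharBuffer;
    import java.nio.charset.*;

    public class LoneSurrogateConversion {
        public static void main(String[] args) {
            String lone = "a\uD800b"; // a 16-bit Unicode string with an unpaired high surrogate

            // String operations count and pass through the lone surrogate:
            System.out.println(lone.codePointCount(0, lone.length())); // 3: it counts as 1
            System.out.println(lone.toUpperCase().charAt(1) == '\uD800'); // true: passed through

            // Lenient, lossy conversion: String.getBytes uses the REPLACE action,
            // so the lone surrogate becomes the charset's replacement byte ('?'):
            System.out.println(lone.getBytes(StandardCharsets.UTF_8).length); // 3

            // Strict conversion: REPORT turns the same input into an exception.
            CharsetEncoder strict = StandardCharsets.UTF_8.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                strict.encode(CharBuffer.wrap(lone));
            } catch (CharacterCodingException e) {
                System.out.println("rejected: " + e); // MalformedInputException
            }
        }
    }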
From richard.wordingham at ntlworld.com  Mon Oct 19 19:07:12 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 20 Oct 2015 01:07:12 +0100
Subject: Unpaired surrogates
In-Reply-To: <20151019133207.665a7a7059d7ee80bb4d670165c8327d.aec337c813.wbe@email03.secureserver.net>
References: <20151019133207.665a7a7059d7ee80bb4d670165c8327d.aec337c813.wbe@email03.secureserver.net>
Message-ID: <20151020010712.31a6d29c@JRWUBU2>

On Mon, 19 Oct 2015 13:32:07 -0700 "Doug Ewell" wrote:

> Richard Wordingham wrote:
>
> > It was once the case that basic Unicode support in regular expressions required a regular expression engine to be able to search for specified lone surrogates - a real show-stopper for an engine working in UTF-8. The Unicode collation algorithm conformance test once tested that implementations of collation collated lone surrogates correctly. Raising an exception was an automatic test failure! By contrast, no-one has proposed collation rules for broken bits of UTF-8 characters or non-minimal length forms.
>
> Are these tests still included, or did someone notice that they were in conflict with the standard and remove them?

Markus Scherer has answered this question as it applies to collation.

For regular expressions, Requirement RL1.7 'Supplementary Code Points' still reads:

"To meet this requirement, an implementation shall handle the full range of Unicode code points, including values from U+FFFF to U+10FFFF. In particular, where UTF-16 is used, a sequence consisting of a leading surrogate followed by a trailing surrogate shall be handled as a single code point in matching."

Now, as we know, UTF-32 does not handle the full range of Unicode code points; it only handles scalar values. In the discussion of UTS#18 RL1.7, my objections did result in the addition of:

"Note: It is permissible, but not required, to match an isolated surrogate code point (such as \u{D800}), which may occur in Unicode Strings. See Unicode String in the Unicode glossary."

I'm not sure that that text, loosely associated with RL1.7, gets round Requirement RL1.1, which still reads:

"To meet this requirement, an implementation shall supply a mechanism for specifying any Unicode code point (from U+0000 to U+10FFFF), using the hexadecimal code point representation."

Possibly a compliant implementation needs to parse hex codes for surrogate code points, even if only to reject input containing them, or to interpret them as a perverse alternative syntax for the perverse expression \p{^any}. Or is \p{^any} actually matched by isolated non-ASCII UTF-8 code units? As there is no requirement for a regular expression engine conforming to UTS#18 'Unicode Regular Expressions' to handle non-conformant Unicode strings, this need not be a problem.

Richard.
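For what it's worth, java.util.regex, which works over 16-bit strings but matches by code point, behaves the way the RL1.7 note permits. A small probe; the results are as observed on recent JDKs, not something the Pattern documentation promises:

    import java.util.regex.Pattern;

    public class LoneSurrogateRegex {
        public static void main(String[] args) {
            Pattern p = Pattern.compile("\uD800"); // a literal high surrogate

            // An isolated surrogate code point in the input can be matched...
            System.out.println(p.matcher("a\uD800b").find());       // true

            // ...but the same code unit inside a well-formed pair is not,
            // because the engine sees the pair as the single code point U+10000:
            System.out.println(p.matcher("a\uD800\uDC00b").find()); // false
        }
    }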
From verdy_p at wanadoo.fr  Tue Oct 20 05:06:35 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Tue, 20 Oct 2015 12:06:35 +0200
Subject: Unpaired surrogates
In-Reply-To: <20151020010712.31a6d29c@JRWUBU2>
References: <20151019133207.665a7a7059d7ee80bb4d670165c8327d.aec337c813.wbe@email03.secureserver.net> <20151020010712.31a6d29c@JRWUBU2>
Message-ID:

2015-10-20 2:07 GMT+02:00 Richard Wordingham <richard.wordingham at ntlworld.com>:

> Now, as we know, UTF-32 does not handle the full range of Unicode code points;

??? All valid UTFs handle the full range of valid Unicode code points. This includes UTF-32 as well as UTF-16 and UTF-8 (and their variants).

> it only handles scalar values.

??? UTFs allow encoding ANY valid scalar value (the scalar values are bijectively associated with a subset of the valid code points). However, they don't allow encoding surrogates (which are valid code points, but are not assigned any scalar value, and so are not valid in any valid UTF). Clearly you are still confusing code points, code units and scalar values.

> In the discussion of UTS#18 RL1.7, my objections did result in the addition of:
>
> "Note: It is permissible, but not required, to match an isolated surrogate code point (such as \u{D800}), which may occur in Unicode Strings. See Unicode String in the Unicode glossary."
>
> I'm not sure that that text, loosely associated with RL1.7, gets round Requirement RL1.1, which still reads:
>
> "To meet this requirement, an implementation shall supply a mechanism for specifying any Unicode code point (from U+0000 to U+10FFFF), using the hexadecimal code point representation."

I'm also puzzled about how such a regexp will really match some input text if that input text has to be in a valid UTF. The regexp "\u{D800}" will likely match only lone surrogates (in any UTF), not a surrogate with the same value which is correctly paired to encode a supplementary code point.

Note that in **valid** UTF-8 text, U+D800 cannot occur. But if you remove the "valid" restriction, U+D800 may be present, including before U+DC00, and this still won't form a valid pair: these are also lone surrogates in this case (they are paired, and encode a supplementary code point, only if the text is UTF-16). There are no valid surrogate pairs in valid UTF-8 or valid UTF-32, so if surrogates appear there, they are all "lone" surrogates. If you blindly convert from UTF-8 or UTF-32 to UTF-16, the invalid text could become valid, and new valid supplementary code points will appear unexpectedly. That's why lone surrogates cannot be part of any valid UTF: they break the bijection.

From frederic.grosshans at gmail.com  Tue Oct 20 05:14:22 2015
From: frederic.grosshans at gmail.com (Frédéric Grosshans)
Date: Tue, 20 Oct 2015 12:14:22 +0200
Subject: Non-standard 8-bit fonts still in use
In-Reply-To: <56204330.6010106@bisharat.net>
References: <56204330.6010106@bisharat.net>
Message-ID: <562613FE.8050001@gmail.com>

On 16/10/2015 02:22, Don Osborn wrote:

> I was surprised to learn of continued reference to and presumably use of 8-bit fonts modified two decades ago for the extended Latin alphabets of Malian languages, and wondered if anyone has similar observations in other countries. Or if there have been any recent studies of adoption of Unicode fonts in the place of local 8-bit fonts for extended Latin (or non-Latin) in local language computing.

A different usage where I suspect proprietary 8-bit fonts are used is electronic French (Grandjean) stenotypes, which use some non-Unicode characters (like an E without the middle bar). They have apparently been used with computer software since the 1980s (cf. https://hal.archives-ouvertes.fr/jpa-00245165/document [pdf in French]) to make live subtitles. But I guess the proprietary nature of these characters and their use by a single company (since ~1910) make their encoding in Unicode unlikely.
Frédéric

From asmus-inc at ix.netcom.com  Tue Oct 20 06:29:17 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Tue, 20 Oct 2015 04:29:17 -0700
Subject: Unpaired surrogates
In-Reply-To:
References: <20151019133207.665a7a7059d7ee80bb4d670165c8327d.aec337c813.wbe@email03.secureserver.net> <20151020010712.31a6d29c@JRWUBU2>
Message-ID: <5626258D.90408@ix.netcom.com>

An HTML attachment was scrubbed...

From mark at macchiato.com  Tue Oct 20 20:23:17 2015
From: mark at macchiato.com (Mark Davis ☕️)
Date: Tue, 20 Oct 2015 18:23:17 -0700
Subject: Why Work at Encoding Level?
In-Reply-To: <20151019100731.665a7a7059d7ee80bb4d670165c8327d.1842de28ba.wbe@email03.secureserver.net>
References: <20151019100731.665a7a7059d7ee80bb4d670165c8327d.1842de28ba.wbe@email03.secureserver.net>
Message-ID:

> there is never any excuse for software to create unpaired surrogates, or any other sort of invalid code unit sequences

First off, it depends on when one is encountered. They are invalid in UTF-16, but are permitted in a Unicode 16-bit string.

But more fundamentally, there may not be "excuses" for such software, but it happens anyway. Pretending it doesn't makes for unhappy customers. For example, you don't want to be throwing an exception when one is encountered, when that could cause an app to fail. So the point is to handle the situation as gracefully, consistently, and safely as possible. And 'safely' is key. Pretending that it doesn't exist is logically equivalent to deletion, and can cause security problems (see UTR #36).

Mark

On Mon, Oct 19, 2015 at 10:07 AM, Doug Ewell wrote:

> This discussion was originally about how to handle unpaired surrogates, as if that were a normal use case.
>
> Regardless of what encoding model is used to handle characters under the hood, and regardless of how the Delete key should work with actual characters or clusters, there is never any excuse for software to create unpaired surrogates, or any other sort of invalid code unit sequences. That is like having an image editor that deletes every 128th byte from a JPEG file, and then worrying about how to display the file.
>
> --
> Doug Ewell | http://ewellic.org | Thornton, CO ????

From daniel.buenzli at erratique.ch  Tue Oct 20 20:47:45 2015
From: daniel.buenzli at erratique.ch (Daniel Bünzli)
Date: Wed, 21 Oct 2015 02:47:45 +0100
Subject: Why Work at Encoding Level?
In-Reply-To:
References: <20151019100731.665a7a7059d7ee80bb4d670165c8327d.1842de28ba.wbe@email03.secureserver.net>
Message-ID: <2BA47FCE32444BEA84064E15DA74C5EC@erratique.ch>

On Wednesday, 21 October 2015 at 02:23, Mark Davis ☕️ wrote:

> But more fundamentally, there may not be "excuses" for such software, but it happens anyway. Pretending it doesn't makes for unhappy customers. For example, you don't want to be throwing an exception when one is encountered, when that could cause an app to fail.

It does happen at the input layer, but it doesn't make any sense to bother programmers with this once the IO boundary has been crossed and decoding errors have been handled. A good Unicode string in a programming language should at least operate at the scalar value level, and these notions of Unicode n-bit strings should definitely be killed (that might have inspired the hopeless designers of recent programming languages to actually make better choices on this topic).

Best,

Daniel
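Daniel's "handle it once at the IO boundary" and Mark's "replace rather than delete" advice (the UTR #36 point) combine into a sanitizer along the following lines. A hedged sketch in Java; the class and helper names are invented:

    public class Sanitize {
        // Replace every unpaired surrogate with U+FFFD, keeping well-formed pairs.
        // Replacement (rather than silent deletion) avoids the security problems
        // that deletion can cause.
        static String replaceLoneSurrogates(String s) {
            StringBuilder sb = new StringBuilder(s.length());
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                if (Character.isHighSurrogate(c) && i + 1 < s.length()
                        && Character.isLowSurrogate(s.charAt(i + 1))) {
                    sb.append(c).append(s.charAt(++i)); // well-formed pair: keep as is
                } else if (Character.isSurrogate(c)) {
                    sb.append('\uFFFD');                // lone surrogate: replace, don't delete
                } else {
                    sb.append(c);
                }
            }
            return sb.toString();
        }

        public static void main(String[] args) {
            // "a" + lone high surrogate + "b" + a well-formed pair (U+1F600):
            // prints "a" U+FFFD "b" U+1F600
            System.out.println(replaceLoneSurrogates("a\uD800b\uD83D\uDE00"));
        }
    }

Run once at input time, this leaves the rest of the program working with scalar values only, which is exactly the invariant Daniel argues for.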
From mark at macchiato.com  Tue Oct 20 22:37:54 2015
From: mark at macchiato.com (Mark Davis ☕️)
Date: Tue, 20 Oct 2015 20:37:54 -0700
Subject: Why Work at Encoding Level?
In-Reply-To: <2BA47FCE32444BEA84064E15DA74C5EC@erratique.ch>
References: <20151019100731.665a7a7059d7ee80bb4d670165c8327d.1842de28ba.wbe@email03.secureserver.net> <2BA47FCE32444BEA84064E15DA74C5EC@erratique.ch>
Message-ID:

> A good Unicode string in a programming language

Yes, that would be great, no question. It isn't, however, the case in most programming languages (measured by the amount of software written in them).

The original question that started these threads was how to handle isolated surrogates. If you are lucky enough to be only ever using programming languages that prevent that from ever happening, then the question is moot for you. If you're not, the question is relevant.

Mark

On Tue, Oct 20, 2015 at 6:47 PM, Daniel Bünzli wrote:

> It does happen at the input layer, but it doesn't make any sense to bother programmers with this once the IO boundary has been crossed and decoding errors have been handled. A good Unicode string in a programming language should at least operate at the scalar value level, and these notions of Unicode n-bit strings should definitely be killed (that might have inspired the hopeless designers of recent programming languages to actually make better choices on this topic).

From charupdate at orange.fr  Wed Oct 21 01:21:39 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Wed, 21 Oct 2015 08:21:39 +0200 (CEST)
Subject: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)
In-Reply-To: <26364438.57814.1445015417239.JavaMail.defaultUser@defaultHost>
References: <33365178.33329.1444395933996.JavaMail.defaultUser@defaultHost> <6041047.15822.1444472090118.JavaMail.defaultUser@defaultHost> <26364438.57814.1445015417239.JavaMail.defaultUser@defaultHost>
Message-ID: <1732752496.822.1445408499446.JavaMail.www@wwinf1h15>

On Fri, 16 Oct 2015 18:10:17 +0100 (BST), William_J_G Overington wrote:

> What is the scope of Unicode please?

An accurate idea of the scope of Unicode is best found on this page: http://www.unicode.org/consortium/consort.html

> Can it ever change?

Even though I'm not in a position to write on behalf of the UTC, I can express my opinion that it would take a major upheaval for the scope of Unicode to change.

> If it can change, who makes the decision? For example, does it need an ISO decision at a level higher than the WG2 committee or can the WG2 committee do it if it so pleases?

While delicately preserving respectful cooperation, Unicode has gained thorough leadership in character encoding and process standardization engineering. And it is an autonomous consortium of powerful companies and renowned institutions.

> How can a person apply for the scope of Unicode to become changed please?
Individuals cannot apply for changes, but they can submit feedback, suggestions, and proposals, and they are granted free access to registers and archives, all of which is already an enormous privilege. This is all the more striking as several recent threads on this List have pointed to the opposite practice at the other standards body invoked above.

Kind regards,

Marcel

From daniel.buenzli at erratique.ch  Wed Oct 21 08:16:07 2015
From: daniel.buenzli at erratique.ch (Daniel Bünzli)
Date: Wed, 21 Oct 2015 14:16:07 +0100
Subject: Why Work at Encoding Level?
In-Reply-To:
References: <20151019100731.665a7a7059d7ee80bb4d670165c8327d.1842de28ba.wbe@email03.secureserver.net> <2BA47FCE32444BEA84064E15DA74C5EC@erratique.ch>
Message-ID:

On Wednesday, 21 October 2015 at 04:37, Mark Davis ☕️ wrote:

> If you're not, the question is relevant.

I'm not disputing the question; I'm disputing trying to give it a defined answer. Even if your string is UTF-16 based, these problems can be solved by providing proper abstractions at the library level and asking clients to handle the problem *once*, when they inject UTF-16 strings into your abstraction, which can then operate in a "clean" world where these questions do not arise.

Besides, programming languages do evolve, and one should at least make sure that new languages provide adequate abstractions for handling Unicode text. Looking at the recent batch of new languages, I don't think this is happening. I'm sure language designers are keen on taking off-the-shelf designs for this rather than getting into the details, but I would say that TUS, by defining notions of Unicode strings at the encoding level, is not doing a very good job of providing one.

FWIW, when I got into the standard around 2008 by reading that thick hard copy of TUS 5.0, it took me quite some time to actually understand and uncover the real structure behind Unicode, which is the scalar values.

Best,

Daniel

From richard.wordingham at ntlworld.com  Wed Oct 21 13:13:19 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 21 Oct 2015 19:13:19 +0100
Subject: Why Work at Encoding Level?
In-Reply-To:
References: <20151019100731.665a7a7059d7ee80bb4d670165c8327d.1842de28ba.wbe@email03.secureserver.net> <2BA47FCE32444BEA84064E15DA74C5EC@erratique.ch>
Message-ID: <20151021191319.28f710e4@JRWUBU2>

On Wed, 21 Oct 2015 14:16:07 +0100 Daniel Bünzli wrote:

> I'm not disputing the question; I'm disputing trying to give it a defined answer. Even if your string is UTF-16 based, these problems can be solved by providing proper abstractions at the library level and asking clients to handle the problem *once*, when they inject UTF-16 strings into your abstraction, which can then operate in a "clean" world where these questions do not arise.

That sounds good, but would you please talk me through how you apply it in the TSF method InsertTextAtSelection. Remember that the user may have switched input method several times.

Richard.

From mark at macchiato.com  Wed Oct 21 13:43:02 2015
From: mark at macchiato.com (Mark Davis ☕️)
Date: Wed, 21 Oct 2015 11:43:02 -0700
Subject: Why Work at Encoding Level?
In-Reply-To:
References: <20151019100731.665a7a7059d7ee80bb4d670165c8327d.1842de28ba.wbe@email03.secureserver.net> <2BA47FCE32444BEA84064E15DA74C5EC@erratique.ch>
Message-ID:

Mark

On Wed, Oct 21, 2015 at 6:16 AM, Daniel Bünzli wrote:

> I'm not disputing the question; I'm disputing trying to give it a defined answer. Even if your string is UTF-16 based, these problems can be solved by providing proper abstractions at the library level and asking clients to handle the problem *once*, when they inject UTF-16 strings into your abstraction, which can then operate in a "clean" world where these questions do not arise.

Again, a nice thought, and I am sympathetic to what you want. But for most people it runs into the brick wall of reality.

Let's take Java for example. You could clearly write your own StringX class that was logically UTF-32 (like the Uniform model in http://macchiati.blogspot.com/2012/07/unicode-string-models-many-programming.html). But modern products use countless libraries to do their work, so you'll end up converting every time you call one of those libraries or get back a result. In the end, it might make your piece of code more reliable, but there will be a certain cost. And you are still dependent on those other libraries.

Moreover, a key problem is the indexes. When you are calling out to an API that takes a String and an index into that string, you could have a simple method to return a String (if that is your internal representation). But you will have to convert from your code point index to the API's code unit index. That either involves storing an interesting data structure in your StringX object, or doing a scan, which is relatively expensive.

> Besides, programming languages do evolve, and one should at least make sure that new languages provide adequate abstractions for handling Unicode text. Looking at the recent batch of new languages, I don't think this is happening. I'm sure language designers are keen on taking off-the-shelf designs for this rather than getting into the details, but I would say that TUS, by defining notions of Unicode strings at the encoding level, is not doing a very good job of providing one.

Unicode evolved over time, and had pretty severe constraints when it originated. I agree that for a new language it would be cleaner to have a Uniform model.

> FWIW, when I got into the standard around 2008 by reading that thick hard copy of TUS 5.0, it took me quite some time to actually understand and uncover the real structure behind Unicode, which is the scalar values.

Asmus put it nicely (why the thread split, I don't know):

"When it comes to methods operating on buffers there's always the tension between viewing the buffer as text elements vs. as data elements. For some purposes, from error detection to data cleanup you need to be able to treat the buffer as data elements. For many other operations, a focus on text elements is enough. If you desire to have a regex that you can use to validate a raw buffer, then that regex must do something sensible with partial code points. If you don't have multiple regex engines, then limiting your single one to valid input prevents you from using it everywhere."
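The index conversion Mark describes is visible in the standard String API itself: offsetByCodePoints performs exactly that relatively expensive linear scan. A small sketch:

    public class IndexConversion {
        public static void main(String[] args) {
            String s = "x\uD83D\uDE00y"; // x, U+1F600, y

            // Code point index -> code unit index: a linear scan.
            int unitIndex = s.offsetByCodePoints(0, 2);
            System.out.println(unitIndex);            // 3, because U+1F600 took two units
            System.out.println(s.charAt(unitIndex));  // 'y'

            // Code unit index -> code point index: also a scan.
            System.out.println(s.codePointCount(0, unitIndex)); // 2
        }
    }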
From daniel.buenzli at erratique.ch  Wed Oct 21 13:50:32 2015
From: daniel.buenzli at erratique.ch (Daniel Bünzli)
Date: Wed, 21 Oct 2015 19:50:32 +0100
Subject: Why Work at Encoding Level?
In-Reply-To: <20151021191319.28f710e4@JRWUBU2>
References: <20151019100731.665a7a7059d7ee80bb4d670165c8327d.1842de28ba.wbe@email03.secureserver.net> <2BA47FCE32444BEA84064E15DA74C5EC@erratique.ch> <20151021191319.28f710e4@JRWUBU2>
Message-ID: <68B9534C1C664F329002B97FF3394EA0@erratique.ch>

On Wednesday, 21 October 2015 at 19:13, Richard Wordingham wrote:

> That sounds good, but would you please talk me through how you apply it in the TSF method InsertTextAtSelection. Remember that the user may have switched input method several times.

Sorry, I don't know these acronyms or methods. Interaction with the input method should always eventually yield a stream of scalar values; if it's badly designed, you should try to abstract it so that it provides the right mechanism for you.

Daniel

From daniel.buenzli at erratique.ch  Wed Oct 21 14:42:30 2015
From: daniel.buenzli at erratique.ch (Daniel Bünzli)
Date: Wed, 21 Oct 2015 20:42:30 +0100
Subject: Why Work at Encoding Level?
In-Reply-To:
References: <20151019100731.665a7a7059d7ee80bb4d670165c8327d.1842de28ba.wbe@email03.secureserver.net> <2BA47FCE32444BEA84064E15DA74C5EC@erratique.ch>
Message-ID: <24F8D8CBEBBC400C942C73975A063608@erratique.ch>

On Wednesday, 21 October 2015 at 19:43, Mark Davis ☕️ wrote:

> Moreover, a key problem is the indexes. When you are calling out to an API that takes a String and an index into that string, you could have a simple method to return a String (if that is your internal representation). But you will have to convert from your code point index to the API's code unit index. That either involves storing an interesting data structure in your StringX object, or doing a scan, which is relatively expensive.

I'm not sure I fully understand what you wanted to say here, so I'm just trying to respond to the last sentence. You can have an abstract datatype that *represents* a scalar value index in a string: it knows the exact byte index (or underlying storage element) at which the scalar value starts in the string, but it hides the actual value from you. This allows you to access the scalar value directly, without having to scan or store an interesting data structure in StringX, while avoiding direct access to the underlying encoding.

The idea here is that direct random indexing is rarely needed; what happens most of the time is that you need to remember specific points in the string during a string traversal - for example, think about delineating the substrings matching a pattern. Whenever you hit these points, the traversal function knows the exact byte index and can be used to yield values of the abstract index datatype.

> Unicode evolved over time, and had pretty severe constraints when it originated.

Sure. What I'm trying to say here is that its presentation could maybe be modernized a bit, by putting a greater emphasis on scalar values and less on their encoding. This could improve the messy conceptual model of Unicode I tend to find in the brains of my programmer peers.

> Asmus put it nicely (why the thread split, I don't know):
>
> "When it comes to methods operating on buffers there's always the tension between viewing the buffer as text elements vs. as data elements. For some purposes, from error detection to data cleanup you need to be able to treat the buffer as data elements. [...] If you desire to have a regex that you can use to validate a raw buffer, then that regex must do something sensible with partial code points."

I personally don't think this is a good or desirable way of operating. Sanitize inputs and treat encoding errors first, at the IO boundary of your program; then process the cleaned-up data, on which you know strong invariants hold.

Best,

Daniel
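Daniel's abstract index datatype might look like this in Java terms. All names here are invented, and the underlying storage is a UTF-16 String rather than bytes, but the shape is the same: Pos values are only minted by the traversal, so clients never touch raw code unit offsets directly:

    public final class ScalarText {
        private final String s; // underlying UTF-16 storage, hidden from clients
        public ScalarText(String s) { this.s = s; }

        // Abstract index: records where a scalar value starts, hides the offset.
        public static final class Pos {
            final int unitOffset;
            Pos(int unitOffset) { this.unitOffset = unitOffset; }
        }

        public interface Visitor { void visit(Pos p, int scalarValue); }

        // Traversal is the only source of Pos values.
        public void forEachScalar(Visitor v) {
            for (int i = 0; i < s.length(); ) {
                int cp = s.codePointAt(i);
                v.visit(new Pos(i), cp);
                i += Character.charCount(cp);
            }
        }

        // Direct access via a remembered Pos: no rescan, no exposed encoding.
        public int scalarAt(Pos p) { return s.codePointAt(p.unitOffset); }

        public static void main(String[] args) {
            new ScalarText("x\uD83D\uDE00y").forEachScalar(
                (p, cp) -> System.out.printf("U+%04X starts at unit %d%n", cp, p.unitOffset));
        }
    }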
From richard.wordingham at ntlworld.com  Wed Oct 21 16:50:10 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 21 Oct 2015 22:50:10 +0100
Subject: Why Work at Encoding Level?
In-Reply-To: <68B9534C1C664F329002B97FF3394EA0@erratique.ch>
References: <20151019100731.665a7a7059d7ee80bb4d670165c8327d.1842de28ba.wbe@email03.secureserver.net> <2BA47FCE32444BEA84064E15DA74C5EC@erratique.ch> <20151021191319.28f710e4@JRWUBU2> <68B9534C1C664F329002B97FF3394EA0@erratique.ch>
Message-ID: <20151021225010.07598367@JRWUBU2>

On Wed, 21 Oct 2015 19:50:32 +0100 Daniel Bünzli wrote:

> Sorry, I don't know these acronyms or methods. Interaction with the input method should always eventually yield a stream of scalar values; if it's badly designed, you should try to abstract it so that it provides the right mechanism for you.

The simpler-looking input methods provide and then delete text. For example, I have a Keyman for Linux input editor based on XSAMPA in which I can successively input e_H\ to get the successive text displays e e_ ? e?. (The latter includes a spacing tone mark.) The input editor knows whether my application has <U+0065, U+0301> or U+00E9 LATIN SMALL LETTER E WITH ACUTE in my backing store when I strike the backslash - there is a callback for this very purpose, though the input editor does have fallback logic, which is needed when it uses the X protocols. It uses the GTK+ interface with GTK+ applications, and sends the commands "delete one character before the cursor" in each case and "insert <U+00E9>" or "insert <U+0065, U+0301>" accordingly.

Now, the GTK+ commands function in terms of scalar values, which should be nice. However, notice that text and positions go in both directions across the interface. The Text Services Framework on Windows works similarly, but its commands seem to be expressed in terms of absolute UTF-16 positions. Abstraction may move the problem, but it doesn't eliminate it. The best one can hope for is a reusable abstraction.

Richard.

From mark at kli.org  Wed Oct 21 18:50:48 2015
From: mark at kli.org (Mark E. Shoulson)
Date: Wed, 21 Oct 2015 19:48:21 -0400
Subject: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)
In-Reply-To: <26364438.57814.1445015417239.JavaMail.defaultUser@defaultHost>
References: <33365178.33329.1444395933996.JavaMail.defaultUser@defaultHost> <6041047.15822.1444472090118.JavaMail.defaultUser@defaultHost> <26364438.57814.1445015417239.JavaMail.defaultUser@defaultHost>
Message-ID: <562824D8.7050006@kli.org>

On 10/16/2015 01:10 PM, William_J_G Overington wrote:

> I have been considering how to make progress with trying for my research to become implemented in a standardized manner.
>
> I have been informed that a group of people have examined the document that I submitted and determined that it is out of scope for UTC.
There are millions of people on this great globe doing all kinds of research into all kinds of things. Most of them somehow manage to do so without requiring an international standards body to change its workings and basic outlook to accommodate them. It staggers the imagination that your research simply cannot be done without the cooperation of Unicode, and moreover, that you have the nerve to ask for it to change its *entire scope* just so that your personal project, stalled by your own hand, can move forward. Learn how all those millions of people out there manage to do their work and further their research without calling on multinational bodies to bend to their whims. It must be possible; everyone else seems to be able to do it.

The only thing stopping your research from progressing and standardizing is you. Unicode isn't doing what you want? Make your own standard. Make it standard for *your* stuff. Get people to like it and use it. You cannot expect Unicode to change to be what you want any time in the foreseeable future; make do without it.

Please. Grow up and take responsibility for your own research, and stop trying to bend Unicode into what YOU think it should be, when the clear consensus is that it isn't. The rest of us are tired of having to answer this question (or see it answered) over and over.

~mark

From wjgo_10009 at btinternet.com  Thu Oct 22 04:21:43 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Thu, 22 Oct 2015 10:21:43 +0100 (BST)
Subject: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)
In-Reply-To: <33232474.13418.1445504878174.JavaMail.root@webmail02.bt.ext.cpcloud.co.uk>
References: <33232474.13418.1445504878174.JavaMail.root@webmail02.bt.ext.cpcloud.co.uk>
Message-ID: <22651988.15114.1445505703382.JavaMail.defaultUser@defaultHost>

Mark E. Shoulson wrote:

> Unicode isn't doing what you want? Make your own standard. Make it standard for *your* stuff. Get people to like it and use it.

Unicode and the International Standard with which it is synchronized are the standards.

I submitted a rewritten document on Monday 19 October 2015.

The document is available on the web.

http://www.users.globalnet.co.uk/~ngo/a_preliminary_proposal_to_encode_two_base_characters.pdf

It is linked from the following web page.

http://www.users.globalnet.co.uk/~ngo/library.htm

The document has been deposited, as an email attachment, with the British Library for Legal Deposit, and a receipt has been received.

Here is a link about Legal Deposit in the United Kingdom.

http://www.bl.uk/aboutus/legaldeposit/index.html

William Overington

22 October 2015

From rick at unicode.org  Thu Oct 22 11:54:01 2015
From: rick at unicode.org (Rick McGowan)
Date: Thu, 22 Oct 2015 09:54:01 -0700
Subject: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)
In-Reply-To: <26364438.57814.1445015417239.JavaMail.defaultUser@defaultHost>
References: <33365178.33329.1444395933996.JavaMail.defaultUser@defaultHost> <6041047.15822.1444472090118.JavaMail.defaultUser@defaultHost> <26364438.57814.1445015417239.JavaMail.defaultUser@defaultHost>
Message-ID: <562914A9.6030802@unicode.org>

Hello William,

Answers to most of your questions can be found among the pages of the Unicode Consortium website.
I'll try to answer your questions about scope which may also be of interest to other subscribers, but please note that *everything I say in this e-mail is solely my own opinion and does not reflect the opinions or policies of Unicode, Inc, or any of its committees.*

> What is the scope of Unicode please?

The scope of The Unicode *Standard* (TUS) is set forth in Chapter 1, which you can find here: http://www.unicode.org/versions/Unicode8.0.0/ch01.pdf

The scope of the Unicode *Consortium* is essentially distilled in the mission statement, which is on the home page: http://www.unicode.org/ and on the "What is Unicode" page here: http://www.unicode.org/standard/WhatIsUnicode.html under the heading "About the Unicode Consortium"... and formally here, in the corporate bylaws: http://www.unicode.org/consortium/Unicode-Bylaws.pdf under "Article I - Purpose and Membership", which says:

...This Corporation's specific purpose shall be to enable people around the world to use computers in any language, by providing freely-available specifications and data to form the foundation for software internationalization in all major operating systems, search engines, applications, and the World Wide Web. An essential part of this purpose is to standardize, maintain, educate and engage academic and scientific communities, and the general public about, make publicly available, promote, and disseminate to the public a standard character encoding that provides for an allocation for more than a million characters.

> Can it ever change?

The answer to that question depends on what you mean by "it", and "change", really. The scope of the *standard* has changed several times over the course of its history, as has the scope of the *consortium*, for good reasons. For example, the corporate scope was expanded to include a variety of standards beyond just the character encoding standard, which were of interest to members (and continue to be of interest). The scope of the *standard* was expanded to include code space for more than 65,536 characters, to include characters needed for historical scripts, and so forth.

> If it can change, who makes the decision? For example, does it need an ISO decision at a level higher than the WG2 committee or can the WG2 committee do it if it so pleases?

Like any *corporation*, the Unicode Consortium bylaws are subject to changes from time to time. The full members, as set forth in the bylaws, are the ones who may make changes to the bylaws. There are some restrictions, of course, such as operating within various legal parameters and within the scope of a public-benefit charitable organization, as defined under US law.

The *standard* is mainly controlled by the Unicode Technical Committee, operating under the TC Procedures laid out here: http://www.unicode.org/consortium/tc-procedures.html and subject to interpretation or restriction by the officers and board of directors. The UTC works very closely with members of ISO/IEC JTC1/SC2 and the working group WG2 under it. (You can find out about ISO procedures and so forth on their site.)

> How can a person apply for the scope of Unicode to become changed please?

The most direct way to influence the scope of the Unicode Standard is through becoming a full member of the consortium: http://www.unicode.org/consortium/join.html so that you can vote in corporate meetings and for members of the board, as well as in technical committees.
Then, presumably, you would go to an annual members' meeting (or call for a special meeting) and present your case for the scope of the consortium to be changed. Then, if you want to change the scope of The Unicode Standard, you call for a vote in the UTC and achieve a majority of votes on whatever resolution you put to the committee. This is *intentionally* a weighty process.

> I have been considering how to make progress with trying for my research to become implemented in a standardized manner.

Personally, I think you're getting ahead of yourself. First, you should demonstrate that you have done research and produced results that at least some people find so useful and important that they are eager to implement the findings. Then, once you have done that, think about standardizing something, but only after you have a *working model* of the thing sufficient to demonstrate its general utility.

While I do not speak for the UTC in any way, observations of the committee over a period of some years have led me to conclude that they never encode something, call it "X", on pure speculation that some future research might result in "X" being useful for some purpose that has not even been demonstrated as a need, or clearly enough articulated to engender the committee's confidence in its potential utility.

Regards,
Rick

From asmus-inc at ix.netcom.com  Thu Oct 22 15:59:01 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Thu, 22 Oct 2015 13:59:01 -0700
Subject: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)
In-Reply-To: <562914A9.6030802@unicode.org>
References: <33365178.33329.1444395933996.JavaMail.defaultUser@defaultHost> <6041047.15822.1444472090118.JavaMail.defaultUser@defaultHost> <26364438.57814.1445015417239.JavaMail.defaultUser@defaultHost> <562914A9.6030802@unicode.org>
Message-ID: <56294E15.8050800@ix.netcom.com>

On 10/22/2015 9:54 AM, Rick McGowan wrote:

> Personally, I think you're getting ahead of yourself. First, you should demonstrate that you have done research and produced results that at least some people find so useful and important that they are eager to implement the findings. Then, once you have done that, think about standardizing something, but only after you have a *working model* of the thing sufficient to demonstrate its general utility.
>
> While I do not speak for the UTC in any way, observations of the committee over a period of some years have led me to conclude that they never encode something, call it "X", on pure speculation that some future research might result in "X" being useful for some purpose that has not even been demonstrated as a need, or clearly enough articulated to engender the committee's confidence in its potential utility.

To the degree that one can make a "general" statement, this pretty much sums it up - and, as my experience with standards, both as a developer and a consumer of them, has convinced me, this is absolutely the right approach. Well-intentioned innovation should not be what drives standards. "Standard" practice is what should drive them. Seeming exceptions, such as programming language standards, have a strong community that works on testing and trying out new language features ahead of their being added to the language, so that people have a good idea how they will pan out - but even there, the occasional feature ends up stillborn.
As concerns the original question, Rick is absolutely correct. The scope of the consortium is set by its members. Formally by the full members, but with an ear to what will make the Consortium strong (that is, attract all classes of members and technical experts).

Originally, of course, there wasn't a Consortium. Just a number of people working on a common goal. They called themselves the Unicode Working Group, and developed a complete 700+ page draft before deciding on the Consortium structure as the most appropriate for their work.

This path, of creating a new structure at the end of an informal collaboration of like-minded people, is one of the more successful routes for new projects and ideas to spread. If an idea or concept is powerful enough to generate committed interest, then that is a good predictor of future staying power. Conversely, if you cannot get people to work with you informally, trying to "make" some existing formal group accept your ideas isn't going to lead to any better results.

A./

From mark at kli.org  Thu Oct 22 19:48:21 2015
From: mark at kli.org (Mark E. Shoulson)
Date: Thu, 22 Oct 2015 20:48:21 -0400
Subject: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)
In-Reply-To: <22651988.15114.1445505703382.JavaMail.defaultUser@defaultHost>
References: <33232474.13418.1445504878174.JavaMail.root@webmail02.bt.ext.cpcloud.co.uk> <22651988.15114.1445505703382.JavaMail.defaultUser@defaultHost>
Message-ID: <562983D5.2090808@kli.org>

It's nice that you've written proposals. I suppose the various groups will pick them up and get back to you as they usually do. But if they say "no, you're out of scope" again, it probably means that you're out of scope, and submitting another proposal of the same thing will not make it any more in scope. I have no idea why deposit with the British Library is in any way significant or even relevant. It's nice to mail documents to people who will save them, yes.

You ask these same questions often. Often enough that some have been banned as topics of conversation here. You've been doing it for years. "Can the scope of Unicode change?" you ask. At this point, I suggest that you act as if the answer is "No!" and move on, without trying to force Unicode to become a partner in your research. Even if the answer is "Maybe", it's not the kind of thing you can be *sure* will happen. You need to proceed in a way that doesn't depend on other things beyond your control. You want to join Unicode as an official member and try to change its scope from the inside, where you can even vote? Be my guest. You can't proceed with your research without a multinational standards committee changing *its entire scope and outlook* just to accommodate you? Then you're going about research wrong.

"Unicode and the International Standard with which it is synchronized are the standards," you say? Obviously not, since Unicode has said that it doesn't encode what you want. So it is NOT the standard for the things you want to use it for. It's the standard for other things. Do doctors insist that the WHO completely change its focus so their research can be included? Other researchers the world over are doing their thing without asking ISO, Unicode, ANSI, DIN, or, for that matter, the IAEA to change to suit them.
I have not heard of other cases like this, which doesn't mean there aren't any, but it probably means there aren't many, and I haven't heard any standards organizations announcing changes based on requests like this, either. This is not the standard you were looking for. Find another or make your own (or both), like a responsible researcher and scientist.

~mark

On 10/22/2015 05:21 AM, William_J_G Overington wrote:

> Mark E. Shoulson wrote:
>
> > Unicode isn't doing what you want? Make your own standard. Make it standard for *your* stuff. Get people to like it and use it.
>
> Unicode and the International Standard with which it is synchronized are the standards.
>
> I submitted a rewritten document on Monday 19 October 2015.
>
> The document is available on the web.
>
> http://www.users.globalnet.co.uk/~ngo/a_preliminary_proposal_to_encode_two_base_characters.pdf
>
> It is linked from the following web page.
>
> http://www.users.globalnet.co.uk/~ngo/library.htm
>
> The document has been deposited, as an email attachment, with the British Library for Legal Deposit, and a receipt has been received.
>
> Here is a link about Legal Deposit in the United Kingdom.
>
> http://www.bl.uk/aboutus/legaldeposit/index.html
>
> William Overington
>
> 22 October 2015

From petercon at microsoft.com  Thu Oct 22 21:47:18 2015
From: petercon at microsoft.com (Peter Constable)
Date: Fri, 23 Oct 2015 02:47:18 +0000
Subject: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)
In-Reply-To: <562983D5.2090808@kli.org>
References: <33232474.13418.1445504878174.JavaMail.root@webmail02.bt.ext.cpcloud.co.uk> <22651988.15114.1445505703382.JavaMail.defaultUser@defaultHost> <562983D5.2090808@kli.org>
Message-ID:

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Mark E. Shoulson
Sent: Friday, October 23, 2015 9:48 AM

> I have no idea why deposit with the British Library is in any way significant or even relevant. It's nice to mail documents to people who will save them, yes.

Hmmm... If I (or anyone else) were to forward to the British Library every item I post to this or other public lists or fora, or anything else I'd like to have publicly recorded, would they provide a permanent, public record? I would have expected them to be pretty selective about what things they decide to hang onto.

Peter

From charupdate at orange.fr  Fri Oct 23 01:59:21 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Fri, 23 Oct 2015 08:59:21 +0200 (CEST)
Subject: Latin glottal stop in ID in NWT, Canada
Message-ID: <1022406759.1183.1445583561682.JavaMail.www@wwinf1h14>

On Wed, 14 Oct 2015 16:04:20 +0000, Denis Jacquerye wrote:

> Here is what N.W.T.'s language commissioner, Shannon Gullberg, is quoted saying:
>
> "By not allowing for names that contain Dene fonts, diacritical marks and symbols, she says, the Vital Statistics Act is violating the spirit and intent of the Official Languages Act."
>
> [...]
>
> Where Dene languages expert Brent Kaulback is quoted saying:
>
> "Dene fonts are now unicode fonts. They can be loaded onto any computer, and if they're typed into any computer, any other computer can read those fonts as well."

I'm very glad to read the above statements. Only when looking closer, especially at other parts of the cited articles, but also elsewhere on the web in other countries, I'm amazed and disconcerted to see that there is still a serious lack of knowledge, even among learned people, about what Unicode is and about the nature of our daily work tools.
Some people keep talking about fonts where Unicode talks about characters. Some people view glottal stops as symbols that are not part of the Latin script, although these characters have shipped with every computer on Windows (and, indeed, on any other OS) through font support for almost a decade or so.

Reading further, I stumbled upon yet other oddities. Some people are calling "Roman alphabet" what seemingly should be Latin script, while roman is today a font style only. The following are but examples, which are here because they're inside the thread's topic:

>>> The department said it has to adhere to the Vital Statistics Act, which recognizes only names that use letters from the Roman alphabet. Having symbols like the glottal stop on birth certificates would also interfere with obtaining passports and other documents issued by the federal government, according to an email from a department spokesperson.

http://www.macleans.ca/society/life/all-in-the-family-name/

>>> The Northwest Territories government has refused to register the girl under that name, saying that all names must be spelled using the standard Roman alphabet.

>>> The territory's Vital Statistics Department told her she couldn't register her baby under that name. It said Roman characters are legally required for names because they have to appear on official federal documents.

>>> Healy said the department is working with Ottawa to see if it's possible to allow such characters on official documents. The issue raises both technical and economic issues, he said.

>>> "In the event that the fonts cannot be accepted by the federal government, the department will have to continue to produce a birth certificate that only includes the Roman alphabet," he said.

https://www.thestar.com/news/canada/2015/03/06/nwt-wont-recognize-infants-aboriginal-name.html

>>> the Northwest Territories government was unable to register a name that is not written entirely in the Roman alphabet.

>>> In an email [...], a government representative explained that's because the glottal stop isn't part of the Roman alphabet.

http://www.cbc.ca/news/canada/north/chipewyan-baby-name-not-allowed-on-n-w-t-birth-certificate-1.2984173

I stop quoting here so as not to lengthen the refrain, but rather point out that beyond the grounds mentioned for refusing the glottal stop (missing resources and unwillingness to buy new engraving and fixed-type printing machines, perhaps ignorance too), there seem to be two main reasons:

A | Missing awareness of ethical guidelines. The following quotations corroborate what I already outlined off list (I'm highlighting with uppercase):

>>> For Arok Wolvengrey, head of the indigenous languages department at the First Nations University of Canada in Regina, these stories aren't surprising, and point to the ways Aboriginal languages are under threat. "The decision not to allow the proper representation of their children's name IS A SERIOUS INSULT," he says. "This is ANOTHER EXAMPLE OF THE DUAL MESSAGES GOVERNMENTS OFTEN SEND. They say they respect our official languages, but that's definitely not how it plays out in practice. For many people who no longer speak these languages, this is the only way they can preserve their ancestry."

>>> In Nunavut, which recognizes Inuktitut, English and French, Inuit can register traditional names, INCLUDING THE GLOTTAL STOP, FOR GOVERNMENT DOCUMENTS. But it looks as though the Northwest Territories won't be MAKING CONCESSIONS ANY TIME SOON.
"Practically, the current vital statistics database and printer do not accommodate glottal stops ... and significant resources would be needed to upgrade them," a spokesperson for the department said in an email this week.

http://www.macleans.ca/society/life/all-in-the-family-name/

B | Missing keyboard layouts. Most people (including myself two years back) simply don't know how to update their keyboard layout in a manner that allows them to input these characters in a reasonable way. This, however, isn't mentioned by anybody, obviously because it questions the ability to control a more or less trivial everyday work tool.

Germany standardized, five years ago, a couple of new backwards-compatible keyboard layouts designed for the proper input of all occurring names in Latin script, and ALL THREE GLOTTAL STOPS HAVE PLACES ASSIGNED ON KEYS (at least on the most complete of these layouts). Finland has a standard keyboard with support for Sámi and many other languages. Great Britain has an extended keyboard layout including local language characters. In France there are at least two associations proposing keyboard layouts, and the government is standardizing an official layout or two for multilingual support.

In CANADA, QUÉBEC has had the merit of creating a MULTILINGUAL keyboard that was successfully standardized A QUARTER OF A CENTURY ago, and this is nowadays THREATENED BY THE IT INDUSTRY, so that it cannot even be completed to the initially planned end stage allowing full support of all Latin characters for official languages. See this other thread on the Unicode Public List:

Effectiveness of locale support (was: Re: Custom source samples)
http://www.unicode.org/mail-arch/unicode-ml/y2015-m09/0014.html
http://www.unicode.org/mail-arch/unicode-ml/y2015-m09/0015.html

I'm perhaps the last person on the Unicode List to be in a position to point to other people's ignorance, as I am very ignorant myself and was even more so when I started e-mailing Unicode and the Unicode List. Knowing thus by experience what ignorance is, how it works, and what it does, I'm in turn perhaps the only person capable of sending this e-mail. However, I came close to not sending it to the List, as I'd written it in a way that wasn't really fit for a public audience. Now I believe that my wording is measured enough not to hurt anybody it shouldn't.

I suggest that *all* Canadian local and territorial authorities cooperate with Québec and the Federal government to fully support the completion and implementation of the CANADIAN MULTILINGUAL STANDARD keyboard layout.

All the best,

Marcel

From richard.wordingham at ntlworld.com  Fri Oct 23 02:53:15 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 23 Oct 2015 08:53:15 +0100
Subject: Terminology (was: Latin glottal stop in ID in NWT, Canada)
In-Reply-To: <1022406759.1183.1445583561682.JavaMail.www@wwinf1h14>
References: <1022406759.1183.1445583561682.JavaMail.www@wwinf1h14>
Message-ID: <20151023085315.443d56b6@JRWUBU2>

On Fri, 23 Oct 2015 08:59:21 +0200 (CEST) Marcel Schneider wrote:

> Reading further, I stumbled upon yet other oddities. Some people are calling "Roman alphabet" what seemingly should be Latin script, while roman is today a font style only.
> The following are but examples, which are here because they're inside the thread's topic:

I think you're making the mistake of assuming that the Unicode Standard is written in English, rather than some jargon that is confusingly like it. I would like an English translation of Chapter 3 'Conformance', but I suspect a French translation would have higher priority, and I don't think that's going to happen any time soon.

'Latin script', in so far as it is translatable, translates into English as 'Roman alphabet'. In the language of the TUS, the word 'alphabet' has a more restricted meaning, whereby, for example, the Thai alphabet is not used for the Thai language! The Thai alphabet is, however, used for the Pali language and is promoted for Pattani Malay. When the characters of the Thai alphabet are used for the Thai language, they are used as an 'abugida', not as an 'alphabet'.

Richard.

From charupdate at orange.fr Fri Oct 23 03:52:31 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Fri, 23 Oct 2015 10:52:31 +0200 (CEST)
Subject: Terminology (was: Latin glottal stop in ID in NWT, Canada)
In-Reply-To: <20151023085315.443d56b6@JRWUBU2>
References: <1022406759.1183.1445583561682.JavaMail.www@wwinf1h14> <20151023085315.443d56b6@JRWUBU2>
Message-ID: <655778553.3264.1445590351640.JavaMail.www@wwinf1h14>

On Fri, 23 Oct 2015 08:53:15 +0100, Richard Wordingham wrote:

> On Fri, 23 Oct 2015 08:59:21 +0200 (CEST)
> Marcel Schneider wrote:
>
> > Reading on, I stumbled upon yet other oddities. Some people are calling “Roman alphabet” what seemingly should be the Latin script, while roman is today a font style only.
[...]
> 'Latin script', in so far as it is translatable, translates into English as 'Roman alphabet'.

Thank you for the correction. Indeed I wasn't aware that in the languages around me, “Roman alphabet”, though less recommended, is synonymous with “Latin alphabet”, and as such even has an entry in Simple Wikipedia, possibly to win minds over to “Latin alphabet” (which I've just helped along by replacing some instances). So I apologize to all the people I offended on this point, but I maintain the other statements unless otherwise corrected.

> I would like an English translation of Chapter 3 'Conformance', but I suspect a French translation would have higher priority, and I don't think that's going to happen any time soon.

Indeed, the last French translation being that of version 5.0, updating it would be consistent; but thinking about the Germans using the original documentation (presumably by reading it in English), and all the other countries and languages, and some Swiss people communicating across their internal language barrier by switching to English, I wonder whether this pain must be taken. For somebody who has read TUS in French, it could be much harder mailing about it on the Unicode Public List :)

Kind regards,

Marcel
From charupdate at orange.fr Fri Oct 23 06:34:26 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Fri, 23 Oct 2015 13:34:26 +0200 (CEST)
Subject: Terminology (was: Latin glottal stop in ID in NWT, Canada)
In-Reply-To: <20151023085315.443d56b6@JRWUBU2>
References: <1022406759.1183.1445583561682.JavaMail.www@wwinf1h14> <20151023085315.443d56b6@JRWUBU2>
Message-ID: <786526406.6308.1445600066983.JavaMail.www@wwinf1h14>

On Fri, 23 Oct 2015 08:53:15 +0100, Richard Wordingham wrote:

> I think you're making the mistake of assuming that the Unicode Standard is written in English, rather than some jargon that is confusingly like it.

The idea that some technical specification is not written in good English is generally an illusion produced by the very nature of the content. More specifically about TUS, I have a strong confidence in its accurate expression, which happens to be illustrated by the following quotation from the incriminated chapter (uppercase highlighting added):

>>> Additional information can be found throughout the other chapters of this core specification for the Unicode Standard. However, because of the need to keep extended discussions of scripts, sets of symbols, and other characters READABLE, material in other chapters is not always labeled as to its normative or informative status.
http://www.unicode.org/versions/Unicode8.0.0/ch03.pdf#G22672

> I would like an English translation of Chapter 3 'Conformance',

I guess that there may be some need of a *manual*, in the spirit that led the French translator to add annotations. Could you please quote some examples of what you wish to see expressed in a different way?

> 'Latin script', in so far as it is translatable, translates into English as 'Roman alphabet'.

I know your expertise from previous threads, but I have no means of adhering to the equivalence you put between a script and an alphabet. The delusion I point out in the quotations about the "Roman alphabet," or alternately, but far worse, the hypocrisy, is that while a handful of diacritics are certainly supported in order to spell French names in a reasonable and legible way, and while the æ and œ letters can scarcely be registered as "ae" or "oe" in Canada, other letters of the Latin (well, say, Roman) script are excluded, refused, and banned. And that is justified by telling people that a glottal stop isn't part of the Roman alphabet. "é" isn't either, as this character is not a part of the alphabet, just to take the one that is on *all* Canadian traditional keyboards. Nor is ?, which is on none of them [but is on the Canadian Multilingual Standard]. Agreed, I haven't been there to look into their database and at the cited printer.

> In the language of the TUS, the word 'alphabet' has a more restricted meaning, whereby, for example, the Thai alphabet is not used for the Thai language! The Thai alphabet is, however, used for the Pali language and is promoted for Pattani Malay. When the characters of the Thai alphabet are used for the Thai language, they are used as an 'abugida', not as an 'alphabet'.

Again, I do know nothing about Thai, but if in TUS an abugida can be referred to as an alphabet when it is used as such, it seems to me that the word 'alphabet' has a pretty extended meaning in TUS.

In any case, isolating an arbitrary subset inside our Latin script and promoting it as the so-called Roman alphabet to get some pretext for refusing to let compatriots or strangers bear their real and chosen names [quote] IS A SERIOUS INSULT [/quote].
Additionally, in the age of Unicode, this amounts to an insult to the whole work of the Consortium as well.

Marcel

From wjgo_10009 at btinternet.com Fri Oct 23 05:50:06 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Fri, 23 Oct 2015 11:50:06 +0100 (BST)
Subject: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)
In-Reply-To: <562983D5.2090808@kli.org>
References: <33232474.13418.1445504878174.JavaMail.root@webmail02.bt.ext.cpcloud.co.uk> <22651988.15114.1445505703382.JavaMail.defaultUser@defaultHost> <562983D5.2090808@kli.org>
Message-ID: <33319830.22484.1445597406500.JavaMail.defaultUser@defaultHost>

Mark E. Shoulson wrote:

> But if they say "no, you're out of scope" again, it probably means that you're out of scope, and submitting another proposal of the same thing will not make it any more in-scope.

Well, as at the time of writing this post, 11:06 am on Friday morning here in England, the document has neither appeared in the Unicode Document Register nor have I received any reply to my submission.

> I have no idea why deposition with the British Library is in any way significant or even relevant.

Four reasons.

1. Archiving of my writing for as long as civilization lasts.

2. Conservation, so that even if the idea is rejected now by the Unicode Consortium, the document is there for the future, when different people may look upon the idea differently.

3. Academic precedence: proof that I wrote about that idea at that time.

4. Proof of prior publication, in case someone else at a later date tries to patent the invention with a view to gaining a monopoly.

> It's nice to mail documents to people who will save them, yes.

Yes.

> You want to join Unicode as an official member and try to change its scope from the inside, where you can even vote? Be my guest.

Well, as an individual I cannot join as a Full Member, even if I could afford the money. I find it interesting that the Unicode Consortium publishes the Universal Declaration of Human Rights in many languages, yet a human being cannot join as a Full Member and have a vote. Interesting.

> You can't proceed with your research without a multinational standards committee changing *its entire scope and outlook* just to accommodate you?

Well, not its entire scope and outlook. Just a very small change from what has already been changed for flags, so as to allow this localized target display rather than just a direct glyph target display as for flags. The scope was changed for emoji and variation selectors requesting a colourful glyph; the scope was changed this year by undeprecating most of the tag characters and introducing the idea of a base character followed by a sequence of tag characters, with the sequence of tag characters derived from another standards document, external to Unicode. The scope is changed by considering using a base character and a sequence of tag characters for customized in-line graphics. The document is in the Unicode Document Register.

So can the scope change for my invention? I suggest that it can if the Unicode Technical Committee wants it to change. The issue is whether the Unicode Technical Committee will be allowed to consider that possibility in its meeting. It seems to me that discussion of a new invention should not be rejected solely on a scoping issue when the scoping rules were made before the invention was made.
I feel that it would serve no useful purpose for the encoding proposal to be rejected on the grounds of existing scope rules with no opportunity for the UTC as a whole to consider whether it wishes to change scope so that this invention, with its wonderful possibilities, can proceed.

It is not good to try to run in treacle, and I do not want to have to be satisfied with trying to develop some vastly underpowered markup-based system.

William Overington

23 October 2015

From wjgo_10009 at btinternet.com Fri Oct 23 09:08:22 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Fri, 23 Oct 2015 15:08:22 +0100 (BST)
Subject: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)
In-Reply-To:
References: <33232474.13418.1445504878174.JavaMail.root@webmail02.bt.ext.cpcloud.co.uk> <22651988.15114.1445505703382.JavaMail.defaultUser@defaultHost> <562983D5.2090808@kli.org>
Message-ID: <17310297.40446.1445609302606.JavaMail.defaultUser@defaultHost>

Peter Constable wrote:

> Hmmm... If I (or anyone else) were to forward to the British Library every item I post to this or other public lists or fora, or anything else I'd like to have publicly recorded, they'll provide a permanent, public record?

No. For Legal Deposit, there needs to be an association with the United Kingdom, as either where the item was produced, or published, or both. Also, the item must be published. However, given an association with the United Kingdom, as either where the item was produced, or published, or both, then the answer is, with some exceptions, broadly yes.

However, if someone from outside the United Kingdom sent the British Library something as a gift, then that is a separate matter from Legal Deposit, and I have been advised that the matter would be dealt with by a different department: the item would be sent to a Curator and a decision would be made as to whether to keep the item.

There is a web page leading to lots of information about Legal Deposit.

http://www.bl.uk/aboutus/legaldeposit/index.html
http://www.legislation.gov.uk/uksi/2013/777/regulation/13/made

However, sound is accepted when it is part of a larger item. So, for example, my sound recording, a .wav file, embedded in a pdf with some notes, was accepted.

http://www.users.globalnet.co.uk/~ngo/the_mobile_art_shop.pdf

Actually, the pdf is smaller than the original .wav file due to lossless compression when exporting the pdf from the Serif PagePlus desktop publishing program. If playing the sound, please note that there can be problems with some browser and pdf reader combinations. The best thing is to download the file to local storage, then open Adobe Reader, then open the file from within Adobe Reader.

I have deposited various types of item, including .pdf files (including three pdfs each with a sound recording) and .TTF files. I think that I was the first person to deposit a .TTF file.

> I would have expected them to be pretty selective of what things they decide to hang onto.

The idea is to gather a collection of all of the cultural output of the United Kingdom. So the collection policy is comprehensive, with a few exceptions as to type of publication, yet not based on any assessment of the literary merit of an item of a collected type.
Items are gathered by the British Library automated harvester program from my family webspace from time to time, yet for some items I send a copy as an email attachment at or soon after publication and receive an email receipt, so that I know that the item is stored at the British Library.

William Overington

23 October 2015

From wjgo_10009 at btinternet.com Fri Oct 23 10:01:01 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Fri, 23 Oct 2015 16:01:01 +0100 (BST)
Subject: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)
In-Reply-To: <562914A9.6030802@unicode.org>
References: <33365178.33329.1444395933996.JavaMail.defaultUser@defaultHost> <6041047.15822.1444472090118.JavaMail.defaultUser@defaultHost> <26364438.57814.1445015417239.JavaMail.defaultUser@defaultHost> <562914A9.6030802@unicode.org>
Message-ID: <7184978.47471.1445612461273.JavaMail.defaultUser@defaultHost>

Thank you for your comprehensive answer.

Rick McGowan wrote:

> Personally, I think you're getting ahead of yourself. First, you should demonstrate that you have done research and produced results that at least some people find so useful and important that they are eager to implement the findings. Then, once you have done that, think about standardizing something, but only after you have a working model of the thing sufficient to demonstrate its general utility.

I am an independent researcher, researching at home, using the internet and various software items on a laptop computer. I am not able to produce a working model. I can mostly only produce thought experiments, sometimes expressed as a simulation, like a story narrative. Maybe I could produce a short animation movie.

> While I do not speak for the UTC in any way, observations of the committee over a period of some years have led me to conclude that they never encode something, call it "X", on pure speculation that some future research might result in "X" being useful for some purpose that has not even been demonstrated as a need, or clearly enough articulated to engender the committee's confidence in its potential utility.

Well, as I say, I am an independent researcher, researching at home. May I just mention one thing, though, which might be regarded as significant.

A short time ago I was talking with someone who is a clinician and I asked whether there were issues trying to communicate with people through the language barrier. I was told that sometimes people bring a relative or friend to translate. An example was given to me of sometimes needing to use mime to try to express the meaning of "Have you vomited?".

I asked if the following would be helpful. Use your computer to look down a menu for a preset sentence "Have you vomited?". Select the sentence. Behind the scenes a code is generated. Throw the code to the mobile telephone of the patient. On the screen of the patient's mobile telephone the sentence, localized into his or her language, is displayed.

I said that there would be a standardized list of preset sentences, set out in English, as International Standards are produced in English, and that the National Standardization Body for each country would translate the list into the language of its country and produce a list to convert the codes to the local language.

There was amazement and enthusiasm for this possibility.

So there we are.
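Mechanically, the scheme is nothing more than a shared table mapping sentence codes to vetted translations, with only the code travelling between devices. A minimal Python sketch (the code "MED0042", the translations, and the function name are invented for illustration; no such standardized list exists):

    # Toy model of the preset-sentence idea. "MED0042" and the translations
    # are invented placeholders, not entries from any real standard.
    SENTENCES = {
        "MED0042": {
            "en": "Have you vomited?",
            "fr": "Avez-vous vomi ?",
            "de": "Haben Sie erbrochen?",
        },
    }

    def localize(code: str, locale: str) -> str:
        """Resolve a sentence code to the vetted translation for a locale,
        falling back to the English reference text."""
        translations = SENTENCES[code]
        return translations.get(locale, translations["en"])

    print(localize("MED0042", "fr"))  # Avez-vous vomi ?

Because each device resolves the code against its own locally stored table, no at-the-time translation service is involved; the provenance of each string is whatever vetting went into building the table.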
The supreme irony of all of this is that there has been much objection to my invention in this mailing list over the years, with no good reason ever stated, yet it would be the very existence of The Unicode Standard itself that would allow the localized text to appear on the screen of the mobile telephone of the patient!

If this invention had been made in the research laboratory of a large information technology company maybe things would be very different.

William Overington

23 October 2015

From rick at unicode.org Fri Oct 23 12:11:56 2015
From: rick at unicode.org (Rick McGowan)
Date: Fri, 23 Oct 2015 10:11:56 -0700
Subject: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)
In-Reply-To: <7184978.47471.1445612461273.JavaMail.defaultUser@defaultHost>
References: <33365178.33329.1444395933996.JavaMail.defaultUser@defaultHost> <6041047.15822.1444472090118.JavaMail.defaultUser@defaultHost> <26364438.57814.1445015417239.JavaMail.defaultUser@defaultHost> <562914A9.6030802@unicode.org> <7184978.47471.1445612461273.JavaMail.defaultUser@defaultHost>
Message-ID: <562A6A5C.7000506@unicode.org>

William,

All right... This is likely to be my last posting on the subject...

> ... there has been much objection to my invention in this mailing list over the years, with no good reason ever stated, ...
>
> If this invention had been made in the research laboratory of a large information technology company maybe things would be very different.

Please see attached image, for example. While it's not yet as fun as Star Trek, this kind of thing can be done for simple interactions in a variety of languages using a $20 cell phone...

See also: https://en.wikipedia.org/wiki/Google_Translate

/As of October 2015, Google Translate supports 90 languages at various levels and serves over 200 million people daily./

[Attachment: have-you-vomited.jpg, image/jpeg, 45351 bytes]

From eik at iki.fi Fri Oct 23 12:31:57 2015
From: eik at iki.fi (Erkki I Kolehmainen)
Date: Fri, 23 Oct 2015 20:31:57 +0300
Subject: VS: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)
In-Reply-To: <7184978.47471.1445612461273.JavaMail.defaultUser@defaultHost>
References: <33365178.33329.1444395933996.JavaMail.defaultUser@defaultHost> <6041047.15822.1444472090118.JavaMail.defaultUser@defaultHost> <26364438.57814.1445015417239.JavaMail.defaultUser@defaultHost> <562914A9.6030802@unicode.org> <7184978.47471.1445612461273.JavaMail.defaultUser@defaultHost>
Message-ID: <000301d10db8$b7e2c250$27a846f0$@fi>

Dear Mr. Overington,

First of all, you have never paid any attention to the formidable problems of getting vetted translations of whatever proposed (or to-be-proposed) standard sentences of yours. You have admitted that you are not at all familiar with CLDR, but the people who have worked on CLDR are fully aware of the problems of getting agreement on localized expressions for all kinds of items.

The value of deposit at the British Library seems questionable at best. Furthermore, if published means published on this list, it has no value whatsoever, since it does not mean any peer review and acceptance, which, as you well know, isn't forthcoming.
Incidentally, the standards body that has had considerable dealings with some of the kinds of problems that you claim to be researching is ETSI Human Factors. You might want to approach them in order to get any support.

Sincerely,

Erkki I. Kolehmainen

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of William_J_G Overington
Sent: 23 October 2015 18:01
To: rick at unicode.org; unicode at unicode.org
Subject: Re: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)

Thank you for your comprehensive answer.

Rick McGowan wrote:

> Personally, I think you're getting ahead of yourself. First, you should demonstrate that you have done research and produced results that at least some people find so useful and important that they are eager to implement the findings. Then, once you have done that, think about standardizing something, but only after you have a working model of the thing sufficient to demonstrate its general utility.

I am an independent researcher, researching at home, using the internet and various software items on a laptop computer. I am not able to produce a working model. I can mostly only produce thought experiments, sometimes expressed as a simulation, like a story narrative. Maybe I could produce a short animation movie.

> While I do not speak for the UTC in any way, observations of the committee over a period of some years have led me to conclude that they never encode something, call it "X", on pure speculation that some future research might result in "X" being useful for some purpose that has not even been demonstrated as a need, or clearly enough articulated to engender the committee's confidence in its potential utility.

Well, as I say, I am an independent researcher, researching at home. May I just mention one thing, though, which might be regarded as significant.

A short time ago I was talking with someone who is a clinician and I asked whether there were issues trying to communicate with people through the language barrier. I was told that sometimes people bring a relative or friend to translate. An example was given to me of sometimes needing to use mime to try to express the meaning of "Have you vomited?".

I asked if the following would be helpful. Use your computer to look down a menu for a preset sentence "Have you vomited?". Select the sentence. Behind the scenes a code is generated. Throw the code to the mobile telephone of the patient. On the screen of the patient's mobile telephone the sentence, localized into his or her language, is displayed.

I said that there would be a standardized list of preset sentences, set out in English, as International Standards are produced in English, and that the National Standardization Body for each country would translate the list into the language of its country and produce a list to convert the codes to the local language.

There was amazement and enthusiasm for this possibility.

So there we are.

The supreme irony of all of this is that there has been much objection to my invention in this mailing list over the years, with no good reason ever stated, yet it would be the very existence of The Unicode Standard itself that would allow the localized text to appear on the screen of the mobile telephone of the patient!

If this invention had been made in the research laboratory of a large information technology company maybe things would be very different.
William Overington

23 October 2015

From srl at icu-project.org Fri Oct 23 13:01:23 2015
From: srl at icu-project.org (Steven R. Loomis)
Date: Fri, 23 Oct 2015 11:01:23 -0700
Subject: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)
In-Reply-To: <7184978.47471.1445612461273.JavaMail.defaultUser@defaultHost>
References: <33365178.33329.1444395933996.JavaMail.defaultUser@defaultHost> <6041047.15822.1444472090118.JavaMail.defaultUser@defaultHost> <26364438.57814.1445015417239.JavaMail.defaultUser@defaultHost> <562914A9.6030802@unicode.org> <7184978.47471.1445612461273.JavaMail.defaultUser@defaultHost>
Message-ID: <95E382E8-13E6-4AF0-AE77-995E07375876@icu-project.org>

William,

I work in the research laboratory of (but do not speak for) a large information technology company. I'm also their primary representative to Unicode and other standards bodies. I would not (and have not) leapt from an idea to a document to a standard.

I won't repeat the good and helpful advice you have already received. Make your first target a working model, not standardization. That is the way the research laboratory of a large information technology company works.

S

Sent from our iPhone.

> On 23 Oct 2015, at 8:01 AM, William_J_G Overington wrote:
>
> If this invention had been made in the research laboratory of a large information technology company maybe things would be very different.

From duerst at it.aoyama.ac.jp Fri Oct 23 16:36:06 2015
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Sat, 24 Oct 2015 06:36:06 +0900
Subject: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)
In-Reply-To: <562A6A5C.7000506@unicode.org>
References: <33365178.33329.1444395933996.JavaMail.defaultUser@defaultHost> <6041047.15822.1444472090118.JavaMail.defaultUser@defaultHost> <26364438.57814.1445015417239.JavaMail.defaultUser@defaultHost> <562914A9.6030802@unicode.org> <7184978.47471.1445612461273.JavaMail.defaultUser@defaultHost> <562A6A5C.7000506@unicode.org>
Message-ID: <562AA846.10306@it.aoyama.ac.jp>

On 2015/10/24 02:11, Rick McGowan wrote:

> William,
>
> All right... This is likely to be my last posting on the subject...
>
>> ... there has been much objection to my invention in this mailing list over the years, with no good reason ever stated, ...
>>
>> If this invention had been made in the research laboratory of a large information technology company maybe things would be very different.

It's easy to guess that many people have made very similar inventions before. For example, there are many books that contain simple phrases in a few languages for tourists.

Also, if you had your set of sentences and their translations, it wouldn't be difficult to create e.g. a smart phone application for it.

The doctor you mentioned was excited about your idea because she isn't a language specialist. If she had thought about, or experimented with, the idea, she would quickly have come to a point where she wants more and more sentences, for all kinds of slightly different situations. That's the point where she will start to see that your idea isn't actually that great.

> Please see attached image, for example.
> While it's not yet as fun as Star Trek, this kind of thing can be done for simple interactions in a variety of languages using a $20 cell phone...
>
> See also: https://en.wikipedia.org/wiki/Google_Translate
>
> /As of October 2015, Google Translate supports 90 languages at various levels and serves over 200 million people daily./

Well, the translation isn't perfect :-(. It translates "Have you vomited?" to "あなたは嘔吐していますか？". Apart from the unnecessary (in Japanese) subject, and the usually not used question mark, it's present tense, corresponding to "Are you vomiting?". I'm sure no doctor would have to ask this.

Regards, Martin.

From richard.wordingham at ntlworld.com Fri Oct 23 17:16:32 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 23 Oct 2015 23:16:32 +0100
Subject: Terminology (was: Latin glottal stop in ID in NWT, Canada)
In-Reply-To: <786526406.6308.1445600066983.JavaMail.www@wwinf1h14>
References: <1022406759.1183.1445583561682.JavaMail.www@wwinf1h14> <20151023085315.443d56b6@JRWUBU2> <786526406.6308.1445600066983.JavaMail.www@wwinf1h14>
Message-ID: <20151023231632.649d706e@JRWUBU2>

On Fri, 23 Oct 2015 13:34:26 +0200 (CEST)
Marcel Schneider wrote:

> On Fri, 23 Oct 2015 08:53:15 +0100, Richard Wordingham wrote:
>
> > I would like an English translation of Chapter 3 'Conformance',
>
> I guess that there may be some need of a *manual*, in the spirit that led the French translator to add annotations. Could you please quote some examples of what you wish to see expressed in a different way?

"C5: A process shall not assume that it is required to interpret any particular coded character sequence."

I think this is meant to mean that processes do not have to interpret every coded character sequence presented to them, but this appears to be a concession and not a requirement, and I cannot derive it from the text. An example of a non-compliant process would be helpful. I could interpret this requirement as prohibiting the generation of a missing-glyph glyph, for that is an error report that the process has failed to interpret a coded character sequence. I hope this is not an intended interpretation.

"C6: A process shall not assume that the interpretations of two canonical-equivalent character sequences are distinct."

Firstly, I have grave difficulties assigning mental activities to processes.

Secondly, it may be possible to interpret "A process shall not assume X" as "A process shall function correctly regardless of whether X holds."

However, let image(Y) be the bitmap depicting the string Y. Then the following logic would be non-compliant:

    if A and B are canonically equivalent and image(A) and image(B) are different, then
        write(A, " and ", B, " are canonically equivalent but have different images ", image(A), " and ", image(B));
    end if

The logic is non-compliant, for if it is invoked then the write statement will only work correctly if image(A) and image(B) are different, i.e. if A and B are interpreted differently. Apparently it is permissible to render canonically equivalent sequences differently, so image(A) and image(B) might be different even though A and B are canonically equivalent.

I therefore conclude that C6 is in some language that I do not adequately understand.
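For what it's worth, the canonical-equivalence test in the hypothesis is mechanical: two strings are canonically equivalent exactly when their NFD forms are identical. A minimal Python sketch of just that test (the rendering function image(), by contrast, has no portable counterpart):

    import unicodedata

    def canonically_equivalent(a: str, b: str) -> bool:
        # Canonical equivalence holds exactly when the NFD forms are identical.
        return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

    print(canonically_equivalent("\u00E9", "e\u0301"))  # True: precomposed vs decomposed e-acute
    print("\u00E9" == "e\u0301")                        # False: the code point sequences are distinct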
> Again, I do know nothing about Thai, but if in TUS an abugida can be referred to as an alphabet when it is used as such, it seems to me that the word 'alphabet' has a pretty extended meaning in TUS.

TUS tries to make accurate use of the distinction between 'alphabet', 'abugida' and 'abjad', 20th-century jargon promoted if not invented by Peter Daniels. The distinction lies in the way vowels are indicated - always / with a default / not at all. The distinction may be useful for a writing system, i.e. a way of using the 'script', but it rapidly encounters the problem that a script may have several different writing systems. For example, the presence or absence of vowel marks switches the Arabic and Hebrew scripts, as used for those languages, between being an abjad and being an alphabet.

> In any case, isolating an arbitrary subset inside our Latin script and promoting it as the so-called Roman alphabet to get some pretext for refusing to let compatriots or strangers bear their real and chosen names [quote] IS A SERIOUS INSULT [/quote].

It is not a matter of an 'arbitrary' subset. I expect the relevant subset is the *French* alphabet, assuming Quebec has followed (or preceded?) France and added 'w' to the alphabet. That this subset should be confused with the concept of 'Roman' is not surprising, even though the Romans lacked 'J', 'j', 'U', 'v', 'W' and 'w'.

> Additionally, in the age of Unicode, this amounts to an insult to the whole work of the Consortium as well.

Unicode does not dictate what is accepted as 'the alphabet'; it is only recently that 'j' has been accepted as part of the Welsh alphabet. When I was a child, I learnt that there was no 'j' in the Welsh alphabet - and wondered how the Joneses were supposed to write their names in Welsh! (One partial answer, of course, is that the very common Welsh surname 'Jones' is English for 'Evans'.)

Richard.

From doug at ewellic.org Fri Oct 23 17:25:19 2015
From: doug at ewellic.org (Doug Ewell)
Date: Fri, 23 Oct 2015 15:25:19 -0700
Subject: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)
Message-ID: <20151023152519.665a7a7059d7ee80bb4d670165c8327d.03141f2aa3.wbe@email03.secureserver.net>

Martin J. Dürst wrote:

> Well, the translation isn't perfect :-(. It translates "Have you vomited?" to "あなたは嘔吐していますか？". Apart from the unnecessary (in Japanese) subject, and the usually not used question mark, it's present tense, corresponding to "Are you vomiting?". I'm sure no doctor would have to ask this.

Doctors ask this kind of question all the time: "Are you coughing? Are you wheezing?" The clear intent of such a question is not so much "Are you now?" but rather "Have you been lately?"

In any case, the language-challenged doctor-patient pair should reach common understanding fairly quickly.

--
Doug Ewell | http://ewellic.org | Thornton, CO

From mollycatblack at gmail.com Fri Oct 23 21:33:43 2015
From: mollycatblack at gmail.com (Molly Black)
Date: Fri, 23 Oct 2015 20:33:43 -0600
Subject: crafting emoji
Message-ID:

Why are there no knitting needles, yarn, sewing needle with thread, or sewing machine in the emoji library? Has this been discussed already?

From mark at macchiato.com Fri Oct 23 23:03:05 2015
From: mark at macchiato.com (Mark Davis ☕️)
Date: Fri, 23 Oct 2015 21:03:05 -0700
Subject: crafting emoji
In-Reply-To:
References:
Message-ID:

We haven't seen a proposal for that. See http://www.unicode.org/emoji/selection.html for how to submit one.
Mark

On Fri, Oct 23, 2015 at 7:33 PM, Molly Black wrote:

> Why are there no knitting needles, yarn, sewing needle with thread, or sewing machine in the emoji library? Has this been discussed already?

From eliz at gnu.org Sat Oct 24 00:40:32 2015
From: eliz at gnu.org (Eli Zaretskii)
Date: Sat, 24 Oct 2015 08:40:32 +0300
Subject: Terminology (was: Latin glottal stop in ID in NWT, Canada)
In-Reply-To: <20151023231632.649d706e@JRWUBU2>
References: <1022406759.1183.1445583561682.JavaMail.www@wwinf1h14> <20151023085315.443d56b6@JRWUBU2> <786526406.6308.1445600066983.JavaMail.www@wwinf1h14> <20151023231632.649d706e@JRWUBU2>
Message-ID: <83io5wygbz.fsf@gnu.org>

> Date: Fri, 23 Oct 2015 23:16:32 +0100
> From: Richard Wordingham
>
> "C6: A process shall not assume that the interpretations of two canonical-equivalent character sequences are distinct."
>
> Firstly, I have grave difficulties assigning mental activities to processes.
>
> Secondly, it may be possible to interpret "A process shall not assume X" as "A process shall function correctly regardless of whether X holds."
>
> However, let image(Y) be the bitmap depicting the string Y. Then the following logic would be non-compliant:
>
>     if A and B are canonically equivalent and image(A) and image(B) are different, then
>         write(A, " and ", B, " are canonically equivalent but have different images ", image(A), " and ", image(B));
>     end if
>
> The logic is non-compliant, for if it is invoked then the write statement will only work correctly if image(A) and image(B) are different, i.e. if A and B are interpreted differently. Apparently it is permissible to render canonically equivalent sequences differently, so image(A) and image(B) might be different even though A and B are canonically equivalent.
>
> I therefore conclude that C6 is in some language that I do not adequately understand.

AFAIU, Unicode is about processing text, and only mentions display rarely, where it's directly related to the processing part. So the above is about _processing_ canonically-equivalent sequences, not about their display. When looked at in this way, I see no difficulties in understanding the text.

> > Again, I do know nothing about Thai, but if in TUS an abugida can be referred to as an alphabet when it is used as such, it seems to me that the word 'alphabet' has a pretty extended meaning in TUS.
>
> TUS tries to make accurate use of the distinction between 'alphabet', 'abugida' and 'abjad', 20th-century jargon promoted if not invented by Peter Daniels. The distinction lies in the way vowels are indicated - always / with a default / not at all. The distinction may be useful for a writing system, i.e. a way of using the 'script', but it rapidly encounters the problem that a script may have several different writing systems. For example, the presence or absence of vowel marks switches the Arabic and Hebrew scripts, as used for those languages, between being an abjad and being an alphabet.

The Hebrew script is never an alphabet, AFAIU; it's likely an abugida when the vowel marks are used. The so-called "full spelling", where some vowels are indicated by consonants, does not replace all the vowels with consonants, so it isn't, strictly speaking, an alphabet in the above sense.
From c933103 at gmail.com Sat Oct 24 01:33:05 2015
From: c933103 at gmail.com (gfb hjjhjh)
Date: Sat, 24 Oct 2015 14:33:05 +0800
Subject: Fwd: Terminology (was: Latin glottal stop in ID in NWT, Canada)
In-Reply-To:
References: <1022406759.1183.1445583561682.JavaMail.www@wwinf1h14> <20151023085315.443d56b6@JRWUBU2> <786526406.6308.1445600066983.JavaMail.www@wwinf1h14>
Message-ID:

---------- Forwarded message ----------
From: gfb hjjhjh
Date: 2015-10-23 20:17 GMT+08:00
Subject: Re: Terminology (was: Latin glottal stop in ID in NWT, Canada)
To: Marcel Schneider

Writing other languages in the Latin alphabet is still called romanization, not latinization.

2015-10-23 19:34 GMT+08:00 Marcel Schneider:

> On Fri, 23 Oct 2015 08:53:15 +0100, Richard Wordingham <richard.wordingham at ntlworld.com> wrote:
>
> > I think you're making the mistake of assuming that the Unicode Standard is written in English, rather than some jargon that is confusingly like it.
>
> The idea that some technical specification is not written in good English is generally an illusion produced by the very nature of the content. More specifically about TUS, I have a strong confidence in its accurate expression, which happens to be illustrated by the following quotation from the incriminated chapter (uppercase highlighting added):
>
> >>> Additional information can be found throughout the other chapters of this core specification for the Unicode Standard. However, because of the need to keep extended discussions of scripts, sets of symbols, and other characters READABLE, material in other chapters is not always labeled as to its normative or informative status.
> http://www.unicode.org/versions/Unicode8.0.0/ch03.pdf#G22672
>
> > I would like an English translation of Chapter 3 'Conformance',
>
> I guess that there may be some need of a *manual*, in the spirit that led the French translator to add annotations. Could you please quote some examples of what you wish to see expressed in a different way?
>
> > 'Latin script', in so far as it is translatable, translates into English as 'Roman alphabet'.
>
> I know your expertise from previous threads, but I have no means of adhering to the equivalence you put between a script and an alphabet. The delusion I point out in the quotations about the "Roman alphabet," or alternately, but far worse, the hypocrisy, is that while a handful of diacritics are certainly supported in order to spell French names in a reasonable and legible way, and while the æ and œ letters can scarcely be registered as "ae" or "oe" in Canada, other letters of the Latin (well, say, Roman) script are excluded, refused, and banned. And that is justified by telling people that a glottal stop isn't part of the Roman alphabet. "é" isn't either, as this character is not a part of the alphabet, just to take the one that is on *all* Canadian traditional keyboards. Nor is ?, which is on none of them [but is on the Canadian Multilingual Standard]. Agreed, I haven't been there to look into their database and at the cited printer.
>
> > In the language of the TUS, the word 'alphabet' has a more restricted meaning, whereby, for example, the Thai alphabet is not used for the Thai language! The Thai alphabet is, however, used for the Pali language and is promoted for Pattani Malay. When the characters of the Thai alphabet are used for the Thai language, they are used as an 'abugida', not as an 'alphabet'.
> Again, I do know nothing about Thai, but if in TUS an abugida can be referred to as an alphabet when it is used as such, it seems to me that the word 'alphabet' has a pretty extended meaning in TUS.
>
> In any case, isolating an arbitrary subset inside our Latin script and promoting it as the so-called Roman alphabet to get some pretext for refusing to let compatriots or strangers bear their real and chosen names [quote] IS A SERIOUS INSULT [/quote].
>
> Additionally, in the age of Unicode, this amounts to an insult to the whole work of the Consortium as well.
>
> Marcel

From Shawn.Steele at microsoft.com Sat Oct 24 02:59:56 2015
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Sat, 24 Oct 2015 07:59:56 +0000
Subject: crafting emoji
In-Reply-To:
References:
Message-ID:

Seeing the title first, I read “crafting” as a verb and thought you wanted to knit some ?

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Molly Black
Sent: Friday, October 23, 2015 7:34 PM
To: unicode at unicode.org
Subject: crafting emoji

Why are there no knitting needles, yarn, sewing needle with thread, or sewing machine in the emoji library? Has this been discussed already?

From richard.wordingham at ntlworld.com Sat Oct 24 06:33:32 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sat, 24 Oct 2015 12:33:32 +0100
Subject: Terminology (was: Latin glottal stop in ID in NWT, Canada)
In-Reply-To: <83io5wygbz.fsf@gnu.org>
References: <1022406759.1183.1445583561682.JavaMail.www@wwinf1h14> <20151023085315.443d56b6@JRWUBU2> <786526406.6308.1445600066983.JavaMail.www@wwinf1h14> <20151023231632.649d706e@JRWUBU2> <83io5wygbz.fsf@gnu.org>
Message-ID: <20151024123332.4a3b263d@JRWUBU2>

On Sat, 24 Oct 2015 08:40:32 +0300
Eli Zaretskii wrote:

> > Date: Fri, 23 Oct 2015 23:16:32 +0100
> > From: Richard Wordingham
> >
> > "C6: A process shall not assume that the interpretations of two canonical-equivalent character sequences are distinct."
> >
> > Firstly, I have grave difficulties assigning mental activities to processes.
> >
> > Secondly, it may be possible to interpret "A process shall not assume X" as "A process shall function correctly regardless of whether X holds."
> >
> > However, let image(Y) be the bitmap depicting the string Y. Then the following logic would be non-compliant:
> >
> >     if A and B are canonically equivalent and image(A) and image(B) are different, then
> >         write(A, " and ", B, " are canonically equivalent but have different images ", image(A), " and ", image(B));
> >     end if
> >
> > The logic is non-compliant, for if it is invoked then the write statement will only work correctly if image(A) and image(B) are different, i.e. if A and B are interpreted differently. Apparently it is permissible to render canonically equivalent sequences differently, so image(A) and image(B) might be different even though A and B are canonically equivalent.
> >
> > I therefore conclude that C6 is in some language that I do not adequately understand.
>
> AFAIU, Unicode is about processing text, and only mentions display rarely, where it's directly related to the processing part. So the above is about _processing_ canonically-equivalent sequences, not about their display. When looked at in this way, I see no difficulties in understanding the text.
Display is part of interpretation - indeed, it is currently the most important part. At least, I would interpret displaying U+0041 with a glyph like 'X' (an example in 'D2 Character identity') as violating:

"C4: A process shall interpret a coded character sequence according to the character semantics established by this standard, if that process does interpret that coded character sequence."

I chose the complicated function image() as being less controversial. However, as you do not think it interprets a string, consider the full, default toUppercase() instead.

The problem lies with the troublesome U+0345 COMBINING GREEK YPOGEGRAMMENI (subscript iota) with ccc=240, which uppercases to U+0399 GREEK CAPITAL LETTER IOTA with ccc=0. While U+0345 commutes with Greek accents, U+0399 does not. Thus U+1F80 GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI uppercases, in full mode, to <U+1F08, U+0399>, but the canonically equivalent lower case form <U+03B1, U+0345, U+0313> uppercases, in full mode, to the inequivalent upper case <U+0391, U+0399, U+0313>.

The brute force solution to this (in practice minor) issue is to convert strings to NFD before upper-casing, but this would fall foul of one guess at the meaning of C6, namely "An author shall not assume that the interpretations of two canonical-equivalent character sequences are distinct". Of course, if that is the meaning, determining whether X = toNFC(toUppercase(toNFD(X))) is compliant depends on answering the question, "Did the author think he could get a different result if he omitted the conversion to NFD?". I'm not sure whether the code would be compliant under my interpretation if the author was unsure as to whether omitting the conversion would get a different result.
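The asymmetry is easy to reproduce; a minimal Python sketch, assuming (as holds in CPython) that str.upper() applies the full case mappings:

    import unicodedata

    def nfc(s: str) -> str:
        return unicodedata.normalize("NFC", s)

    def nfd(s: str) -> str:
        return unicodedata.normalize("NFD", s)

    precomposed = "\u1F80"            # GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI
    reordered = "\u03B1\u0345\u0313"  # equivalent: ypogegrammeni (ccc=240) written before psili (ccc=230)

    assert nfd(precomposed) == nfd(reordered)  # the two inputs are canonically equivalent

    print(nfc(precomposed.upper()))     # 'ἈΙ' - the psili stays on the alpha
    print(nfc(reordered.upper()))       # 'ΑἸ' - U+0345 became U+0399 (ccc=0), stranding the psili on the iota
    print(nfc(nfd(reordered).upper()))  # 'ἈΙ' - the brute-force fix: normalize to NFD before casing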
> The Hebrew script is never an alphabet, AFAIU; it's likely an abugida when the vowel marks are used.

No, the definition of an abugida is that there is a default vowel which is indicated by the absence of any vowel mark. In fully pointed Hebrew, it's only final, silent and quiescent consonants that lack vowel marks. I don't like the definitions, because they are extremely vulnerable to small changes in use. Indeed, having taken the name from the consonant system underlying the Ethiopic syllabary, the inventors of the term subsequently concluded that the eponymous abugida was not actually an abugida!

> The so-called "full spelling", where some vowels are indicated by consonants, does not replace all the vowels with consonants, so it isn't, strictly speaking, an alphabet in the above sense.

Nor would I claim it as such.

Richard.

From eliz at gnu.org Sat Oct 24 06:43:27 2015
From: eliz at gnu.org (Eli Zaretskii)
Date: Sat, 24 Oct 2015 14:43:27 +0300
Subject: Terminology (was: Latin glottal stop in ID in NWT, Canada)
In-Reply-To: <20151024123332.4a3b263d@JRWUBU2>
References: <1022406759.1183.1445583561682.JavaMail.www@wwinf1h14> <20151023085315.443d56b6@JRWUBU2> <786526406.6308.1445600066983.JavaMail.www@wwinf1h14> <20151023231632.649d706e@JRWUBU2> <83io5wygbz.fsf@gnu.org> <20151024123332.4a3b263d@JRWUBU2>
Message-ID: <83twpgwkyo.fsf@gnu.org>

> Date: Sat, 24 Oct 2015 12:33:32 +0100
> From: Richard Wordingham
>
> > AFAIU, Unicode is about processing text, and only mentions display rarely, where it's directly related to the processing part. So the above is about _processing_ canonically-equivalent sequences, not about their display. When looked at in this way, I see no difficulties in understanding the text.
>
> Display is part of interpretation - indeed, it is currently the most important part. At least, I would interpret displaying U+0041 with a glyph like 'X' (an example in 'D2 Character identity') as violating:
>
> "C4: A process shall interpret a coded character sequence according to the character semantics established by this standard, if that process does interpret that coded character sequence."

Sorry, I don't see this as a violation. "Interpret" doesn't have to include display.

> > The Hebrew script is never an alphabet, AFAIU; it's likely an abugida when the vowel marks are used.
>
> No, the definition of an abugida is that there is a default vowel which is indicated by the absence of any vowel mark. In fully pointed Hebrew, it's only final, silent and quiescent consonants that lack vowel marks.

Are you saying that "a default vowel which is indicated by the absence of any vowel mark" excludes the absence of any vowel? If so, then I'd consider that strange, but all it means is that Hebrew does not fit this classification at all.

> > The so-called "full spelling", where some vowels are indicated by consonants, does not replace all the vowels with consonants, so it isn't, strictly speaking, an alphabet in the above sense.
>
> Nor would I claim it as such.

But you did say:

> the presence or absence of vowel marks switches the Arabic and Hebrew scripts, as used for those languages, between being an abjad and being an alphabet.

Then when is the Hebrew script an alphabet, in your view?

From richard.wordingham at ntlworld.com Sat Oct 24 07:45:31 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sat, 24 Oct 2015 13:45:31 +0100
Subject: Terminology (was: Latin glottal stop in ID in NWT, Canada)
In-Reply-To: <83twpgwkyo.fsf@gnu.org>
References: <1022406759.1183.1445583561682.JavaMail.www@wwinf1h14> <20151023085315.443d56b6@JRWUBU2> <786526406.6308.1445600066983.JavaMail.www@wwinf1h14> <20151023231632.649d706e@JRWUBU2> <83io5wygbz.fsf@gnu.org> <20151024123332.4a3b263d@JRWUBU2> <83twpgwkyo.fsf@gnu.org>
Message-ID: <20151024134531.191e37a9@JRWUBU2>

On Sat, 24 Oct 2015 14:43:27 +0300
Eli Zaretskii wrote:

> Then when is the Hebrew script an alphabet, in your view?

The Hebrew script for Hebrew is an alphabet when the niqqud are used, as in ordinary copies of the Old Testament, e.g. https://www.academic-bible.com/en/online-bibles/biblia-hebraica-stuttgartensia-bhs/read-the-bible-text/ . I don't feel that calling it an alphabet as opposed to an abjad is helpful, but by the definitions it's an alphabet. Yiddish has been described as using an alphabet, but that may not apply to all Yiddish orthographies.

Richard.

From eliz at gnu.org Sat Oct 24 08:04:24 2015
From: eliz at gnu.org (Eli Zaretskii)
Date: Sat, 24 Oct 2015 16:04:24 +0300
Subject: Terminology (was: Latin glottal stop in ID in NWT, Canada)
In-Reply-To: <20151024134531.191e37a9@JRWUBU2>
References: <1022406759.1183.1445583561682.JavaMail.www@wwinf1h14> <20151023085315.443d56b6@JRWUBU2> <786526406.6308.1445600066983.JavaMail.www@wwinf1h14> <20151023231632.649d706e@JRWUBU2> <83io5wygbz.fsf@gnu.org> <20151024123332.4a3b263d@JRWUBU2> <83twpgwkyo.fsf@gnu.org> <20151024134531.191e37a9@JRWUBU2>
Message-ID: <83r3kkwh7r.fsf@gnu.org>

> Date: Sat, 24 Oct 2015 13:45:31 +0100
> From: Richard Wordingham
>
> The Hebrew script for Hebrew is an alphabet when the niqqud are used, as in ordinary copies of the Old Testament, e.g. https://www.academic-bible.com/en/online-bibles/biblia-hebraica-stuttgartensia-bhs/read-the-bible-text/ . I don't feel that calling it an alphabet as opposed to an abjad is helpful, but by the definitions it's an alphabet.

An alphabet, AFAIU, has to have vowels that are represented as letters, equally to consonants. Hebrew with niqqud doesn't fit that description, because niqqud are not letters.

> Yiddish has been described as using an alphabet

I agree, but I wasn't talking about Yiddish.

From richard.wordingham at ntlworld.com Sat Oct 24 09:17:02 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sat, 24 Oct 2015 15:17:02 +0100
Subject: Terminology (was: Latin glottal stop in ID in NWT, Canada)
In-Reply-To: <83r3kkwh7r.fsf@gnu.org>
References: <1022406759.1183.1445583561682.JavaMail.www@wwinf1h14> <20151023085315.443d56b6@JRWUBU2> <786526406.6308.1445600066983.JavaMail.www@wwinf1h14> <20151023231632.649d706e@JRWUBU2> <83io5wygbz.fsf@gnu.org> <20151024123332.4a3b263d@JRWUBU2> <83twpgwkyo.fsf@gnu.org> <20151024134531.191e37a9@JRWUBU2> <83r3kkwh7r.fsf@gnu.org>
Message-ID: <20151024151702.07bfd524@JRWUBU2>

On Sat, 24 Oct 2015 16:04:24 +0300
Eli Zaretskii wrote:

> An alphabet, AFAIU, has to have vowels that are represented as letters, equally to consonants. Hebrew with niqqud doesn't fit that description, because niqqud are not letters.

My sentiments exactly, but our sentiments don't match the definitions.

Unicode: "A writing system in which both consonants and vowels are indicated."

Daniels & Bright in 'The World's Writing Systems', p. 4: "In an _alphabet_, the characters denote consonants and vowels."

Richard.

From wjgo_10009 at btinternet.com Sat Oct 24 07:44:07 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Sat, 24 Oct 2015 13:44:07 +0100 (BST)
Subject: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)
In-Reply-To: <562A6A5C.7000506@unicode.org>
References: <33365178.33329.1444395933996.JavaMail.defaultUser@defaultHost> <6041047.15822.1444472090118.JavaMail.defaultUser@defaultHost> <26364438.57814.1445015417239.JavaMail.defaultUser@defaultHost> <562914A9.6030802@unicode.org> <7184978.47471.1445612461273.JavaMail.defaultUser@defaultHost> <562A6A5C.7000506@unicode.org>
Message-ID: <22827807.23217.1445690647143.JavaMail.defaultUser@defaultHost>

Rick McGowan referred to Google Translate.

I have been referred to Google Translate previously and I replied.

http://www.unicode.org/mail-arch/unicode-ml/y2011-m01/0112.html

I thought about what Rick wrote, yet the problem is the matter of the provenance of the translation. The clinician could not be sure of the provenance of the translation, even if, in fact, the translation were perfect, which it might be. Whereas with the preset sentences, previously translated manually by a native speaker at the request of the National Standards Body of the country where the language is spoken, the provenance of the translation would be part of the system.

I have used Google Translate on many occasions, mostly for translation into English, just to try to gain an understanding of some text written in a language other than English.

Another issue is that Google Translate requires at-the-time access to a remote server, whereas what I am suggesting would not require at-the-time access to a remote server: perhaps sometimes needing access to a remote server to update sentence lists.
William Overington

24 October 2015

From wjgo_10009 at btinternet.com Sat Oct 24 08:10:12 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Sat, 24 Oct 2015 14:10:12 +0100 (BST)
Subject: VS: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)
In-Reply-To: <000301d10db8$b7e2c250$27a846f0$@fi>
References: <33365178.33329.1444395933996.JavaMail.defaultUser@defaultHost> <6041047.15822.1444472090118.JavaMail.defaultUser@defaultHost> <26364438.57814.1445015417239.JavaMail.defaultUser@defaultHost> <562914A9.6030802@unicode.org> <7184978.47471.1445612461273.JavaMail.defaultUser@defaultHost> <000301d10db8$b7e2c250$27a846f0$@fi>
Message-ID: <5225466.24805.1445692212699.JavaMail.defaultUser@defaultHost>

Erkki I. Kolehmainen wrote:

> First of all, you have never paid any attention to the formidable problems of getting vetted translations of whatever proposed (or to-be-proposed) standard sentences of yours. You have admitted that you are not at all familiar with CLDR, but the people who have worked on CLDR are fully aware of the problems of getting agreement on localized expressions for all kinds of items.

I wrote within http://www.unicode.org/mail-arch/unicode-ml/y2015-m10/0181.html , which is the post to which you replied, the following text.

quote

I said that there would be a standardized list of preset sentences, set out in English, as International Standards are produced in English, and that the National Standardization Body for each country would translate the list into the language of its country and produce a list to convert the codes to the local language.

end quote

Now maybe I am missing some issue here, so if the above suggested process is regarded as problematic I would like to address any problems that are felt to exist.

> The value of deposit at the British Library seems questionable at best. Furthermore, if published means published on this list, it has no value whatsoever, since it does not mean any peer review and acceptance, which, as you well know, isn't forthcoming.

> Furthermore, if published means published on this list, ...

It does not. In the context of this thread, of the pdf document being published, published means published as in United Kingdom law about Legal Deposit. In the particular situation here, published refers to the fact that the pdf document was published in my family webspace by me, the publisher of the document. I am the publisher of the document and also the author of the document.

> Incidentally, the standards body that has had considerable dealings with some of the kinds of problems that you claim to be researching is ETSI Human Factors. You might want to approach them in order to get any support.

Thank you for that information.

William Overington

24 October 2015

From wjgo_10009 at btinternet.com Sat Oct 24 08:29:55 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Sat, 24 Oct 2015 14:29:55 +0100 (BST)
Subject: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)
Message-ID: <27236888.25861.1445693395359.JavaMail.defaultUser@defaultHost>

>> If this invention had been made in the research laboratory of a large information technology company maybe things would be very different.

Steven R. Loomis wrote:

> I would not (and have not) leapt from an idea to a document to a standard.
I won't repeat the good and helpful advice you have already received. Make your first target a working model, not standardization. That is the way the research laboratory of a large information technology company works.

Yes, it would be good for me to start by making a working model. However, I do not have the facilities to do so, nor indeed much of the knowledge and skills to do so either.

So I have tried to do what I can, in the hope that what I can produce might act as a catalyst for people wanting to implement the invention in a standardized manner: the standardization right from the start being very important, so that there is no proprietariness about what is put into use, and in the hope that lots of people will join in and together produce an elegant and useful implemented system.

Well, as at 2:26 pm United Kingdom time today, Saturday, my pdf document has not been added into the Unicode Document Register, nor have I received an email stating that it has been rejected. There is now just over a week to go before the next Unicode Technical Committee meeting, so it remains to be seen whether the document will be discussed by the Unicode Technical Committee at that meeting.

William Overington

24 October 2015

From wjgo_10009 at btinternet.com  Sat Oct 24 08:49:32 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Sat, 24 Oct 2015 14:49:32 +0100 (BST)
Subject: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)

Martin J. Dürst wrote:

> Also, if you had your set of sentences and their translations, it wouldn't be difficult to create e.g. a smart phone application for it.

Thank you. Unfortunately I do not have the facilities and knowledge and skills to produce a smart phone application myself.

> The doctor you mentioned was excited about your idea because she isn't a language specialist. If she had thought about, or experimented with, the idea, she would quickly have come to a point where she wants more and more sentences, for all kinds of slightly different situations.

Let us please call the above paragraph Paragraph A.

> That's the point where she will start to see that your idea isn't actually that great.

Let us please call the above paragraph Paragraph B.

Now from Paragraph A one can say, well, yes, there would need to be quite a collection of sentences encoded, maybe several hundred or even a thousand or more. Maybe lots of clinicians could suggest preset sentences, each from his or her own experience, and a list could be produced. A lot of work, and even then there could still be gaps, so that not every possible situation would be covered. Though there the PanLex dictionary could be used.

I do not follow how Paragraph B follows from Paragraph A.
William Overington

24 October 2015

From root at unicode.org  Sat Oct 24 12:44:35 2015
From: root at unicode.org (Sarasvati)
Date: Sat, 24 Oct 2015 12:44:35 -0500
Subject: The scope of Unicode - Closed

This discussion of "the scope of Unicode" is now closed. Please refrain from sending any further replies to this list.

Thank you for your cooperation.

Regards from your,
-- Sarasvati

From eik at iki.fi  Sat Oct 24 12:56:37 2015
From: eik at iki.fi (Erkki I Kolehmainen)
Date: Sat, 24 Oct 2015 20:56:37 +0300
Subject: VS: VS: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)

Mr. Overington,

You have certainly missed the point. I mentioned CLDR and the practical translation problems that we encounter with it because Unicode has been exceptionally successful in activating people to work with it.

You seem to know of the workings of the National Bodies even less than you do of CLDR. To my knowledge, there is no NB that has the resources (or even the will) to do what you expect them to do. You cannot address this problem unless you have extremely deep pockets and are prepared to fund the operation of the various National Bodies (which probably could not accept this funding anyway), who have had to abandon active participation in several areas that they have deemed important in the past. (That is partially due to the hyper drive by ISO of OSI that turned out to be a catastrophic fiasco.)

On my part, I refrain from addressing this subject area any further on the public list.

Sincerely, Erkki I. Kolehmainen

From: William_J_G Overington [mailto:wjgo_10009 at btinternet.com]
Sent: 24 October 2015 16:10
To: eik at iki.fi; unicode at unicode.org; rick at unicode.org
Subject: Re: VS: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)

Erkki I. Kolehmainen wrote:

> First of all, you have never paid any attention to the formidable problems of getting vetted translations of whatever proposed (or to be ---) standard sentences of yours. You have admitted that you are not at all familiar with CLDR, but the people who have worked on CLDR are fully aware of the problems of getting agreed localized expressions for all kinds of items.

I wrote within http://www.unicode.org/mail-arch/unicode-ml/y2015-m10/0181.html , which is the post to which you replied, the following text.

quote

I said that there would be a standardized list of preset sentences, set out in English, as International Standards are produced in English, and that the National Standardization Body for each country would translate the list into the language of its country and produce a list to convert the codes to the local language.

end quote

Now maybe I am missing some issue here, so if the above suggested process is regarded as problematic I would like to address any problems that are felt to exist.

> The value of deposit at the British Library seems questionable at best.
> Furthermore, if published means published on this list, it has no value whatsoever, since it does not mean any peer review and acceptance, which, as you well know, isn't forthcoming.

> Furthermore, if published means published on this list, ...

It does not. In the context of this thread, where the pdf document has been published, published means published as in United Kingdom Law about Legal Deposit. In the particular situation here, published refers to the fact that the pdf document was published in my family webspace by me, the publisher of the document. I am the publisher of the document and also the author of the document.

> Incidentally, the standards body that has had considerable dealings with some of the kinds of problems that you claim to be researching is ETSI Human Factors. You might want to approach them in order to get any support.

Thank you for that information.

William Overington

24 October 2015

From rick at unicode.org  Sat Oct 24 13:02:28 2015
From: rick at unicode.org (Rick McGowan)
Date: Sat, 24 Oct 2015 11:02:28 -0700
Subject: Unicode Emoji Charts updated

The Unicode Emoji charts have been updated to show the new images from Apple, and the Selection Factors for emoji proposals have also been updated. Among the many other topics at the Unicode technical conference on October 26-28, there will be a new panel session on emoji, for people to find out more about how new emoji are developed.

From doug at ewellic.org  Sat Oct 24 17:57:41 2015
From: doug at ewellic.org (Doug Ewell)
Date: Sat, 24 Oct 2015 16:57:41 -0600
Subject: A Bulldog moves on

I wish this day had never come.

http://obits.dignitymemorial.com/dignity-memorial/obituary.aspx?n=Michael-Kaplan&lc=4246&pid=176192738&uuid=ec6b8cda-c4b1-4b5f-9422-925b1e09a03a

--
Doug Ewell | http://ewellic.org | Thornton, CO

From duerst at it.aoyama.ac.jp  Sat Oct 24 18:41:12 2015
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Sun, 25 Oct 2015 08:41:12 +0900
Subject: A Bulldog moves on

Hello Doug,

Thanks for making us aware of this very sad event. Michael did a lot for Unicode, and fought bravely with his illness. I hope we can all remember him this week at the Unicode Conference, where he gave so many amazing talks.

I also hope that somebody somehow will be able to preserve all his tremendously instructive and funny blogs.

Regards, Martin.

On 2015/10/25 07:57, Doug Ewell wrote:
> I wish this day had never come.
>
> http://obits.dignitymemorial.com/dignity-memorial/obituary.aspx?n=Michael-Kaplan&lc=4246&pid=176192738&uuid=ec6b8cda-c4b1-4b5f-9422-925b1e09a03a

From umesh.p.nair at gmail.com  Sat Oct 24 19:02:56 2015
From: umesh.p.nair at gmail.com (Umesh P N)
Date: Sat, 24 Oct 2015 17:02:56 -0700
Subject: A Bulldog moves on

Very sad news. Anybody who has met him or listened to his talks will remember him for ever. Condolences...

On Sat, Oct 24, 2015 at 4:41 PM, Martin J.
Dürst wrote:

> Hello Doug,
>
> Thanks for making us aware of this very sad event. Michael did a lot for Unicode, and fought bravely with his illness. I hope we can all remember him this week at the Unicode Conference, where he gave so many amazing talks.
>
> I also hope that somebody somehow will be able to preserve all his tremendously instructive and funny blogs.
>
> Regards, Martin.

--
umesh.p.nair at gmail.com

From textexin at xencraft.com  Sat Oct 24 20:20:32 2015
From: textexin at xencraft.com (Tex Texin)
Date: Sat, 24 Oct 2015 18:20:32 -0700
Subject: A Bulldog moves on

I am quite saddened to read this.

Michael had amazing strength and courage in dealing with his physical challenges. He was also very honest and frank in speaking on nearly any subject, even when he was discussing topics about his sponsors or employers. This may have worked against him at times, but it was very much appreciated by his audience, even more so because the technical details were quite accurate and insightful.

He provided a tremendous library of information regarding internationalization, often in subject areas that were not addressed elsewhere: Visual Basic, keyboard support, Microsoft internationalization libraries, and some southeast Asian languages. For many other internationalization topics he offered education that was comprehensible, in a humorous and often personalized manner.

Michael will be missed not only for his wealth of knowledge and experience, but for his frankness and camaraderie.

I am not sure if the list will let the image through, but I attached a picture of Michael at IUC31.

tex

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Doug Ewell
Sent: Saturday, October 24, 2015 3:58 PM
To: unicode at unicode.org
Subject: A Bulldog moves on

I wish this day had never come.

http://obits.dignitymemorial.com/dignity-memorial/obituary.aspx?n=Michael-Kaplan&lc=4246&pid=176192738&uuid=ec6b8cda-c4b1-4b5f-9422-925b1e09a03a

--
Doug Ewell | http://ewellic.org | Thornton, CO

From lists+unicode at seantek.com  Sat Oct 24 23:57:44 2015
From: lists+unicode at seantek.com (Sean Leonard)
Date: Sat, 24 Oct 2015 21:57:44 -0700
Subject: A Bulldog moves on

A very sad day in the history of this community. I learned a lot about Unicode, and about internationalization and localization on Windows, directly through his posts.

And now, having done a bit of research, it looks like he left the Internet a gift, with some recent blog posts about quite a second moonlighting life!

http://www.siao2.com/

RIP Michael. You had a great run.

Sean

On 10/24/2015 3:57 PM, Doug Ewell wrote:
> I wish this day had never come.
> > http://obits.dignitymemorial.com/dignity-memorial/obituary.aspx?n=Michael-Kaplan&lc=4246&pid=176192738&uuid=ec6b8cda-c4b1-4b5f-9422-925b1e09a03a

From charupdate at orange.fr  Mon Oct 26 02:53:40 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Mon, 26 Oct 2015 08:53:40 +0100 (CET)
Subject: A Bulldog moves on

I'm very saddened. A big thank you to Michael S. Kaplan for his great work and help. Sincere condolences to his family.

Marcel

On Sat, 24 Oct 2015 16:57:41 -0600, Doug Ewell wrote:

> I wish this day had never come.
>
> http://obits.dignitymemorial.com/dignity-memorial/obituary.aspx?n=Michael-Kaplan&lc=4246&pid=176192738&uuid=ec6b8cda-c4b1-4b5f-9422-925b1e09a03a

From daniel.buenzli at erratique.ch  Mon Oct 26 12:39:09 2015
From: daniel.buenzli at erratique.ch (Daniel Bünzli)
Date: Mon, 26 Oct 2015 17:39:09 +0000
Subject: Emoji data in UCD xml ?

Hello,

If I read correctly UTR #51, the way of determining whether a scalar value is an emoji character is to consult this data file [1]. Are there any plans to integrate this data in the UCD xml?

Best,

Daniel

[1] http://www.unicode.org/Public/emoji/1.0/emoji-data.txt

From doug at ewellic.org  Mon Oct 26 13:50:59 2015
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 26 Oct 2015 11:50:59 -0700
Subject: Emoji data in UCD xml ?

Daniel Bünzli wrote:

> If I read correctly UTR #51, the way of determining whether a scalar value
> is an emoji character is to consult this data file [1]. Are there any
> plans to integrate this data in the UCD xml?

Apologies in advance for asking an annoying "what are your motives" question, but: What is your goal in classifying a character as "emoji" or "not emoji"?

According to emoji-data.txt, U+00A9 COPYRIGHT SIGN and U+00AE REGISTERED SIGN are emoji, while U+263B BLACK SMILING FACE is not. This may be at variance with what many people would expect.

--
Doug Ewell | http://ewellic.org | Thornton, CO

From daniel.buenzli at erratique.ch  Mon Oct 26 14:33:35 2015
From: daniel.buenzli at erratique.ch (Daniel Bünzli)
Date: Mon, 26 Oct 2015 19:33:35 +0000
Subject: Emoji data in UCD xml ?

On Monday, 26 October 2015 at 18:50, Doug Ewell wrote:

> What is your goal in classifying a character as "emoji" or "not emoji"?

Part of a heuristic for best-effort character width estimation in terminals.

Daniel

From charupdate at orange.fr  Tue Oct 27 07:54:36 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Tue, 27 Oct 2015 13:54:36 +0100 (CET)
Subject: Non-standard 8-bit fonts still in use

I was preparing the following feedback long before the obituary of Michael S. Kaplan. I stay in mourning. Since discussion has restarted, am I allowed to send this today, instead of tomorrow?
Initially it was planned for yesterday, the day when I found Doug Ewell's and the following messages, which brought me the bad news.

I'm grateful for Erkki I. Kolehmainen's advice to complete best effort prior to sending. My apologies for previous shortcomings.

On Thu, 15 Oct 2015 20:22:08 -0400, Don Osborn wrote:

> I was surprised to learn of continued reference to and presumably use of
> 8-bit fonts modified two decades ago for the extended Latin alphabets of
> Malian languages [...]
>
> See my recent blog post for a quick and by no means complete discussion
> about this topic, which of course has to do with more than just the
> fonts themselves:
> http://niamey.blogspot.com/2015/10/the-secret-life-of-bambara-arial.html

Here is another example of legacy font usage less than two years back:

http://csprousers.org/forum/viewtopic.php?f=1&t=753

Legacy fonts offer at least one substantial advantage, which is already underscored in the comments on the cited blog post: they allow the use of any habitual ASCII-oriented keyboard layout, like French in Mali. Personally I feel with all the people who keep using the fonts issued by the "1990s [...] joint project of the Malian Ministry of Education and the French Agence de Coopération Culturelle et Technique (ACCT [...])", and I wouldn't throw away a proven work tool either without being sure of getting a better one. I suppose the cited people in any case didn't use the "clavier unifié français-bambara" that you linked to on another blog post:

http://www.mali-pense.net/IMG/pdf/le-clavier_francais-bambara.pdf
cited on:
http://www.mali-pense.net/Ressources-pour-la-pratique-du.html
cited on:
http://niamey.blogspot.fr/2014/11/writing-bambara-right.html

Indeed there is a big gap in keyboard layouts on Windows: we cannot associate applications with default keyboard layouts the way we can associate file extensions with applications. So one working method, to avoid being bothered with switching keyboard layouts, is to have appropriate templates in the word processor with extra fonts instead of extra layouts.

The glyph issue: To get "Bambara Arial" to work on the internet, a simple macro replacing the legacy code points (on the q, Q, x, X, v, V and other slots) with the proper Unicode letters isn't enough, because, though Arial, it won't be "true Bambara" any more, given the inconsistency of all the fonts I could view, which use the "n" form for uppercase eng, like Arial and Times do, while sticking with the "N" form for uppercase palatal n. I believe this is not just "more to it" but even the main reason, despite opposite opinions in the comments. (A sketch of such a remapping follows after this message.) I couldn't gather any suitable font, but book printers must have them, and possibly both shapes in the same font. Such fonts seem to be really confidential. In a bilingual Bambara-French book from France (1996), the typeface clearly shows that the "n"-shaped uppercase letter has been emulated by using oversize lowercase. Ported to HTML, this workaround results in replacing the uppercase letter by lowercase in a [font-size: 135%; line-height: 75%; font-weight: lighter;] span, though it keeps looking semi-bold when weights below 400 are unavailable. That can make one aware that it doesn't render in plain text. And it works best with sans-serif fonts. I really don't know whether this has at least some resemblance to Bambara Arial, and I do wish to be able to check. I note, too, that such a construct is not Unicode conformant.

It would be desirable to overcome that system of special fonts, workarounds, and limited support. I don't know whether some communities really prefer the "N"
form for uppercase palatal n, or even for eng, or both. Was there a problem at the time when the actual fonts were created? Does anybody know a solution? I am inclined to believe that this would eventually be language tagging and the use of modern rendering engines, along with up-to-date fonts providing both glyphs. However, in my opinion, correct display of so widely spoken and written a language as Bamanankan should not have to rely on sophisticated byways.

About Unicode-aware education: I'm not likely to share presumptions about lack of training in Mali more than in other countries, including European ones. Keyboard layout documentation from Europe, last updated after fifteen years of Unicode, still targets non-conformant rendering engines (where precomposed and decomposed characters display differently) and doesn't mind using canonical decomposition afterwards to streamline the input (actually, for Bambara, by using the French accented-letter keys). Well, the existence of decomposition in text processing was about the first thing Unicode taught me, as I was too ignorant to directly point the browser to TUS and the UAXes to learn about, while already bothering about creating keyboard layouts... That turns out not to be an isolated case. And with the one-laptop-per-child program, Malian and other African technicians will be making far better keyboards than Europeans did.

I'd quoted some parts from the following forum page of 2006, but removed them to shorten. I believe it says a lot about the topic. Interested subscribers are welcome to look it up on line:

Ibibio, Efik, Anaang and ICT (fonts, keyboards, applications)
http://www.quicktopic.com/37/H/q8r5VVqGF5Q

Yet another solution is a universal Latin backward-compatible layout on a French keyboard, optimized for Bambara and including N'Ko:

1 - Universal Latin is rather intuitive and backwards compatible, with a Compose key plus five classic dead keys (on three physical keys); optimization is performed by (a) filling up the keys on the third and fourth levels, and (b) filling up the circumflex group (and other groups) with the additional characters, given that circumflex is already the base shift state dead key on the French layout, and Q, V, X don't exist with circumflex. That definitely overcomes the traditional but unnecessary limitation of dead keys to one diacritic, instead of considering that every dead key gives access to a natural group (as opposed to the artificial groups defined in the global keyboard standard), as the US International keyboard had started doing for some accented letters.

2 - N'Ko being a caseless script, its 59 characters can be placed into the Kana shift states. The standard Kana toggle in Windows drivers requires a whole key (e.g. above Tab, suitable for the French layout), while Keyman allows adding a supplemental toggle that doesn't need its own key, invoked for example by Ctrl+CapsLock. That toggle will allow N'Ko to be integrated.

3 - Universal Latin makes sense also because it covers *all* the characters listed on this page:

http://www.bisharat.net/A12N/charsum.html

as well as all the circled (plain text) digits and letters, and in its final stage a lot of symbols and pictographs.

Good luck,

Marcel
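A minimal sketch of the remapping macro discussed in the message above, assuming a purely hypothetical legacy font that used the q, Q, x, X, v, V slots for Bambara letters; any real conversion table would have to be read off the actual legacy font being replaced, and, as noted above, remapping code points cannot fix glyph-level issues such as the preferred shape of uppercase eng.

    # Python sketch: remap text typed with a hypothetical legacy 8-bit
    # Bambara font onto real Unicode code points. The table is an invented
    # example; each legacy font had its own ASCII slot assignments.
    import unicodedata

    LEGACY_TO_UNICODE = str.maketrans({
        "q": "\u0272",  # ɲ LATIN SMALL LETTER N WITH LEFT HOOK
        "Q": "\u019D",  # Ɲ LATIN CAPITAL LETTER N WITH LEFT HOOK
        "x": "\u014B",  # ŋ LATIN SMALL LETTER ENG
        "X": "\u014A",  # Ŋ LATIN CAPITAL LETTER ENG
        "v": "\u025B",  # ɛ LATIN SMALL LETTER OPEN E
        "V": "\u0190",  # Ɛ LATIN CAPITAL LETTER OPEN E
    })

    def remap(legacy_text):
        """Translate legacy font slots to Unicode, then normalize to NFC."""
        return unicodedata.normalize("NFC",
                                     legacy_text.translate(LEGACY_TO_UNICODE))

    print(remap("mvnv"))  # a made-up legacy spelling, yielding "mɛnɛ"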
From unicode at mxmerz.de  Tue Oct 27 16:03:43 2015
From: unicode at mxmerz.de (Max Merz)
Date: Tue, 27 Oct 2015 22:03:43 +0100
Subject: Emoji Proposal: Face With One Eyebrow Raised

Hello,

I would like to submit a proposal to encode an emoji depicting a "face with one eyebrow raised", as to indicate scepticism, surprise, concern, or disagreement.

The "Submitting Character Proposals" page on unicode.org recommends discussing preliminary proposals on this mailing list. I am currently working on my proposal, but I would appreciate general feedback about whether this idea is doomed from the start, has already been discussed, comes at a bad time, etc.

Best regards,

Max Merz

From sarabiarafael at gmail.com  Wed Oct 28 04:59:49 2015
From: sarabiarafael at gmail.com (Rafael Sarabia)
Date: Wed, 28 Oct 2015 10:59:49 +0100
Subject: Unicode equivalence between Word for Windows/MAC

Dear all,

I need to use a document both in Word 2007 for Windows and Word 2011 for Mac, and I'm finding some incompatibility issues.

The file has been created in Word for Windows and saved as "Unicode" (which, I believe, although I am not certain, means "UTF-16").

In Word 2011 for Mac I have several options to save it as Unicode: Unicode 6.1, Unicode 6.1 Little-Endian, and UTF-8. None of them seem to be equivalent to the "Unicode" encoding in Word 2007 for Windows.

My question is very simple: which encoding in Word 2011 for Mac is equivalent to "Unicode" in Word for Windows (*and allows me to work in both operating systems/Word programs interchangeably*)? One of the three abovementioned possibilities, or another one? (I don't have the complete list in front of me.)

Thank you very much in advance.

Kind regards.

Rafael Sarabia

From verdy_p at wanadoo.fr  Wed Oct 28 11:06:15 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 28 Oct 2015 17:06:15 +0100
Subject: Unicode equivalence between Word for Windows/MAC

"Unicode 6.1" just indicates the version of the repertoire; Mac is indicating there that the encoding is based on Unicode (just like Windows) and that it is UTF-16. With the "Little-Endian" option, the file is saved with UTF-16 code units in little-endian order (as on PCs and on today's x86-based Macs); on old versions of MacOS the byte order was big-endian (as those Macs were not based on the x86 architecture, but on the Motorola 68K in the first generations and later on PowerPC).

Word documents actually have two internal structures: the old .doc binary format, and later the .docx format, which is based on XML (the actual files are in fact ZIP archives containing multiple XML files). Each of these XML files declares its own encoding, which can be UTF-8; UTF-16 starting with a "byte order mark" (BOM) indicating whether it is little-endian or big-endian; or "UTF-16LE", where there is no leading BOM but the order is assumed to be little-endian. The UTF-8 encoding does not depend on byte order; however, it may take a bit more space than UTF-16 when working with non-Latin scripts. This difference in size (significant for languages like Japanese, Korean or Chinese) is nonetheless insignificant for .docx files, as they are internally compressed in a ZIP archive.
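A minimal sketch of what that structure looks like in practice, assuming a hypothetical file named example.docx; the member name word/document.xml is the conventional main part of a .docx package.

    # Python sketch: a .docx file is a ZIP archive of XML parts, and each
    # part's leading bytes reveal its encoding (BOM and/or XML declaration).
    import zipfile

    with zipfile.ZipFile("example.docx") as docx:   # hypothetical file name
        head = docx.read("word/document.xml")[:64]

    # A UTF-16 part starts with FF FE (little-endian) or FE FF (big-endian);
    # a UTF-8 part usually starts directly with b'<?xml version="1.0" ...'.
    print(head)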
Additionally, files on the Mac not only contain the document but also have a legacy "resource fork" containing some metadata about the application that created the document, tagging its internal format and some other format-specific options. But that metadata is generally not transmitted if you forward the file, for example as an email attachment.

UTF-8 and UTF-16 encodings should all be interoperable between Windows and MacOS anyway, notably for files in .docx format (internally based on XML), except for a few characters that are specific to MacOS (such as the Apple logo) and only supported by private encodings, or by the PUA with MacOS-specific fonts, when mapped to Unicode.

The main differences will not really be in the character encoding but in the internal support of custom extensions for Word, such as active scripts, or embedding options for documents created by other apps, possibly not from Microsoft itself. Those Word documents will load on all platforms, but some embedded items may not be found on the target machine, including fonts used on the source machine but not embedded in the transmitted document. For this reason, Word/Office comes on the Mac with a basic set of fonts made by Microsoft and supported by Windows, installed with Word/Office as complements, including Arial, Verdana, Courier New, Times New Roman, and so on, whereas MacOS natively had Helvetica, Courier, Times, and others, which are very similar in terms of design but have slightly different metrics.

The main cause of incompatibility will then be if you create a document on your Mac with fonts specific to your system but not preinstalled with Office on both platforms: if a font is missing there are some fallbacks to other fonts, but the document layout may be slightly altered due to the different font metrics.

The second kind of incompatibility will occur if you have embedded in your Word document some "objects" created by Windows-specific or MacOS-specific apps that the recipient does not have on his system: those embedded components may just show as a blank rectangle on the page.

The third kind of incompatibility comes from scripting: embedded scripts in your document may be disabled automatically on the recipient machine unless those components are installed, their use is authenticated by a digital signature of the document by its creator, and you have given trust to this creator to execute his scripts on your system. This is not specific to Windows or MacOS, and those scripts will be disabled on the recipient machine even if the recipient has the same OS and the same version of Word as the initial creator of the document. Office will display in this case a "yellow warning banner" about those (unsigned or untrusted) scripts, but the document should still be readable and editable without those scripts, even if the scripts (most often written in VBScript) are not runnable. Those scripts are normally helper tools used on the creator's PC, intended to help create/edit the document, but they should not be needed to read the document "as is". When you create a file with those tools, you should save the file in a statically rendered version where those tools are purged from the document, or a version where those embedded scripts are signed by an (Office) tool installed and trusted on both systems.
But character encodings are not an issue if those encodings are Unicode-based. UTF-8 and UTF-16 are interoperable between Windows, MacOS, Linux, Unix and almost all modern OSes; this was not the case with old 8-bit charsets such as the Windows codepages, or MacRoman and similar, due to their many national variants, or variants specific to some versions of these OSes. The exception is if the recipient uses an old version of the Office apps on an old OS without native Unicode support.

UTF-8 and UTF-16 have been present and natively supported in Windows since Windows 98. On Windows 95 you may have to install an additional Unicode compatibility layer, and old versions of Windows 3.x or before may not support those encodings correctly; they will typically also not support the newer XML-based .docx or .odt formats, but only the legacy .doc format or older .rtf formats, with a limited set of private encodings and a limited character repertoire. Those old Office apps may then display some "tofu" if the encoding is not supported. (Note also that Office comes with a set of "format converters" to help convert incoming documents to one of the supported formats, but the conversion may be lossy or approximate in some cases.)

For all versions of Windows with a native "Win32" API (instead of the old "Win16" and "DOS"/"BIOS"-like APIs), you should never encounter issues with Unicode encodings; the same applies to all versions of Mac OS since MacOS X (which has native support of Unicode, instead of the legacy Mac codepages).

Native support of the Unicode UTFs is no longer an option on most OSes (and at least on all OSes where you'll use an Office application, a web browser, or any graphical desktop environment); it is preinstalled by default on all modern OSes, including Unix/Linux. Old 8-bit encodings and fonts for X11 are still supported on those systems, and a few default fonts for those legacy encodings are still installed, but most applications no longer use or need them, except for the system console/shell used in non-graphic mode for legacy terminal emulations such as the VT-220 and similar, or on embedded versions of Linux which don't actually need to render or decode any text and transmit text data "as is", or which only support basic ASCII or a single 8-bit codepage for those consoles, without support for internationalization.

2015-10-28 10:59 GMT+01:00 Rafael Sarabia :

> Dear all,
>
> I need to use a document both in Word 2007 for Windows and Word 2011 for
> Mac, and I'm finding some incompatibility issues.
> [...]
>
> Rafael Sarabia
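A minimal sketch of how such a plain-text "Unicode" file can be told apart in practice, by sniffing its leading byte order mark; the file name is a hypothetical placeholder, and the fallback guess is only a heuristic.

    # Python sketch: guess among the "Unicode" flavours that word processors
    # offer for plain-text saves, from the file's leading byte order mark.
    import codecs

    def sniff(data):
        """Return a best-effort encoding guess from a leading BOM, if any."""
        if data.startswith(codecs.BOM_UTF8):
            return "utf-8-sig"      # UTF-8 with BOM, common on Windows
        if data.startswith(codecs.BOM_UTF16_LE):
            return "utf-16-le"      # what Word for Windows saves as "Unicode"
        if data.startswith(codecs.BOM_UTF16_BE):
            return "utf-16-be"      # "Unicode big endian"
        return "utf-8"              # no BOM: plain UTF-8 is the safest guess

    with open("example.txt", "rb") as f:  # hypothetical file name
        print(sniff(f.read(4)))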
From doug at ewellic.org  Wed Oct 28 11:52:51 2015
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 28 Oct 2015 09:52:51 -0700
Subject: Unicode equivalence between Word for Windows/MAC

Rafael Sarabia wrote:

> My question is very simple: which encoding in Word 2011 for Mac is
> equivalent to "Unicode" in Word for Windows (*and allows me to work in
> both operating systems/Word programs interchangeably*)? One of the three
> abovementioned possibilities, or another one? (I don't have the complete
> list in front of me.)

There seems to be an issue with Word for Mac in reading and writing Unicode text, which should be as simple and straightforward as you expect:

http://answers.microsoft.com/en-us/mac/forum/macoffice2011-macword/why-does-word-for-mac-always-mangle-unicode-text/ad95c7ab-ab56-45af-8a74-51d26f079d25

--
Doug Ewell | http://ewellic.org | Thornton, CO

From jkorpela at cs.tut.fi  Wed Oct 28 13:05:53 2015
From: jkorpela at cs.tut.fi (Jukka K. Korpela)
Date: Wed, 28 Oct 2015 20:05:53 +0200
Subject: Unicode equivalence between Word for Windows/MAC

28.10.2015, 11:59, Rafael Sarabia wrote:

> I need to use a document both in Word 2007 for Windows and Word 2011 for
> Mac, and I'm finding some incompatibility issues.

Before going into the details of plain-text file encodings, I think it is important to decide whether you need to use plain text for information interchange. If you simply save the document in Word format (perhaps safest in the older format, a .doc file), I would expect the Windows and Mac versions to play well together.

If you specifically do not want to use Word format, then why are you using Word in the first place, as opposed to using a Unicode-aware plain-text editor? (This is a genuine question, not rhetorical. There are possible reasons for using Word even when you want plain text, e.g. the use of the spell-checking features in Word.)

Yucca

From charupdate at orange.fr  Thu Oct 29 03:29:15 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Thu, 29 Oct 2015 09:29:15 +0100 (CET)
Subject: Latin glottal stop in ID in NWT, Canada

On Thu, 15 Oct 2015 15:46:46 -0700, Leo Broukhis wrote:

> Along the same lines, should I be able to change my last name
> officially to ?pyx?c? (NB all letters are codepoints with names
> starting with "LATIN").

This request results in using Latin to imitate Cyrillic in a country where this kind of approach has never been official. In case somebody has missed that: the current thread is about enforcing *legal* orthography in a country where it is part of *official* languages.

Arguing that this aping of Cyrillic is "along the same lines" is stacking insult upon insult. It's a true example of the way jokification can be perverted. As that has been done in public, it brings the need for a public apology, particularly with respect to future archive readers.

This has already been raised off-list, in conformance with the List Policies. However, I feel the need to send it "on the record", so everybody is reassured that there was more than one single person defending the seriousness of the thread's subject. On this occasion, I apologize in turn for having felt obliged to give a reply on the spot in public, being the addressee of the above request.
A more considered answer two weeks later ends up being better than a bad one delivered in time.

Expectantly,

Marcel

From charupdate at orange.fr  Thu Oct 29 04:17:23 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Thu, 29 Oct 2015 10:17:23 +0100 (CET)
Subject: Terminology (was: Latin glottal stop in ID in NWT, Canada)

On 2015-10-23 20:17 GMT+08:00, gfb hjjhjh wrote:

> writing other languages in Latin alphabet is still called romanization not latinization.

The legacy wording is due to the very comprehensive cultural phenomenon that was originally referred to. Today, referring to the "*Roman* alphabet" mainly adds an antiquated touch to the functional confusion between a script and an alphabet. IMO, when we still think in terms of "Roman alphabet" instead of "Latin script", we seem somehow not to have been given the opportunity to grow up into the age we are living in. Here the guilty ones are the word processors that have "italic" as a toggle to streamline the UI, so that users can't learn how to call "put it back to roman".

Then, accurate expression is challenged in English by compound semantics: "scripture" implies holiness, and "script" implies handwriting, while the hypothetical "writ" would be a confusing tongue-twister. But "script" for "writing system" looks good to me, and certainly to all people familiar with Unicode thanks to some training. Thanks to the Unicode Consortium! Looking at how many North Americans are still missing the point, we get another reason not to believe that Africans could be less trained about Unicode than people on other continents are.

Best,

Marcel

From verdy_p at wanadoo.fr  Thu Oct 29 08:25:52 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 29 Oct 2015 14:25:52 +0100
Subject: Latin glottal stop in ID in NWT, Canada

2015-10-29 9:29 GMT+01:00 Marcel Schneider :

> On Thu, 15 Oct 2015 15:46:46 -0700, Leo Broukhis wrote:
>
> > Along the same lines, should I be able to change my last name
> > officially to ?pyx?c? (NB all letters are codepoints with names
> > starting with "LATIN").
>
> This request results in using Latin to imitate Cyrillic in a country
> where this kind of approach has never been official.

The glottal stop is used in African countries that have never used any Cyrillic alphabet. That letter is a full part of the Latin alphabet, and is needed for those languages. The Latin glottal stop also competes with a representation in the Arabic script (the letter form, however, is different).

On the opposite, Native Americans HAVE used the Cyrillic script in Alaska, and probably as well in North-Western territories in Canada, at a time when native languages started to be alphabetized by missionaries.
Today, it is natural that native Americans who have strong cultural and linguistic links with other native peoples all around the Arctic Circle (in Alaska, Northern Europe, and Northern and Eastern Russia) want to restore their communications, even if they now live in different countries that have other dominant languages.

I don't think that the glottal stop is strange in the Latin script. It is also part of the IPA symbols (where it is unicased), but outside IPA that letter should be dual-cased, like other Latin letters. Until recently the lowercase form was not encoded, only because the glottal stop was initially encoded for IPA.

> In case somebody has missed that: the current thread is about enforcing
> *legal* orthography in a country where it is part of *official* languages.

Canadian Syllabics is also used, and maybe this was the reason for opposing the letter. But as soon as Canada wants romanizations, syllables present in Canadian Syllabics should be represented correctly in Latin too. That letter has been encoded for a long time. With enough time, the lowercase form will get used too, because that is natural for the Latin script (which was initially also unicameral; lowercase letters only appeared during the Middle Ages).

> Arguing that this aping of Cyrillic is "along the same lines" is stacking
> insult upon insult.

Arguing about Cyrillic is a bad view. That letter is Latin, just like the other stops (or additional letters such as the schwa or the open O) needed for many minority languages around the world that have been romanized. In all times, the Latin script has been adapted to represent significant differences, with variants of base letters such as overstrikes or combining accents (which were slowly added over time, and only later formalized in stable orthographies). English also used more letters than those used today (e.g. wynn, and thorn, the latter still used in Nordic European countries).

> It's a true example of the way jokification can be perverted.

This can only come from someone in an administration that is not really aware of the history of languages, cultures and scripts. That person simply ignores the long evolution of the Latin script, including for the English and French used in today's Canadian administration and government. This is an educational problem for that person, or a lack of training. There are certainly many smart persons in the Canadian administration who could have argued against that limited vision based on modern English and French.

> As that has been done in public, it brings the need for a public apology,
> particularly with respect to future archive readers.
>
> This has already been raised off-list, in conformance with the List Policies.
> However, I feel the need to send it "on the record", so everybody is
> reassured that there was more than one single person defending the
> seriousness of the thread's subject.

From kenwhistler at att.net  Thu Oct 29 11:14:18 2015
From: kenwhistler at att.net (Ken Whistler)
Date: Thu, 29 Oct 2015 09:14:18 -0700
Subject: Emoji data in UCD xml ?

There has been some preliminary discussion of this. The problem is that
The UTC would first need to determine exactly what property (or list of properties) is involved, before incorporating it (or them) formally into the Unicode Character Database (UCD) and into the XML version of the UCD, and the documentation of it (or them) formally into UAX #44. --Ken On 10/26/2015 10:39 AM, Daniel B?nzli wrote: > If I read correctly UTR #51, the way of determining if a scalar value is an emoji character is to consult this data file [1]. Are there any plans to integrate this data in the UCD xml ? > > From charupdate at orange.fr Thu Oct 29 11:23:17 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Thu, 29 Oct 2015 17:23:17 +0100 (CET) Subject: Latin glottal stop in ID in NWT, Canada In-Reply-To: References: <1786035177.2903.1446107355879.JavaMail.www@wwinf1n04> Message-ID: <282388625.14143.1446135798058.JavaMail.www@wwinf1f02> On Thu, 29 Oct 2015 14:25:52 +0100, Philippe Verdy" wrote: > 2015-10-29 9:29 GMT+01:00 Marcel Schneider : > >> On Thu, 15 Oct 2015 15:46:46 -0700, Leo Broukhis wrote: >> >>> Along the same lines, should I be able to change my last name >>> officially to ?pyx?c? (NB all letters are codepoints with names >>> starting with "LATIN"). >> >> This request results in using Latin to imitate Cyrillic in a country where this kind of approach has never been official. > The glottal stop is used in African countries that have never used any Cyrillic alphabet. That letter is full part of the Latin alphabet but needed for those languages. That Latin glottal stop competes also with a representation in the Arabic script (the letter form however is different). The request I was referring to is Leo's. Otherwise, Philippe's contribution is informative and useful. It just shouldn't have been sent in *reply* to the above. I'm not likely to respond further on this topic, as I stopped to be concerned. I just repeat myself quoting: > Arguing that this aping Cyrillic be ?along the same lines?, is stacking insult over insult. > It?s a true example of the way how jokification can be perverted. > As that has been done in public, it brings the need of a public apology, particularly with respect to future archive readers. > This has already been exposed off list, in conformance to List Policies. However I feel the need to send it ?on the record?, so everybody is reassured that there was more than one single person defending the serious of the thread?s subject. Expecting, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From leob at mailcom.com Thu Oct 29 12:20:35 2015 From: leob at mailcom.com (Leo Broukhis) Date: Thu, 29 Oct 2015 10:20:35 -0700 Subject: Latin glottal stop in ID in NWT, Canada In-Reply-To: <282388625.14143.1446135798058.JavaMail.www@wwinf1f02> References: <1786035177.2903.1446107355879.JavaMail.www@wwinf1n04> <282388625.14143.1446135798058.JavaMail.www@wwinf1f02> Message-ID: Dear Marcel, In proposing my "along the same lines" post, I was intending not to mock the alleged feelings of the involved but to underline the impracticality of the idea by providing an extreme example of proliferating arbitrary characters defined as "latin letter" in Unicode in all documents, and the concomitant issues with maintaining public records, communicating with public officials, etc. 
However, I've just realized that the request to allow the glottal stop character in the name is likely to be limited to the birth certificate, which is absolutely fair; and I presume that it is well understood that in all everyday documents and materials requiring ease of data interchange, like a government ID, a credit card, or a passport, an official transliteration, be it a hyphen or an apostrophe, will be used.

Therefore, I apologize.

Regards,
Leo

On Thu, Oct 29, 2015 at 9:23 AM, Marcel Schneider wrote:
> [...]

From mark at macchiato.com  Thu Oct 29 13:20:50 2015
From: mark at macchiato.com (Mark Davis)
Date: Thu, 29 Oct 2015 11:20:50 -0700
Subject: Emoji data in UCD xml ?

As Ken said, there's been some preliminary discussion, but we wanted to get initial information out in connection with UTR #51 first, and take more time to consider what UCD properties would look like, and which are necessary.

The basic information that people want to access for implementations is:

- Is a character emoji or not?
- Which emoji have default text presentation? (the others having emoji presentation)
- Which emoji are modifiers, and which are modifier bases? (the others being neither)
- Which sequences of emoji (ZWJ and/or combining marks) are recommended, for those who support them?
- Flags and modifier sequences are specified algorithmically, and don't need to be listed.

The levels, the distinction between primary and secondary, and the carrier sources were useful in the development of the emoji data and TR51, but aren't really necessary for implementations.

Mark

On Thu, Oct 29, 2015 at 9:14 AM, Ken Whistler wrote:

> There has been some preliminary discussion of this.
> The problem is that the data in emoji-data.txt has not yet been formally
> rationalized into a coherent set of Unicode character properties. The UTC
> would first need to determine exactly what property (or list of properties)
> is involved before incorporating it (or them) formally into the Unicode
> Character Database (UCD) and into the XML version of the UCD, and the
> documentation of it (or them) formally into UAX #44.
>
> --Ken

From petercon at microsoft.com  Fri Oct 30 01:07:36 2015
From: petercon at microsoft.com (Peter Constable)
Date: Fri, 30 Oct 2015 06:07:36 +0000
Subject: Latin glottal stop in ID in NWT, Canada

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Philippe Verdy
Sent: Thursday, October 29, 2015 6:26 AM

> On the opposite, Native Americans HAVE used the Cyrillic script in Alaska
> and probably as well in North-Western territories in Canada...

In Alaska, yes, because the languages in question are, in fact, Siberian languages.

But where have you gotten the idea that Cyrillic script has been used in orthographies for languages spoken in the Northwest Territories? I've never seen any indication of that, and I am very doubtful.

(Btw, it's "the Northwest Territories", not "North-Western territories".)

Peter

From verdy_p at wanadoo.fr  Fri Oct 30 07:59:42 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 30 Oct 2015 13:59:42 +0100
Subject: Latin glottal stop in ID in NWT, Canada

Borders around Alaska were very fuzzy, and native Americans were mobile in the region. It seems unavoidable that at some time some of their languages were written down by missionaries, and books/religious texts were exchanged around. As well, even before Alaska was sold by the Russian Empire to the USA, there were many Russian migrants going to Canada and the USA via Alaska, and also meeting native Americans. The US and British Canadian authorities were not as active in those areas as they are today, and the aboriginal populations (as well as many migrants) were certainly more autonomous and more mobile than they are today, and had more cultural exchanges. At that time they were still not the small minorities they are today, and their usage of English and French was much less common.

PS: Note that I used the term "probably". "North-West Territories" is only today's name of an organized Canadian province. For a long time, this area was not incorporated, so I used a *generic* term (with "territories" in lowercase), and the term I used was probably referring to the whole Arctic region, where native Americans travelled long distances across the seasons for their traditional fishing and hunting.
If you look at a "common" map centered on the equatorial line, the Arctic region seems enormous; but look at a map centered on the pole, and consider what the limits of the ice shelves were in past centuries and how those populations were living in the area, independently of the European/American and Asian countries established to the south. The Arctic Ocean was an essential resource, and people lived all around it, on quite a thin strip of land and on ice shelves with very scarce resources. They had to be mobile, and received little help from the south. But the area was also regularly visited by European and Asian fishers and explorers, notably from Russia, looking for routes to the Atlantic or Pacific, selling products to the local native populations, or trying to bring them under some imperial rule.

There were also many more active native languages than those that remain today. Many of them are now extinct, or persist only in some old transcriptions written in the Latin or Cyrillic alphabets (possibly in sinograms or the Mongolian script too, via Chinese or Japanese explorers, fishers and merchants from their former imperial regimes: there could remain old books transcribing some of those old Arctic native languages). But these old transcriptions may have been preciously kept by today's native peoples in their local communities, or they could remain in some museum or public library anywhere around the Northern hemisphere.

2015-10-30 7:07 GMT+01:00 Peter Constable :

> From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Philippe Verdy
> Sent: Thursday, October 29, 2015 6:26 AM
>
> > On the opposite, Native Americans HAVE used the Cyrillic script in Alaska
> > and probably as well in North-Western territories in Canada...
>
> In Alaska, yes, because the languages in question are, in fact, Siberian languages.
> [...]
>
> Peter

From petercon at microsoft.com  Fri Oct 30 10:56:30 2015
From: petercon at microsoft.com (Peter Constable)
Date: Fri, 30 Oct 2015 15:56:30 +0000
Subject: Latin glottal stop in ID in NWT, Canada

> If you look at a "common" map centered on the equatorial line,

Philippe, I have personal ties to northern Canada. I'm aware of the distances. Alaska is comparable to the combination of France, Germany, Poland, Belarus and Ukraine. The distances involved are comparable to migrating from the Ural Mountains to France.

> "North-West Territories" is only today's name of an organized Canadian province.

You said your earlier reference was with a more generic meaning. But now you clearly misspell when referring to the administrative entity, even after I gave you the correct spelling. Also, in Canada, territories are not considered provinces: these different types of administrative unit have distinct statuses in relation to the constitution and the federal government.

Russian migrants going to wherever doesn't seem relevant to me.
Yes, potentially they can influence other peoples, but the only kinds of migrants that tend to influence literacy among other people groups are missionaries, and I'm not aware of Russian missionaries having worked in the Northwest Territories. The languages in question are spoken in coastal regions of Alaska. You either have to cross the width of Alaska or else cross the tall coastal mountains before you reach the northwestern territories of Canada. It seems very unlikely to me, given that you're dealing with very, very different ecological and climatic zones.

> there could remain old books

I could just as readily speculate that early Gauls in Normandy wrote with early ideographic writing. After all, it is far easier to migrate across Eurasia, with much less variation in climatic zones, than to go from the Alaskan coast to the Canadian interior. Rather than speculate, can we just stick to documented attestations we can point to? Hypothetical possibilities about Cyrillic don't seem too relevant to the topic of actual glottal stop usage in Canada, which is fairly well documented.

Peter
From richard.wordingham at ntlworld.com Fri Oct 30 14:09:03 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 30 Oct 2015 19:09:03 +0000
Subject: Latin glottal stop in ID in NWT, Canada
In-Reply-To: References: <1786035177.2903.1446107355879.JavaMail.www@wwinf1n04>
Message-ID: <20151030190903.4a97dc45@JRWUBU2>

On Fri, 30 Oct 2015 06:07:36 +0000 Peter Constable wrote:

> > On the opposite, Native Americans HAVE used the Cyrillic script in
> > Alaska and probably as well in North-Western territories in Canada...
>
> In Alaska, yes, because the languages in question are, in fact,
> Siberian languages.

I wouldn't describe Tlingit as a Siberian language. There are some old Cyrillic-script Christian materials in Tlingit. The Canadian connections are in British Columbia and Yukon.

Richard.

From js_choi at icloud.com Fri Oct 30 13:51:45 2015
From: js_choi at icloud.com (J.S. Choi)
Date: Fri, 30 Oct 2015 13:51:45 -0500
Subject: On emoji and the two rightwards black arrows
Message-ID:

# On emoji and the two rightwards black arrows

This is a long post, and I apologize for that; it's a somewhat complicated topic. The post is about two encoded characters: U+27A1 Black Rightwards Arrow and U+2B95 Rightwards Black Arrow <http://www.unicode.org/charts/PDF/U2B00.pdf>.

- The post first reviews their encodings' respective histories, as I currently understand them; hopefully I'm not mistaken about anything.
- It then informally suggests that U+2B95 be added to emoji-data.txt (and possibly be given standardized text/emoji variants), as U+27A1 already has been, on the basis that U+2B95 is equally suited, if not better suited, than U+27A1 to serve as a general rightwards arrow symbol.
- It also proposes that clarification about the differences between their intended functions, and about when to use one versus the other, be added to their entries in the code charts, as per their contrasting histories.

I don't intend to be making anything like a formal proposal yet, but I might in the future.
For now, I'd like to clarify the characters' respective intended purposes and see how feasible or likely the proposed changes would be before investing time, etc. in a formal proposal.

## History

The history below is taken from the following posts:

- Ken Whistler: 2015-05, 2015-10.
- Mark Davis: 2015-10.
- Michel Suignard: 2015-05. (Note that this post contains paragraphs quoted from another person that are not marked differently, with Suignard's replies below each one.)

- 1993: The glyphs from the ITC Zapf Dingbats typeface were encoded in the Unicode Standard 1.1 for compatibility with PostScript printers that use them. This included U+27A1 Black Rightwards Arrow. The Zapf Dingbats arrows all face rightwards, as generically rotatable arrow glyphs. No leftwards, upwards, or downwards versions of the arrows were encoded, because PostScript printers were assumed to rotate the generic rightwards arrows of the original Zapf Dingbats fonts. U+27A1's representative glyph is taken from Zapf Dingbats.

- 2003: Representatives of North Korea (the DPRK) submitted a proposal to add compatibility characters for a DPRK encoding standard. These included black-filled arrows in the four cardinal directions. The proposal included only leftwards, upwards, and downwards black arrows, apparently because the representatives believed that U+27A1 fit their purposes for compatibility with their rightwards black arrow. The former three were encoded as U+2B05–U+2B07 in the Unicode Standard 4.0. Their representative glyphs and names were taken from the DPRK proposal; the glyphs and names thus did not align with U+27A1 (e.g., U+2B05 Leftwards Black Arrow vs. U+27A1 Black Rightwards Arrow). Whistler states that "nobody commented on" them and "nobody much cared, because these were compatibility additions for a DPRK standard, and weren't mapped to any commercial sets at the time, anyway" (2015-05). The unification of the new DPRK compatibility arrows U+2B05–U+2B07 with rotations of the Zapf Dingbats arrow U+27A1 was implied by the Standard but never made explicit. For the next decade, most fonts implementing all four characters used glyphs matching the code charts' (i.e., the mismatching Zapf Dingbats glyph for the right arrow, and the DPRK glyphs for the other black arrows).

- 2011–2013: Google, Apple, and Microsoft began to support emoji characters from Japanese cellular carriers, using characters from the Unicode Standard 6.0. Four of those Japanese-carrier characters are black arrows in the four cardinal directions (UTR #51). The three companies used the DPRK-compatibility black arrows U+2B05–U+2B07 for three of them and, presumably because it was assumed to be part of their set and there was no better alternative, the Zapf Dingbats arrow U+27A1 for the final, rightwards black arrow from the Japanese-carrier emoji. Based on then-current usage, these four characters' mappings, among others, were added to a new, separate Unicode data file for emoji data. The data in this file have "not yet been formally rationalized into a coherent set of Unicode character properties" (Whistler 2015-10).

- 2014: A "complete re-rationalization of all the arrows symbols" occurred (Whistler 2015-05) in the Unicode Standard 7.0, due to the addition of arrows from Wingdings, Wingdings 2, and Webdings. The DPRK-compatibility black arrows U+2B05–U+2B07 were unified with similar Wingdings black arrows, and their representative glyphs were modified to harmonize.
However, the glyph of the Zapf Dingbats arrow U+27A1 was deemed unmodifiable, because its identity is strongly coupled to the original arrow glyph in the ITC Zapf Dingbats typeface. The now-generic black arrows U+2B05–U+2B07 were thus disunified from rotations of U+27A1. A new character, U+2B95 Rightwards Black Arrow, was added with the intention of completing the U+2B05–U+2B07 set; it received a correspondingly matching representative glyph.

## Present issues

The new U+2B95 Rightwards Black Arrow, together with the now-generic U+2B05–U+2B07, is supposed to form a single set of arrows with correspondingly matching representative glyphs, as Mr. Suignard has said. It will take time for U+2B95 to be implemented by new fonts, but "the explicit glyph updates for U+2B00..U+2B0D ... were clearly intentional" (Whistler 2015-05). In other words, according to the Standard since version 7.0, the matching character that is the rightwards version of U+2B05–U+2B07 is now clearly U+2B95, not U+27A1, which has been disunified from the set and is now merely a Zapf Dingbat.

However, this is not yet completely true: UTR #51 and emoji-data.txt currently define the rightwards version of U+2B05–U+2B07 to be the Zapf Dingbats arrow U+27A1. UTR #51 currently does not define U+2B95 to be an emoji character. Furthermore, there are no text/emoji standardized variants of U+2B95 yet, unlike for U+27A1. Upon reviewing the history above, it becomes apparent that this is due to missed timing between the advent of Unicode emoji (in 2011–2013) and the advent of U+2B95 (in 2014). Apple, Google, and Microsoft had no character other than U+27A1 that they could use for the Japanese carriers' rightwards black arrow; at that time U+27A1 was still implicitly unified with the other black arrows.

It seems possible to change the emoji data to match the intended usage of the new U+2B95 more logically. My questions are thus:

1. Should U+2B95 be added to the set of emoji characters as given by UTR #51 and emoji-data.txt, with the intent of completing the 2014 harmonization with U+2B05–U+2B07?

2. If #1's answer is yes, then should U+2B95 be given text/emoji standardized variation sequences, just as U+2B05–U+2B07 already have? (The mechanics are sketched just after this list.)

3. Regardless of the answers to the above, should clarification on the conceptual differences between U+2B95 (the right black arrow completing U+2B05–U+2B07) and U+27A1 (the Zapf Dingbat) be added to their entries in the Standard's code charts? This might clear up a lot of confusion among users and font creators, and would only make clearer what has already been made explicit by 7.0's glyph changes.
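The variation sequences in question 2 are mechanically simple: a sequence is just the base character followed by U+FE0E (VARIATION SELECTOR-15, text presentation) or U+FE0F (VARIATION SELECTOR-16, emoji presentation). A small sketch in Python; treating U+2B95 as participating is hypothetical, since that is exactly what the question proposes.

    # U+FE0E requests text presentation; U+FE0F requests emoji presentation.
    # U+27A1 already has both standardized sequences; giving them to U+2B95
    # is hypothetical here (it is what question 2 proposes).
    ARROWS = {"zapf": "\u27A1", "generic": "\u2B95"}
    for name, base in ARROWS.items():
        text_seq = base + "\uFE0E"    # text-style variation sequence
        emoji_seq = base + "\uFE0F"   # emoji-style variation sequence
        print(name, [f"U+{ord(c):04X}" for c in text_seq + emoji_seq])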
## Possible objections

There are two objections to #1 and #2 that I can foresee.

First, when using emoji, a user might perceive redundancy between an emoji form of U+2B95 and the already existing emoji form of U+27A1, and this might cause confusion over which one to use. However, this redundancy has already existed since Unicode 7.0, when U+2B95 was added in the first place. The Consortium apparently decided at the time that the risk of user confusion between U+2B95 and U+27A1 was worth it in regular-text contexts; I don't see why it would be significantly different in emoji contexts. Vendors' emoji input palettes could simply present only U+2B95, rather than U+27A1, to the user, with little disadvantage.

Second, compatibility mappings with the Japanese carrier sets already use U+27A1, and mappings should generally be stable across versions of Unicode. However, the Unicode emoji data are not yet formally set in stone; there has only been preliminary discussion and the initial publication of UTR #51 (Whistler 2015-10; Davis 2015-10). The mappings with the carrier sets are thus probably not under the same stability guarantees as other formal mappings (and, even if they are, I could find no policy that prohibits modifying formal mappings in general).

In any case, I might make a formal proposal in the future, but I first want to determine how probable it is that such a proposal would be discussed. What would you say the answers to those three questions are?

Sincerely,
J. S. Choi

From petercon at microsoft.com Fri Oct 30 17:03:31 2015
From: petercon at microsoft.com (Peter Constable)
Date: Fri, 30 Oct 2015 22:03:31 +0000
Subject: Latin glottal stop in ID in NWT, Canada
In-Reply-To: <20151030190903.4a97dc45@JRWUBU2>
References: <1786035177.2903.1446107355879.JavaMail.www@wwinf1n04> <20151030190903.4a97dc45@JRWUBU2>
Message-ID:

This is more plausible. The Tlingit peoples live in coastal regions, SW parts of Yukon Territory and Alaska. That's not what I would have referred to as "Northwest Territories". And it's totally unrelated to the thread, which was clearly about the Northwest Territories, not Yukon Territory.

Can you point to information on Tlingit materials in Cyrillic script?

Peter

From richard.wordingham at ntlworld.com Fri Oct 30 18:34:10 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 30 Oct 2015 23:34:10 +0000
Subject: Latin glottal stop in ID in NWT, Canada
In-Reply-To: References: <1786035177.2903.1446107355879.JavaMail.www@wwinf1n04> <20151030190903.4a97dc45@JRWUBU2>
Message-ID: <20151030233410.01374bfb@JRWUBU2>

On Fri, 30 Oct 2015 22:03:31 +0000 Peter Constable wrote:

> This is more plausible. The Tlingit peoples live in coastal regions,
> SW parts of Yukon Territory and Alaska. That's not what I would have
> referred to as "Northwest Territories".

I think Cyrillic got into the thread by mistake.

> Can you point to information on Tlingit materials in Cyrillic script?

Google ('Tlingit Cyrillic') does a better job than me! There's an example linked to from the Wikipedia article https://en.wikipedia.org/wiki/Tlingit_alphabet, 'Indication of the Pathway into the Kingdom of Heaven'. I presume the original spelling has been preserved.

There's an interesting account in 'Russian Orthodox Church Of Alaska And The Aleutian Islands And Its Relation To Native American Traditions: An Attempt At A Multicultural Society, 1794-1912' by Viacheslav Vsevolodovich Ivanov.
It's interesting that much of the action happened under American rule - allegedly, Orthodox Christianity did well because it wasn't American!

Richard.

From richard.wordingham at ntlworld.com Fri Oct 30 18:40:21 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 30 Oct 2015 23:40:21 +0000
Subject: Terminology
In-Reply-To: <165409839.4124.1446110244182.JavaMail.www@wwinf1n04>
References: <1022406759.1183.1445583561682.JavaMail.www@wwinf1h14> <20151023085315.443d56b6@JRWUBU2> <786526406.6308.1445600066983.JavaMail.www@wwinf1h14> <165409839.4124.1446110244182.JavaMail.www@wwinf1n04>
Message-ID: <20151030234021.0ad92851@JRWUBU2>

On Thu, 29 Oct 2015 10:17:23 +0100 (CET) Marcel Schneider wrote:

> But "script" for "writing system" looks good to me, and certainly to
> all people familiar with Unicode thanks to some training.

The concept of an English *script* as opposed to a French *script* is a bad idea. I have recently had occasion to contrast writing systems in the Thai script (for Thai and for Pali respectively), and also to contrast writing systems in the Hebrew script (for Yiddish, and for two systems for Hebrew).

> Thanks to the Unicode Consortium!

TUS 8.0 Section 6.1 Paragraph 2 makes it clear that the two concepts are quite distinct.

Richard.

From verdy_p at wanadoo.fr Fri Oct 30 19:19:14 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sat, 31 Oct 2015 01:19:14 +0100
Subject: On emoji and the two rightwards black arrows
In-Reply-To: References: Message-ID:

IMHO, all mappings from other encodings are just best efforts, not normative. In many cases those mappings are ambiguous, including for some legacy encodings that have been widely used for decades and are still used today (such as CP437).

The reason is that the old registrations for legacy 8-bit charsets only showed charts with approximate glyphs (often of poor quality: low-resolution rendering on printed paper, stray dots of ink, later scanned at low resolution), but no actual properties (and often without even listing names for the characters). For a long time those charts were interpreted differently by different vendors (such as printer or screen manufacturers, at a time when dot-matrix printers and displays had poor resolution), and sometimes the glyphs changed slightly between device models or versions from the same vendor.

So characters in those mapping tables were widely used to mean different variants of characters that are now distinguished in the UCS (e.g., in CP437, the symbol that looks either like a big epsilon or like the "is member of" math symbol; the UCS mappings of other symbols that look like Greek letters in CP437 and similar charsets are not really set in stone, and it is not even clear whether they should map to UCS symbols or to UCS Greek letters; the same applies to various geometric symbols, including arrows and bullets). Those mappings are just there to help convert old documents to the UCS, but the choice is sometimes questionable, and corrections may need to be made to select another character, depending on the context of use. Unfortunately, the existing mapping tables only document mappings from a legacy code position to a single suggested code point, not to its other possible alternatives.

Then we fall into the category of characters that are easily confusable: maybe these mapping tables do not need to be changed, but they should be used together with the data files on confusable characters (a list initiated during the development of IDNA). There are other data available (visible in the Unicode charts) that also indicate a few related or similar characters, but these are mostly notes, not engraved in stone, and this data is difficult to use.

So, in summary, those mapping tables are just suggestions, and implementers may still map legacy encodings to different subsets of the UCS. But we should be concerned about conversion in the other direction, from the UCS to legacy encodings: all candidate UCS code points should be reverse-mapped to the same legacy code position (as much as possible).
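A sketch in Python of that reverse direction, under illustrative assumptions: the CP437 position 0xEE (read variously as a big epsilon or as the "is member of" sign) is taken as the target, and the particular set of UCS alternatives folded onto it is a guess, not a normative mapping.

    # Several UCS code points collapse onto one legacy code position when
    # converting *to* CP437. The alternatives below are illustrative
    # assumptions, not a normative mapping.
    TO_CP437 = {
        0x03B5: 0xEE,  # GREEK SMALL LETTER EPSILON
        0x2208: 0xEE,  # ELEMENT OF
        0x220A: 0xEE,  # SMALL ELEMENT OF
    }

    def encode_cp437_symbols(text, table=TO_CP437, fallback=0x3F):
        # 0x3F is "?", a common fallback for unmappable characters
        return bytes(table.get(ord(ch), fallback) for ch in text)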
Those mapping tables are thus not part of the stable standard, and there is no stability policy about them (IMHO, such a policy should not be adopted). They are just contributions to help the transition to the UCS; they are subject to updates when better mappings are developed later, and some applications or vendors will still develop their own preferences.

If you consider the two UCS characters in question, my opinion is that they are basically the same, and that the mappings from Zapf Dingbats, the DPRK standard, and Wingdings/Webdings are kept for historical reasons but are not necessarily the best ones. I would see no violation of the standard in a font that mapped both UCS characters to exactly the same glyph, using metrics that create a coherent set of black arrows: either the DPRK metrics for all four arrows, or the Zapf Dingbats metrics for all four arrows. Their disunification is not really justified, except to work with applications or documents that used fonts not mapping all of them, made to work only with DPRK-encoded documents or with Dingbats-encoded documents. The disunification is based only on those specific old (defective) fonts; modern fonts should not be defective and should map all of these characters as if they were aliased, without any need to distinguish them visually. But because they are not canonically equivalent, these characters should be explicitly listed in the list of confusables (which version will be preferred, and which versions will be aliased to the preferred form, for applications like IDNA, is a question to work out, as this is a possible security concern if some of these characters are allowed in identifiers intended to be secure).
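A sketch in Python of the folding Philippe describes, for the pair in question; choosing U+27A1 as the preferred form and folding U+2B95 onto it is an illustrative assumption, not something taken from the published confusables data.

    # Map each member of a confusable pair onto one preferred form before
    # comparing identifiers. The pairing below is an illustrative
    # assumption, not taken from the published confusables data.
    FOLD = {"\u2B95": "\u27A1"}  # RIGHTWARDS BLACK ARROW -> BLACK RIGHTWARDS ARROW

    def fold_confusables(s):
        return "".join(FOLD.get(ch, ch) for ch in s)

    assert fold_confusables("go\u2B95") == fold_confusables("go\u27A1")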
From petercon at microsoft.com Sat Oct 31 01:40:45 2015
From: petercon at microsoft.com (Peter Constable)
Date: Sat, 31 Oct 2015 06:40:45 +0000
Subject: Latin glottal stop in ID in NWT, Canada
In-Reply-To: <20151030233410.01374bfb@JRWUBU2>
References: <1786035177.2903.1446107355879.JavaMail.www@wwinf1n04> <20151030190903.4a97dc45@JRWUBU2>, <20151030233410.01374bfb@JRWUBU2>
Message-ID:

The Aleutian Islands are a long way from NWT. I don't associate Tlingit with the Aleutians, and wasn't aware of an early Cyrillic orthography. But it's also not a language of NWT. It's spoken in areas near the coast. My sister lives in Carcross, which is a Tlingit village. This is hundreds of miles from NWT.

Peter

Sent from my IBM 3277/APL

From charupdate at orange.fr Sat Oct 31 11:04:26 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Sat, 31 Oct 2015 17:04:26 +0100 (CET)
Subject: Terminology (is still: Latin glottal stop in ID in NWT, Canada; and referring to: Emoji Proposal: Face With One Eyebrow Raised)
Message-ID: <790085166.11530.1446307466304.JavaMail.www@wwinf1p13>

On Fri, 30 Oct 2015 23:40:21 +0000, Richard Wordingham wrote:

> On Thu, 29 Oct 2015 10:17:23 +0100 (CET) Marcel Schneider wrote:
>
> > But "script" for "writing system" looks good to me, and certainly to
> > all people familiar with Unicode thanks to some training.
>
> The concept of an English *script* as opposed to a French *script* is
> a bad idea.

Today, when talking about the Latin writing system as referred to in TUS (good idea to point to the relevant section!), it makes no more sense to think in terms of national subsets. Globalization brings the necessity of universal Latin keyboard layouts: one (or several, e.g. ergonomic variants) per country, each optimized for *all* the national languages but mutually identical in extension. Covering *all* Latin characters, regardless of their frequency (and of their eventual deprecation), is a working principle. IMHO, implementers should refrain from leading users by the arm. That glottal stops are present on the German T3 keyboard was already mentioned in some way.
We can never insist enough upon the fact that Germany, where no glottal stop is used, has a standardized keyboard enabling users to write *all* official languages that are written in the Latin script. I see no reason (well, I'm back to writing about this) for officials to refuse to update their practice. They don't refuse, in fact. It's just a database and a printer somewhere in the Northwest Territories. Without these two, everybody would be pleased to install an up-to-date keyboard layout and produce a fine birth certificate for anybody, regardless of which letters of the official Latin script it contained.

> I have recently had occasion to contrast writing systems in the Thai
> script (for Thai and for Pali respectively), and also to contrast
> writing systems in the Hebrew script (for Yiddish, and for two systems
> for Hebrew).
>
> > Thanks to the Unicode Consortium!
>
> TUS 8.0 Section 6.1 Paragraph 2 makes it clear that the concepts are
> quite distinct.

The concept of "the writing system of the Latin script", for example, as opposed to "the way a particular language is written" - for example, Chipewyan? When I talked about 'script', I meant 'script' as in 'Latin script'. Chipewyan is written in the Latin script, Bamanankan is written in the Latin script, English is written in the Latin script: either all three of these statements are obvious, or none of them is. In any case, it is no longer acceptable to raise one eyebrow (emoji needed) at the sight of a glottal stop to be put into a birth certificate, a passport, or a credit card. Go buy new database hosts and new printers! In case no other argument is convincing, here is another one: that will boost national IT companies, vendors, and technicians. All good!

Hopeful,
Marcel