From mark at macchiato.com  Thu Oct 1 00:01:12 2015
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Thu, 1 Oct 2015 07:01:12 +0200
Subject: Unicode in passwords
In-Reply-To: <000601d0fbff$42881070$c7983150$@gmail.com>
References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <000601d0fbff$42881070$c7983150$@gmail.com>
Message-ID:

I've heard some concerns, mostly around the UI for people typing in passwords: they get frustrated when they have to type their password on different devices:

1. A device may not have keyboard mappings with all the keys for their language.
2. The keyboard mappings across devices vary in where they put keys, especially for minority script characters using some pattern of shift/alt/option/etc. So the pattern of keys that they use on one may be different than on another.
3. People are often 'blind' to the characters being entered: they just see a dot, for example. If the keyboards for their language are not standard, then that makes it difficult.
4. Even if they see, for an instant, the character they type, if the device doesn't have a font for their language's characters, it may be just a box.
5. Even if those are not true, the glyph may not be distinctive enough if the size is too small.

Mark

*« Il meglio è l'inimico del bene »*

On Thu, Oct 1, 2015 at 6:11 AM, Jonathan Rosenne wrote:

> For languages such as Java, passwords should be handled as byte arrays
> rather than strings. This may make it difficult to apply normalization.
>
> Jonathan Rosenne
>
> *From:* Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of *Clark S. Cox III
> *Sent:* Thursday, October 01, 2015 2:16 AM
> *To:* Hans Åberg
> *Cc:* unicode at unicode.org; John O'Conner
> *Subject:* Re: Unicode in passwords
>
> On 2015/09/30, at 13:29, Hans Åberg wrote:
>
> On 30 Sep 2015, at 18:33, John O'Conner wrote:
>
> Can you recommend any documents to help me understand potential issues (if
> any) for password policies and validation methods that allow characters
> from more "exotic" portions of the Unicode space?
>
> On UNIX computers, one computes a hash (like SHA-256), which is then used
> to authenticate the password up to a high probability. The hash is stored
> in the open, but it is not known how to compute the password from the hash,
> so knowing the hash does not easily allow authentication.
>
> So if the password is
>
> … normalized and then …
>
> encoded in say UTF-8 and then hashed, it would seem to take care of most
> problems.
>
> You really wouldn't want "Schlüssel" and "Schlüssel" being different
> passwords, would you? (Assuming that my mail client and/or OS is not
> interfering, the first is NFC, while the second is NFD.)

From marc at keyman.com  Thu Oct 1 00:19:35 2015
From: marc at keyman.com (Marc Durdin)
Date: Thu, 1 Oct 2015 05:19:35 +0000
Subject: Unicode in passwords
In-Reply-To:
References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <000601d0fbff$42881070$c7983150$@gmail.com>
Message-ID: <1CEDD746887FFF4B834688E7AF5FDA5A82323850@federation.tavultesoft.local>

That's a good list. A few other things I've seen:

1. Even if the user sees the character for an instant, complex script characters can be very puzzling as they appear differently and "out of order" when isolated.
2. The number of dots corresponds to the number of code points, which is misleading with complex scripts or advanced input methods: you won't necessarily see one dot per keystroke; in some cases, typing a character may replace a dot with another dot or even delete a dot.
3. Directionality can be frustrating. I've had to assist in situations where a user has set a new Windows password using a custom keyboard, and then been unable to log in, e.g. with Remote Desktop, or even with the standard Windows login screen.

iOS, for example, doesn't even allow the user to select a different input method for password boxes; it seems to always be Latin script only (even if you've removed all your Latin script keyboards from Settings).

Marc

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Mark Davis ☕️
Sent: Thursday, 1 October 2015 3:01 PM
To: Jonathan Rosenne
Cc: Unicode Public
Subject: Re: Unicode in passwords

I've heard some concerns, mostly around the UI for people typing in passwords: they get frustrated when they have to type their password on different devices:
1. A device may not have keyboard mappings with all the keys for their language.
2. The keyboard mappings across devices vary in where they put keys, especially for minority script characters using some pattern of shift/alt/option/etc. So the pattern of keys that they use on one may be different than on another.
3. People are often 'blind' to the characters being entered: they just see a dot, for example. If the keyboards for their language are not standard, then that makes it difficult.
4. Even if they see, for an instant, the character they type, if the device doesn't have a font for their language's characters, it may be just a box.
5. Even if those are not true, the glyph may not be distinctive enough if the size is too small.

Mark

« Il meglio è l'inimico del bene »

On Thu, Oct 1, 2015 at 6:11 AM, Jonathan Rosenne wrote:

For languages such as Java, passwords should be handled as byte arrays rather than strings. This may make it difficult to apply normalization.

Jonathan Rosenne

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Clark S. Cox III
Sent: Thursday, October 01, 2015 2:16 AM
To: Hans Åberg
Cc: unicode at unicode.org; John O'Conner
Subject: Re: Unicode in passwords

On 2015/09/30, at 13:29, Hans Åberg wrote:

On 30 Sep 2015, at 18:33, John O'Conner wrote:

Can you recommend any documents to help me understand potential issues (if any) for password policies and validation methods that allow characters from more "exotic" portions of the Unicode space?

On UNIX computers, one computes a hash (like SHA-256), which is then used to authenticate the password up to a high probability. The hash is stored in the open, but it is not known how to compute the password from the hash, so knowing the hash does not easily allow authentication.

So if the password is

… normalized and then …

encoded in say UTF-8 and then hashed, it would seem to take care of most problems.

You really wouldn't want "Schlüssel" and "Schlüssel" being different passwords, would you? (Assuming that my mail client and/or OS is not interfering, the first is NFC, while the second is NFD.)
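A minimal sketch, in Java (the language Jonathan mentions), of the normalize-then-hash flow Hans and Clark describe above. The specific choices (NFC, UTF-8, raw SHA-256, and the PasswordHashSketch class name) are illustrative assumptions, not anything proposed in the thread; a production system would add a salt and a deliberately slow password-hashing function, and going through String here cuts against the byte-array handling Jonathan recommends.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.text.Normalizer;
import java.util.Arrays;

public class PasswordHashSketch {
    // Normalize to NFC so that "Schlüssel" entered in NFC and in NFD
    // hashes identically, then encode as UTF-8 and digest.
    static byte[] hashPassword(String password) throws Exception {
        String nfc = Normalizer.normalize(password, Normalizer.Form.NFC);
        byte[] utf8 = nfc.getBytes(StandardCharsets.UTF_8);
        return MessageDigest.getInstance("SHA-256").digest(utf8);
    }

    public static void main(String[] args) throws Exception {
        String nfcForm = "Schl\u00FCssel";   // U+00FC, precomposed
        String nfdForm = "Schlu\u0308ssel";  // u + U+0308 COMBINING DIAERESIS
        System.out.println(Arrays.equals(hashPassword(nfcForm),
                                         hashPassword(nfdForm)));  // prints: true
    }
}

Without the normalization step, the two inputs above would produce different digests even though they are canonically equivalent.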
From richard.wordingham at ntlworld.com  Thu Oct 1 02:33:22 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 1 Oct 2015 08:33:22 +0100
Subject: Unicode in passwords
In-Reply-To:
References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <000601d0fbff$42881070$c7983150$@gmail.com>
Message-ID: <20151001083322.5440cc2a@JRWUBU2>

On Thu, 1 Oct 2015 07:01:12 +0200
Mark Davis ☕️ wrote:

> I've heard some concerns, mostly around the UI for people typing in
> passwords: they get frustrated when they have to type their
> password on different devices:
>
> 1. A device may not have keyboard mappings with all the keys for
> their language.

The typographers will probably give English as an example! Where's the en dash key?

> 2. The keyboard mappings across devices vary in where they put keys,
> especially for minority script characters using some pattern of
> shift/alt/option/etc. So the pattern of keys that they use on one
> may be different than on another.

Even ASCII can have problems. A password containing '#' and '|' can't be entered when a physical US keyboard (102 keys) is interpreted using a mapping for a British keyboard (103 keys). (There seem to be different conventions as to which key is missing.)

Richard.

From richard.wordingham at ntlworld.com  Thu Oct 1 02:42:28 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 1 Oct 2015 08:42:28 +0100
Subject: UAX #29, Unicode Text Segmentation, update to improve Mongolian word segmentation
In-Reply-To: <560C4E6D.4050004@unicode.org>
References: <560C4E6D.4050004@unicode.org>
Message-ID: <20151001084228.63589572@JRWUBU2>

On Wed, 30 Sep 2015 14:04:45 -0700 announcements at unicode.org wrote:

> For further background on this issue and possible ways to address it,
> see PRI #308, /Property Change for U+202F NARROW NO-BREAK SPACE (NNBSP)/.

Is this the announcement of PRI #308?

Richard.

From mathias at qiwi.be  Thu Oct 1 02:59:22 2015
From: mathias at qiwi.be (Mathias Bynens)
Date: Thu, 1 Oct 2015 09:59:22 +0200
Subject: Unicode in passwords
In-Reply-To: <1CEDD746887FFF4B834688E7AF5FDA5A82323850@federation.tavultesoft.local>
References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <000601d0fbff$42881070$c7983150$@gmail.com> <1CEDD746887FFF4B834688E7AF5FDA5A82323850@federation.tavultesoft.local>
Message-ID: <0D7A9A89-4B53-421F-BCEF-CFD975B5F11B@qiwi.be>

> On 1 Oct 2015, at 07:19, Marc Durdin wrote:
>
> 2. The number of dots corresponds to the number of code points, which is
> misleading with complex scripts or advanced input methods: you won't
> necessarily see one dot per keystroke; in some cases, typing a character
> may replace a dot with another dot or even delete a dot.

Lots of systems have a bug where supplementary code points show up as two dots instead of one, due to UTF-16 being used internally. OS X is an example. Demo (open in your browser): data:text/html,

From mark at macchiato.com  Thu Oct 1 05:18:47 2015
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Thu, 1 Oct 2015 12:18:47 +0200
Subject: Unicode in passwords
In-Reply-To: <20151001083322.5440cc2a@JRWUBU2>
References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <000601d0fbff$42881070$c7983150$@gmail.com> <20151001083322.5440cc2a@JRWUBU2>
Message-ID:

As to #1, my note needs some clarification.
For characters that don't typically occur on *any* keyboards, people don't typically use those in their passwords, so switching between different devices doesn't matter. (One caveat would be where the password dialog permits selection from a palette. That way it is independent of device.)

The problem comes in where someone uses (as I do) a Mac, a Windows box, a Chromebook, and an Android tablet & phone. The Mac makes it easy to type an em-dash, to use your example. It is slightly less easy on Android, a real pain on Windows, and I haven't even tried on a Chromebook (maybe easy, maybe not, just haven't tried). So for me to use an em-dash in a password would just be opening up to annoyance.

I just had a quick look, and it appears that on the latest systems we have data for in CLDR, em-dash is typeable (somehow) on:

- all of the Android keyboards
- 85% of the OS X keyboards
- 27% of the Chrome OS keyboards
- 9% of the Windows keyboards

http://www.unicode.org/cldr/charts/28/keyboards/chars2keyboards.html

It's even somewhat uglier in the case where I'm typing a password on a borrowed/public computing device (although typing a password on such a device may not be exactly a great idea from a security standpoint!).

Mark

*« Il meglio è l'inimico del bene »*

On Thu, Oct 1, 2015 at 9:33 AM, Richard Wordingham <richard.wordingham at ntlworld.com> wrote:

> On Thu, 1 Oct 2015 07:01:12 +0200
> Mark Davis ☕️ wrote:
>
> > I've heard some concerns, mostly around the UI for people typing in
> > passwords: they get frustrated when they have to type their
> > password on different devices:
> >
> > 1. A device may not have keyboard mappings with all the keys for
> > their language.
>
> The typographers will probably give English as an example! Where's
> the en dash key?
>
> > 2. The keyboard mappings across devices vary in where they put keys,
> > especially for minority script characters using some pattern of
> > shift/alt/option/etc. So the pattern of keys that they use on one
> > may be different than on another.
>
> Even ASCII can have problems. A password containing '#' and '|' can't
> be entered when a physical US keyboard (102 keys) is interpreted using
> a mapping for a British keyboard (103 keys). (There seem to be
> different conventions as to which key is missing.)
>
> Richard.

From A.Schappo at lboro.ac.uk  Thu Oct 1 07:56:48 2015
From: A.Schappo at lboro.ac.uk (Andre Schappo)
Date: Thu, 1 Oct 2015 12:56:48 +0000
Subject: Unicode in passwords
In-Reply-To: <20151001083322.5440cc2a@JRWUBU2>
References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <000601d0fbff$42881070$c7983150$@gmail.com> <20151001083322.5440cc2a@JRWUBU2>
Message-ID: <7BD7361C-AAFD-40D9-9D14-5B41295EAE6D@lboro.ac.uk>

On 1 Oct 2015, at 08:33, Richard Wordingham wrote:

> Even ASCII can have problems. A password containing '#' and '|' can't
> be entered when a physical US keyboard (102 keys) is interpreted using
> a mapping for a British keyboard (103 keys). (There seem to be
> different conventions as to which key is missing.)

I used to have a # in one of my passwords. It used to be fun finding where the # key was on a computer's default pre-login keyboard mapping, which frequently did not match what was printed on the physical keys. I became quite adept at it, and it certainly made for a more secure password because of the challenge of finding # on the keyboard.
I, personally, would really like to have a non-ASCII Unicode password. When choosing a non-ASCII Unicode password, I would test to make sure I could enter it on all the devices I use.

André Schappo

From richard.wordingham at ntlworld.com  Thu Oct 1 11:26:33 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 1 Oct 2015 17:26:33 +0100
Subject: NNBSP and Word Boundaries
Message-ID: <20151001172633.2a72f48f@JRWUBU2>

The background document for PRI #308 (Property Change for NNBSP), http://www.unicode.org/review/pri308/pri308-background.html , says,

"The only other widely noted use for U+202F NNBSP is for representation of the thin non-breaking space (espace fine insécable) regularly seen next to certain punctuation marks in French style typography. However, the word segmentation change for U+202F should have no impact in that context, as ExtendNumLet is explicitly for preventing breaks between letters, but does not prevent the identification of word boundaries next to punctuation marks."

Unfortunately, this isn't quite true. In the text fragment " dit : " (with an NNBSP before the colon), there would be internal word boundaries before 'd' and before and after ':', but the word isolated would be the four characters "dit" plus the trailing NNBSP.

One solution would be to replace NNBSP by U+2009 THIN SPACE, for with untailored line-breaking there would be no line break between it and the 't' or colon, but there would be a word break between the 't' and the thin space.

The problem is that characters with property ExtendNumLet can be the first or last character of a word as well as a character strictly within a word. In this respect, the property differs from characters with the property MidNumLet. The problem with using that property instead is that such characters, such as FULL STOP, may be flanked by letters or numbers within a word, but not both. The problem then arises with the Mongolian analogue of '4th' etc. - it is written digit, NNBSP, letters, and is a single word.

Richard.

From mark at macchiato.com  Fri Oct 2 02:25:01 2015
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Fri, 2 Oct 2015 09:25:01 +0200
Subject: NNBSP and Word Boundaries
In-Reply-To: <20151001172633.2a72f48f@JRWUBU2>
References: <20151001172633.2a72f48f@JRWUBU2>
Message-ID:

Like Andy, I'm hesitant about changing the gc of NNBSP, because of backwards compatibility concerns.

I'm also starting to think that scoping the wb change to Mongolian may not be a bad thing. We might want to explore what it would look like, since it would preserve the maximum compatibility for current use of NNBSP with French and other languages. (The use of NNBSP in French, although not all that common, I suspect would swamp (in terms of frequency of usage) the use with Mongolian, simply because the amount of text worldwide in French is so much greater.)

Context

The proposed WB change is from XX to EX

Old relevant props:
WB ; EX ; ExtendNumLet
WB ; LE ; ALetter
WB ; XX ; Other

Old rules with EX:
WB13a (AHLetter | Numeric | Katakana | ExtendNumLet) × ExtendNumLet
WB13b ExtendNumLet × (AHLetter | Numeric | Katakana)

====

Off the top of my head, perhaps something like:

We add:
WB ; ML ; Mongolian_Letter
WB ; NN ; NNBSP // maybe different name

We change the contents of LE and XX to move characters to the two new value sets. E.g., ML gets http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:scx=/Mong/:]&[:wb=ALetter:]

We change the "macro" AHLetter to (ALetter | Hebrew_Letter | Mongolian_Letter)

*At this point, all behaves the same; that is just a 'refactoring'.*

Now we can modify the behavior for sequences with NN adjacent to ML. We add:

WB13c Mongolian_Letter × NNBSP
WB13d NNBSP × Mongolian_Letter

*If* we want to also change behavior on the other side of the NNBSP, whenever the Mongolian_Letter and NNBSP occur in sequence, we add 2 additional rules (with the appropriate values for ..., like Numeric):

WB13c Mongolian_Letter NNBSP × (...)
WB13d (...) × NNBSP Mongolian_Letter

From lists+unicode at seantek.com  Fri Oct 2 23:46:58 2015
From: lists+unicode at seantek.com (Sean Leonard)
Date: Fri, 2 Oct 2015 21:46:58 -0700
Subject: Acquiring DIS 10646
Message-ID: <560F5DC2.4080507@seantek.com>

As part of yet more research, I would like to get a hold of DIS 10646, aka Draft International Standard ISO/IEC 10646.1 (circa 1990 or 1991). I understand that Draft 2 (10646.2) was accepted and therefore became ISO/IEC 10646-1:1993.

Therefore, I am looking for a copy (preferably free, preferably online) of DIS 10646. Maybe also the final one too. Does anyone know how to get it/them?

Thank you,

Sean

From michel at suignard.com  Sat Oct 3 00:28:35 2015
From: michel at suignard.com (Michel Suignard)
Date: Sat, 3 Oct 2015 05:28:35 +0000
Subject: Acquiring DIS 10646
In-Reply-To: <560F5DC2.4080507@seantek.com>
References: <560F5DC2.4080507@seantek.com>
Message-ID:

ISO never keeps previous versions of standards. You can look into the WG2 web site at dkuug.dk, which will give you some versions of these documents (Google or your favorite search engine will be your friend), although all that may disappear any day. If you tell me what you are looking for I can help you. Bear in mind that anything that ISO does is copyrighted. Therefore, forget about a free online version of DIS 10646 of whatever version you are looking for.

There is a reason that Unicode (all versions still visible, archive up to 2000 increasingly visible) is a much better source for references.

Michel

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Sean Leonard
Sent: Friday, October 2, 2015 9:47 PM
To: unicode at unicode.org
Subject: Acquiring DIS 10646

As part of yet more research, I would like to get a hold of DIS 10646, aka Draft International Standard ISO/IEC 10646.1 (circa 1990 or 1991). I understand that Draft 2 (10646.2) was accepted and therefore became ISO/IEC 10646-1:1993.

Therefore, I am looking for a copy (preferably free, preferably online) of DIS 10646. Maybe also the final one too. Does anyone know how to get it/them?

Thank you,

Sean

From lists+unicode at seantek.com  Sat Oct 3 10:15:55 2015
From: lists+unicode at seantek.com (Sean Leonard)
Date: Sat, 3 Oct 2015 08:15:55 -0700
Subject: Acquiring DIS 10646
In-Reply-To:
References: <560F5DC2.4080507@seantek.com>
Message-ID: <560FF12B.8010105@seantek.com>

Thanks.

Well, "DIS 10646" is the Draft International Standard, particularly Draft 1, from ~1990 or ~1991. (Sometimes it might have been called 10646.1.) Therefore it would likely only be in print form (or printed and scanned form). It's pretty old. What I understand is that Draft 1 got shot down because it was at variance with the nascent Unicode effort; Draft 2 was eventually adopted as ISO 10646:1993, and is equivalent to Unicode 1.1.
(10646-1:1993 plus Amendments 5 to 7 = Unicode 2.0.)

Sean

On 10/2/2015 10:28 PM, Michel Suignard wrote:
> ISO never keeps previous versions of standards. You can look into the wg2
> web site at dkuug.dk that will give you some versions of these documents
> (Google or your favorite search engine will be your friend) although all
> that may disappear any day. If you tell me what you are looking for I can
> help you. Bear in mind that anything that ISO does is copyrighted.
> Therefore, forget about a free online version of DIS 10646 of whatever
> version you are looking for.
> There is a reason that Unicode (all versions still visible, archive up to
> 2000 increasingly visible) is a much better source for references.
>
> Michel

From doug at ewellic.org  Sat Oct 3 13:00:12 2015
From: doug at ewellic.org (Doug Ewell)
Date: Sat, 3 Oct 2015 12:00:12 -0600
Subject: Acquiring DIS 10646
In-Reply-To:
References:
Message-ID: <5EE740EFB8CD44AB92F133A3A5B0D544@DougEwell>

Sean Leonard wrote:

> What I understand is that Draft 1 got shot down because it was at
> variance with the nascent Unicode effort;

If I remember correctly, Draft 1 looked a lot like an updated and expanded version of ISO 2022, much more than it did like today's Unicode/10646.

--
Doug Ewell | http://ewellic.org | Thornton, CO

From jsbien at mimuw.edu.pl  Sat Oct 3 13:24:29 2015
From: jsbien at mimuw.edu.pl (Janusz S. Bien)
Date: Sat, 03 Oct 2015 20:24:29 +0200
Subject: Acquiring DIS 10646
In-Reply-To: <5EE740EFB8CD44AB92F133A3A5B0D544@DougEwell>
References: <5EE740EFB8CD44AB92F133A3A5B0D544@DougEwell>
Message-ID: <20151003202429.31618900k3rm3hi5@mail.mimuw.edu.pl>

Quote/Cytat - Doug Ewell (Sat 03 Oct 2015 08:00:12 PM CEST):

> Sean Leonard wrote:
>
>> What I understand is that Draft 1 got shot down because it was at
>> variance with the nascent Unicode effort;
>
> If I remember correctly, Draft 1 looked a lot like an updated and
> expanded version of ISO 2022, much more than it did like today's
> Unicode/10646.

Rob Pike, Ken Thompson
Hello World
http://plan9.bell-labs.com/sys/doc/utf.html

The draft of ISO 10646 was not very attractive to us. It defined a sparse set of 32-bit characters, which would be hard to implement and have punitive storage requirements. Also, the draft attempted to mollify national interests by allocating 16-bit subspaces to national committees to partition individually. The suggested mode of use was to "flip" between separate national standards to implement the international standard.

Regards

Janusz

--
Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

From asmus-inc at ix.netcom.com  Sat Oct 3 14:28:50 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Sat, 3 Oct 2015 12:28:50 -0700
Subject: Acquiring DIS 10646
In-Reply-To: <560FF12B.8010105@seantek.com>
References: <560F5DC2.4080507@seantek.com> <560FF12B.8010105@seantek.com>
Message-ID: <56102C72.9010008@ix.netcom.com>

An HTML attachment was scrubbed...
From lists+unicode at seantek.com  Sat Oct 3 14:35:38 2015
From: lists+unicode at seantek.com (Sean Leonard)
Date: Sat, 3 Oct 2015 12:35:38 -0700
Subject: Acquiring DIS 10646
In-Reply-To: <20151003202429.31618900k3rm3hi5@mail.mimuw.edu.pl>
References: <5EE740EFB8CD44AB92F133A3A5B0D544@DougEwell> <20151003202429.31618900k3rm3hi5@mail.mimuw.edu.pl>
Message-ID: <56102E0A.6020503@seantek.com>

On 10/3/2015 11:24 AM, Janusz S. Bien wrote:
> Quote/Cytat - Doug Ewell (Sat 03 Oct 2015 08:00:12 PM CEST):
>
>> Sean Leonard wrote:
>>
>>> What I understand is that Draft 1 got shot down because it was at
>>> variance with the nascent Unicode effort;
>>
>> If I remember correctly, Draft 1 looked a lot like an updated and
>> expanded version of ISO 2022, much more than it did like today's
>> Unicode/10646.
>
> Rob Pike, Ken Thompson
> Hello World
>
> http://plan9.bell-labs.com/sys/doc/utf.html
>
> The draft of ISO 10646 was not very attractive to us. It defined a
> sparse set of 32-bit characters, which would be hard to implement and
> have punitive storage requirements. Also, the draft attempted to
> mollify national interests by allocating 16-bit subspaces to national
> committees to partition individually. The suggested mode of use was to
> "flip" between separate national standards to implement the
> international standard.

Yes, that's the one.

Sean

From lists+unicode at seantek.com  Sun Oct 4 07:30:53 2015
From: lists+unicode at seantek.com (Sean Leonard)
Date: Sun, 4 Oct 2015 05:30:53 -0700
Subject: Acquiring DIS 10646
In-Reply-To: <56102C72.9010008@ix.netcom.com>
References: <560F5DC2.4080507@seantek.com> <560FF12B.8010105@seantek.com> <56102C72.9010008@ix.netcom.com>
Message-ID: <56111BFD.4000703@seantek.com>

On 10/3/2015 12:28 PM, Asmus Freytag (t) wrote:
> On 10/3/2015 8:15 AM, Sean Leonard wrote:
>> Thanks.
>>
>> Well, "DIS 10646" is the Draft International Standard, particularly
>> Draft 1, from ~1990 or ~1991. (Sometimes it might have been called
>> 10646.1.) Therefore it would likely only be in print form (or printed
>> and scanned form). It's pretty old. What I understand is that Draft 1
>> got shot down because it was at variance with the nascent Unicode
>> effort; Draft 2 was eventually adopted as ISO 10646:1993, and is
>> equivalent to Unicode 1.1. (10646-1:1993 plus Amendments 5 to 7 =
>> Unicode 2.0.)
>
> Sean,
>
> you never explained your specific interest in this matter. Personal
> curiosity? An attempt to write the definitive history of character encoding?

A long time ago, in a galaxy far, far away....

(Okay, it really was not that long ago, and it was pretty close at hand since it was on this list)

I proposed adding C1 Control Pictures to Unicode. I am resurrecting that effort, but more slowly this time, with more research and input from implementers.

The requirement is that all glyphs for U+0000 - U+00FF be graphically distinct. Debuggers used to do this by referencing the graphemes in the hardware code page, such as Code Page 437, but we have come a long way from 1981, so displaying ♣ for 0x05 does not make much modern sense. Merely substituting one of the other legacy code pages in for 0x80 - 0x9F does not make sense either. The characters of Code Page 437 overlap with U+00A0 - U+00FF in that range, for example. (Windows-1252 is somewhat more defensible, but Windows-1252 has 5 unassigned code points so it would be incomplete.)
Sean

From richard.wordingham at ntlworld.com  Sun Oct 4 08:02:01 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sun, 4 Oct 2015 14:02:01 +0100
Subject: Deleting Lone Surrogates
Message-ID: <20151004140201.21c9f941@JRWUBU2>

In the absence of a specific tailoring, is the combination of a lone surrogate and a combining mark a user-perceived character? Does a lone surrogate constitute a user-perceived character?

The problem I have is that because of an application-specific bug, when I attempt to enter the sequence <U+1148F TIRHUTA LETTER VA, U+114BA TIRHUTA SIGN CANDRABINDU>, I appear to be getting the UTF-16 code unit sequence <D805, DC8F, D805, D805, DCBA>, which is being interpreted as the codepoint sequence <U+1148F, U+D805, U+114BA>.

(The problem seems to arise because I use a sequence of two key strokes to enter candrabindu, and the application or input mechanism has to undo the entry of a supplementary character entered in response to the first keystroke. I've reported the problem as Bug 94753.)

Because the lone surrogate is interpreted as the start of a user-perceived character, I can move the cursor to between U+1148F and U+D805. Then pressing the 'delete' key (as opposed to the 'rubout' key) will delete the U+D805. However, if the lone surrogate plus combining mark is a user-perceived character, then all I will be left with is <U+1148F>.

From mark at macchiato.com  Sun Oct 4 08:44:32 2015
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Sun, 4 Oct 2015 15:44:32 +0200
Subject: Deleting Lone Surrogates
In-Reply-To: <20151004140201.21c9f941@JRWUBU2>
References: <20151004140201.21c9f941@JRWUBU2>
Message-ID:

When I use http://unicode.org/cldr/utility/breaks.jsp, it does show the sequence ????? as just two grapheme clusters.

In #29 we are specifically not concerned about ill-formed text (or other degenerate cases). I suppose it would be possible to handle isolated surrogates in a different way (eg always breaking) if it represented a common problem, but someone would have to make a very good case for that.

Mark

*« Il meglio è l'inimico del bene »*

On Sun, Oct 4, 2015 at 3:02 PM, Richard Wordingham <richard.wordingham at ntlworld.com> wrote:

> In the absence of a specific tailoring, is the combination of a lone
> surrogate and a combining mark a user-perceived character? Does a lone
> surrogate constitute a user-perceived character?
>
> The problem I have is that because of an application-specific bug,
> when I attempt to enter the sequence <U+1148F TIRHUTA LETTER VA,
> U+114BA TIRHUTA SIGN CANDRABINDU>, I appear to be getting the UTF-16 code
> unit sequence <D805, DC8F, D805, D805, DCBA>, which is being interpreted as
> the codepoint sequence <U+1148F, U+D805, U+114BA>.
>
> (The problem seems to arise because I use a sequence of two key strokes
> to enter candrabindu, and the application or input mechanism has to undo
> the entry of a supplementary character entered in response to the first
> keystroke. I've reported the problem as Bug 94753.)
>
> Because the lone surrogate is interpreted as the start of a
> user-perceived character, I can move the cursor to between U+1148F and
> U+D805. Then pressing the 'delete' key (as opposed to the 'rubout'
> key) will delete the U+D805. However, if the lone surrogate plus
> combining mark is a user-perceived character, then all I will be left
> with is <U+1148F>.
> At present the offending application is treating
> Tirhuta combining marks as user-perceived characters, but I suspect the
> application has simply not caught up with Unicode Version 7 yet.
>
> Richard.

From verdy_p at wanadoo.fr  Sun Oct 4 08:56:42 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sun, 4 Oct 2015 15:56:42 +0200
Subject: Deleting Lone Surrogates
In-Reply-To: <20151004140201.21c9f941@JRWUBU2>
References: <20151004140201.21c9f941@JRWUBU2>
Message-ID:

IMHO, isolated surrogates are not valid starters for combining sequences; they must remain isolated: deleting this surrogate in your text editor should not delete the following combining mark, which is a separate cluster (even if that cluster is defective before the deletion, as it has NO base starter).

For default grapheme clusters, it would be helpful to add a rule to force a cluster break before and after any lone surrogate (i.e. for grapheme cluster breaking, treat any lone surrogate as if it were a control like NUL U+0000).

2015-10-04 15:02 GMT+02:00 Richard Wordingham <richard.wordingham at ntlworld.com>:

> In the absence of a specific tailoring, is the combination of a lone
> surrogate and a combining mark a user-perceived character? Does a lone
> surrogate constitute a user-perceived character?
>
> The problem I have is that because of an application-specific bug,
> when I attempt to enter the sequence <U+1148F TIRHUTA LETTER VA,
> U+114BA TIRHUTA SIGN CANDRABINDU>, I appear to be getting the UTF-16 code
> unit sequence <D805, DC8F, D805, D805, DCBA>, which is being interpreted as
> the codepoint sequence <U+1148F, U+D805, U+114BA>.
>
> (The problem seems to arise because I use a sequence of two key strokes
> to enter candrabindu, and the application or input mechanism has to undo
> the entry of a supplementary character entered in response to the first
> keystroke. I've reported the problem as Bug 94753.)
>
> Because the lone surrogate is interpreted as the start of a
> user-perceived character, I can move the cursor to between U+1148F and
> U+D805. Then pressing the 'delete' key (as opposed to the 'rubout'
> key) will delete the U+D805. However, if the lone surrogate plus
> combining mark is a user-perceived character, then all I will be left
> with is <U+1148F>. At present the offending application is treating
> Tirhuta combining marks as user-perceived characters, but I suspect the
> application has simply not caught up with Unicode Version 7 yet.
>
> Richard.

From markus.icu at gmail.com  Sun Oct 4 12:50:43 2015
From: markus.icu at gmail.com (Markus Scherer)
Date: Sun, 4 Oct 2015 10:50:43 -0700
Subject: Deleting Lone Surrogates
In-Reply-To:
References: <20151004140201.21c9f941@JRWUBU2>
Message-ID:

I would not spend any time specifying intricate rules for unpaired surrogates in 16-bit strings, or out-of-range values in 32-bit strings. Most processing will treat them like unassigned characters, like U+50005, with only default behaviors.

markus
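To make concrete what an unpaired surrogate in a 16-bit string looks like in practice, here is a small Java sketch; the helper name loneSurrogates is invented for illustration and does not come from any message above. It reports the index of each surrogate code unit that is not part of a well-formed pair, such as the stray 0xD805 in Richard's example.

import java.util.ArrayList;
import java.util.List;

public class LoneSurrogateScan {
    // Return the indices of unpaired surrogate code units in a UTF-16 string.
    static List<Integer> loneSurrogates(String s) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                i++;  // well-formed surrogate pair: skip both units
            } else if (Character.isSurrogate(c)) {
                indices.add(i);  // high with no low after it, or a stray low
            }
        }
        return indices;
    }

    public static void main(String[] args) {
        // VA (D805 DC8F), stray D805, CANDRABINDU (D805 DCBA)
        String s = "\uD805\uDC8F\uD805\uD805\uDCBA";
        System.out.println(loneSurrogates(s));  // prints: [2]
    }
}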
From verdy_p at wanadoo.fr  Sun Oct 4 13:53:25 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sun, 4 Oct 2015 20:53:25 +0200
Subject: Deleting Lone Surrogates
In-Reply-To:
References: <20151004140201.21c9f941@JRWUBU2>
Message-ID:

The default behavior for unassigned characters is to treat them like base characters, so if they are followed by a combining mark, it would create a default grapheme cluster, which is not appropriate here. Surrogates are not characters (so they cannot have any character properties), but they are assigned, and so don't have "default" properties (only meant for *unassigned* codepoints).

I still think that it is safer to treat them (for text segmentation purposes) as pure isolates, i.e. exactly like basic controls such as U+0000 NUL, or such as U+FFFD REPLACEMENT CHARACTER, which is typically used as a visible placeholder for various errors.

For normalisation purposes they should also have combining class 0 (i.e. acting as blockers against reorderings for canonical equivalences), and not be "transparent" (discarded and bypassed as if those surrogates were not present at all).

2015-10-04 19:50 GMT+02:00 Markus Scherer:

> I would not spend any time specifying intricate rules for unpaired
> surrogates in 16-bit strings, or out-of-range values in 32-bit strings.
> Most processing will treat them like unassigned characters, like U+50005,
> with only default behaviors.
> markus

From asmus-inc at ix.netcom.com  Sun Oct 4 14:21:11 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Sun, 4 Oct 2015 12:21:11 -0700
Subject: Acquiring DIS 10646
In-Reply-To: <56111BFD.4000703@seantek.com>
References: <560F5DC2.4080507@seantek.com> <560FF12B.8010105@seantek.com> <56102C72.9010008@ix.netcom.com> <56111BFD.4000703@seantek.com>
Message-ID: <56117C27.6070905@ix.netcom.com>

An HTML attachment was scrubbed...

From asmus-inc at ix.netcom.com  Sun Oct 4 14:30:23 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Sun, 4 Oct 2015 12:30:23 -0700
Subject: Deleting Lone Surrogates
In-Reply-To: <20151004140201.21c9f941@JRWUBU2>
References: <20151004140201.21c9f941@JRWUBU2>
Message-ID: <56117E4F.9010300@ix.netcom.com>

An HTML attachment was scrubbed...

From richard.wordingham at ntlworld.com  Sun Oct 4 14:30:25 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sun, 4 Oct 2015 20:30:25 +0100
Subject: Deleting Lone Surrogates
In-Reply-To:
References: <20151004140201.21c9f941@JRWUBU2>
Message-ID: <20151004203025.605f0ae6@JRWUBU2>

On Sun, 4 Oct 2015 15:44:32 +0200
Mark Davis ☕️ wrote:

> When I use http://unicode.org/cldr/utility/breaks.jsp, it does show
> the sequence ????? as just two grapheme clusters.

But that's the sequence <U+1148F, U+FFFD, U+114BA>, which has no lone surrogates at all! (I had to look at the raw email file to be sure of what the text was - my email client displays U+FFFD and malformed alleged UTF-8 the same.) I believe I would have a good chance of repairing that by replacing U+FFFD by nothing.

It's not even certain that the substitution to replace U+FFFD would work. With a more fully supported script in LibreOffice, I would have to switch 'CTL diacritic' matching off and hope that substitution replaced the shortest match. That currently works for replacing one Thai consonant by another.
To systematically replace a non-spacing Thai character by another, I have to resort to 'regular expression' search and replace. I must hope that they never choose to interpret the search as matching extended grapheme clusters.

Do all Unicode character properties extend to all codepoints? If not, how does one tell which do and which don't? If the Unicode segmentation algorithms do apply to sequences of codepoints, as opposed to merely to Unicode strings, then <U+D805, U+114BA> indeed is a legacy grapheme cluster. It's an extremely unhelpful one!

> In #29 we are specifically not concerned about ill-formed text (or
> other degenerate cases). I suppose it would be possible to handle
> isolated surrogates in a different way (eg always breaking) if it
> represented a common problem, but someone would have to make a very
> good case for that.

I suppose the argument will go that by using rare scripts or obsolete characters, one deserves all the problems that one gets. The only widely used script where one is likely to encounter lone surrogates is CJK, and they are less of a problem there.

Ideally, one shouldn't get isolated surrogates, but when one does, the mechanisms intended to prevent them occurring can make dealing with them difficult.

Richard.

From richard.wordingham at ntlworld.com  Sun Oct 4 14:38:02 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sun, 4 Oct 2015 20:38:02 +0100
Subject: Deleting Lone Surrogates
In-Reply-To:
References: <20151004140201.21c9f941@JRWUBU2>
Message-ID: <20151004203802.2189da64@JRWUBU2>

On Sun, 4 Oct 2015 10:50:43 -0700 Markus Scherer wrote:

> I would not spend any time specifying intricate rules for unpaired
> surrogates in 16-bit strings, or out-of-range values in 32-bit
> strings. Most processing will treat them like unassigned characters,
> like U+50005, with only default behaviors.

The core problem here is that many editors will not allow one to delete just a non-initial character from a grapheme cluster. I fear there may be editors that don't even allow one to delete the final character. This may not be a problem when one works with a small set of grapheme clusters, as in French or German, or possibly even Vietnamese, but becomes a problem when working with such a large set that the notion of them being user-perceived characters strains credulity.

A stray U+50005 before a combining mark would also be fiddly to get rid of, but even if the editor does not allow the entry of arbitrary scalar values, a user might fix the problem by creating an HTML file containing the character and then copying the character from the HTML file to a find and replace command. This trick is unlikely to work for a lone surrogate.

Richard.

From verdy_p at wanadoo.fr  Sun Oct 4 14:48:12 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sun, 4 Oct 2015 21:48:12 +0200
Subject: Deleting Lone Surrogates
In-Reply-To: <20151004203025.605f0ae6@JRWUBU2>
References: <20151004140201.21c9f941@JRWUBU2> <20151004203025.605f0ae6@JRWUBU2>
Message-ID:

2015-10-04 21:30 GMT+02:00 Richard Wordingham <richard.wordingham at ntlworld.com>:

> On Sun, 4 Oct 2015 15:44:32 +0200
> Mark Davis ☕️ wrote:
> > When I use http://unicode.org/cldr/utility/breaks.jsp, it does show
> > the sequence ????? as just two grapheme clusters.
>
> But that's the sequence <U+1148F, U+FFFD, U+114BA>, which has no lone
> surrogates at all! (I had to look at the raw email file to be sure of
> what the text was - my email client displays U+FFFD and malformed
> alleged UTF-8 the same.)

Mark just said that it was what was shown, i.e. the lone surrogate got treated as U+FFFD.

However, my opinion is that ????? (using U+FFFD substitution) gives 2 grapheme clusters; I would prefer a solution that gives 3 grapheme clusters, as if the lone surrogate were a line-break control, so that the third character (combining, but just after the lone surrogate) will not combine with it but will be handled as a defective combining sequence with no starter at all before it.

From richard.wordingham at ntlworld.com  Sun Oct 4 16:18:42 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sun, 4 Oct 2015 22:18:42 +0100
Subject: Deleting Lone Surrogates
In-Reply-To: <56117E4F.9010300@ix.netcom.com>
References: <20151004140201.21c9f941@JRWUBU2> <56117E4F.9010300@ix.netcom.com>
Message-ID: <20151004221842.1b90cdfe@JRWUBU2>

On Sun, 4 Oct 2015 12:30:23 -0700 "Asmus Freytag (t)" wrote:

> If you have a bug that doesn't let you enter a sequence without
> creating a lone surrogate followed by a combining mark, that's a
> bug...

Unfortunately, the bug appears to be in an ill-defined interface in which I have observed regression even within the BMP. We've discussed the ambiguity of 'delete one character' in the context of normalisation before on this list, and the surest solution seemed to be for the application to surrender some control of its 'backing store' to the input method. It's conceivable that the input methods that are compatible for the BMP are incompatible in the supplementary planes.

For now, I'm going to have to either work round the problem by using dead keys instead or be thankful that the application hasn't caught up with Unicode 7.0.

Richard.

From asmus-inc at ix.netcom.com  Sun Oct 4 16:29:16 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Sun, 4 Oct 2015 14:29:16 -0700
Subject: Deleting Lone Surrogates
In-Reply-To: <20151004203802.2189da64@JRWUBU2>
References: <20151004140201.21c9f941@JRWUBU2> <20151004203802.2189da64@JRWUBU2>
Message-ID: <56119A2C.3060907@ix.netcom.com>

An HTML attachment was scrubbed...

From richard.wordingham at ntlworld.com  Sun Oct 4 16:35:56 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sun, 4 Oct 2015 22:35:56 +0100
Subject: Deleting Lone Surrogates
In-Reply-To:
References: <20151004140201.21c9f941@JRWUBU2> <20151004203025.605f0ae6@JRWUBU2>
Message-ID: <20151004223556.770f2c68@JRWUBU2>

On Sun, 4 Oct 2015 21:48:12 +0200 Philippe Verdy wrote:

> 2015-10-04 21:30 GMT+02:00 Richard Wordingham <richard.wordingham at ntlworld.com>:
> > On Sun, 4 Oct 2015 15:44:32 +0200
> > Mark Davis ☕️ wrote:
> > > When I use http://unicode.org/cldr/utility/breaks.jsp, it does
> > > show the sequence ????? as just two grapheme clusters.
> > But that's the sequence <U+1148F, U+FFFD, U+114BA>, which has no
> > lone surrogates at all!
> Mark just said that it was what was shown, i.e. the lone surrogate got
> treated as U+FFFD.

That's not what the English says, and I'm surprised if that's what a literal translation into French means. I do half suspect that he actually tried to post a lone surrogate.

> However, my opinion is that ????? (using U+FFFD substitution) gives 2
> grapheme clusters; I would prefer a solution that gives 3 grapheme
> clusters, as if the lone surrogate were a line-break control, so that
> the third character (combining, but just after the lone surrogate)
> will not combine with it but will be handled as a defective combining
> sequence with no starter at all before it.
I'd much prefer to be able to delete the first character of a grapheme cluster. It's annoying to have to retype 4 characters because one's mistyped the first of the 4 characters in a grapheme cluster. Removing the restriction would be much more useful.

Richard.

From asmus-inc at ix.netcom.com  Sun Oct 4 17:34:13 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Sun, 4 Oct 2015 15:34:13 -0700
Subject: Deleting Lone Surrogates
In-Reply-To: <20151004223556.770f2c68@JRWUBU2>
References: <20151004140201.21c9f941@JRWUBU2> <20151004203025.605f0ae6@JRWUBU2> <20151004223556.770f2c68@JRWUBU2>
Message-ID: <5611A965.3010304@ix.netcom.com>

An HTML attachment was scrubbed...

From richard.wordingham at ntlworld.com  Sun Oct 4 17:54:30 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sun, 4 Oct 2015 23:54:30 +0100
Subject: NNBSP and Word Boundaries
In-Reply-To:
References: <20151001172633.2a72f48f@JRWUBU2>
Message-ID: <20151004235430.74bca234@JRWUBU2>

On Fri, 2 Oct 2015 09:25:01 +0200 Mark Davis ☕️ wrote:

> We add:
>
> WB13c Mongolian_Letter × NNBSP
> WB13d NNBSP × Mongolian_Letter
>
> *If* we want to also change behavior on the other side of the NNBSP,
> whenever the Mongolian_Letter and NNBSP occur in sequence, we add 2
> additional rules (with the appropriate values for ..., like Numeric)
>
> WB13c Mongolian_Letter NNBSP × (...)
> WB13d (...) × NNBSP Mongolian_Letter

I'll assume the last two are meant to be WB13e and WB13f.

We can achieve the effects down to the first WB13d simply by changing NNBSP from XX to MidNumLet. This would also provide a proper "espace fine" for French use within numbers (https://www.druide.com/enquetes/pour-des-espaces-ins%C3%A9cables-impeccables) to separate groups of 3 digits. This needs *no* extra rules.

Now for combined numbers and letters, we might consider adding the two rules:

WB12a Numeric MidNumLet × AHLetter
WB12b Numeric × MidNumLet AHLetter

I think we should go the whole hog, and instead have

WB12c (Numeric|AHLetter) MidNumLetQ × (Numeric|AHLetter)
WB12d (Numeric|AHLetter) × MidNumLetQ (Numeric|AHLetter)

Perhaps there are good reasons against them - I'm not aware of any. (I don't think it is wrong to treat "no.2" as a single word.) These rules would make the abbreviated names of a good many Thai forms (e.g. ??.?, a marriage certificate) into a single word.

WB12c and WB12d overlap with WB6, WB7, WB11 and WB12, which could be slightly simplified.

Richard.

From richard.wordingham at ntlworld.com  Sun Oct 4 18:14:13 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 5 Oct 2015 00:14:13 +0100
Subject: Deleting Lone Surrogates
In-Reply-To: <56119A2C.3060907@ix.netcom.com>
References: <20151004140201.21c9f941@JRWUBU2> <20151004203802.2189da64@JRWUBU2> <56119A2C.3060907@ix.netcom.com>
Message-ID: <20151005001413.74e6dae4@JRWUBU2>

On Sun, 4 Oct 2015 14:29:16 -0700 "Asmus Freytag (t)" wrote:

> On 10/4/2015 12:38 PM, Richard Wordingham wrote:
> The problem you are trying to solve is to allow editing on
> the code point level, or, if you will, the keystroke level.
> Generally, there will be a sweet spot for each language (and each
> user) with respect to what to erase or undo.
> For sequences that belong to a given language, you can pick the
> behavior that makes most sense in them, but for lone surrogates, by
> definition you are dealing with broken text that doesn't follow any
> conventions.

Who's 'you'? Customisation is frequently not available. In fact, I don't recall seeing it on offer.
> It should also be something that doesn't occur commonly. So, for all
> of those reasons, I see no particular problem with giving that a
> "generic" behavior, which could be that of deleting the entire
> combining sequence; especially if your interface normally deletes
> sequences as a unit.
> But in any case, the minimal requirement on an editor is that it lets
> you delete (and then retype) enough text to get it back to an
> uncorrupted state.

In the problem I hit, I would nearly be left with two options - never having CANDRABINDU, or always having it preceded by the lone surrogate. Whenever I enter CANDRABINDU, it is preceded by the lone surrogate. Consequently, the option of retyping the sequence is of no avail. Fortunately, in the application where I met the problem, the lone surrogates, and nothing else, get deleted when the file is saved. The problem could very easily be a lot worse.

----

> Catch-22 here. In filtering input to the dialog to prevent it from
> being used to corrupt text, you prevent it from being used to repair
> text.

Interesting. Not very different to having a very roll-stable aeroplane. If you ever do end up upside-down, you have a big problem.

Richard.

From asmus-inc at ix.netcom.com  Sun Oct 4 18:57:15 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Sun, 4 Oct 2015 16:57:15 -0700
Subject: Deleting Lone Surrogates
In-Reply-To: <20151005001413.74e6dae4@JRWUBU2>
References: <20151004140201.21c9f941@JRWUBU2> <20151004203802.2189da64@JRWUBU2> <56119A2C.3060907@ix.netcom.com> <20151005001413.74e6dae4@JRWUBU2>
Message-ID: <5611BCDB.6090903@ix.netcom.com>

An HTML attachment was scrubbed...

From richard.wordingham at ntlworld.com  Sun Oct 4 19:24:40 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 5 Oct 2015 01:24:40 +0100
Subject: Deleting Lone Surrogates
In-Reply-To: <5611A965.3010304@ix.netcom.com>
References: <20151004140201.21c9f941@JRWUBU2> <20151004203025.605f0ae6@JRWUBU2> <20151004223556.770f2c68@JRWUBU2> <5611A965.3010304@ix.netcom.com>
Message-ID: <20151005012440.11ca7b17@JRWUBU2>

On Sun, 4 Oct 2015 15:34:13 -0700 "Asmus Freytag (t)" wrote:

> On 10/4/2015 2:35 PM, Richard Wordingham wrote:
>> I'd much prefer to be able to delete the first character of a grapheme
>> cluster. It's annoying to have to retype 4 characters because one's
>> mistyped the first of the 4 characters in a grapheme cluster.
>> Removing the restriction would be much more useful.
> That makes sense for common typos, less so, for uncommon (hopefully)
> data corruption.

Allowing access within the cluster is generally useful. Providing more access just makes it easier to repair things. One problem is that there isn't a 'suspend shaping' option to allow one to see what one is doing. This matters when canonical combining classes are not available to sort out the ordering of components.

> For some languages, you'll be typing several keystrokes, even if it's
> a single code point; there seems to be limited desire to allow you to
> "edit" the keystrokes.

The creators of the application do not know how many keystrokes were used. A multi-platform application is not likely to take note of what keys were pressed even when this information is available.

> For other languages I would expect a UI design
> to cater to what local custom prefers.

Local custom? 'Local custom' is usually one of the following:

a) pen and ink, possibly with scraper.
b) typewriter and tippex
c) Hacked ASCII (and similar)

Only with complex ligatures would you not have access to each character. The only parallels to what happens now that I can think of that might count as 'custom' are:

1) European 8-bit codes, where letter plus diacritic is treated as a unit.
2) Korean, where one couldn't chop and change the individual jamo.
3) Thai, where a tone mark can severely restrict what scraping can do.

A UI design might respond to loud enough howls of user protest. You may recall Thai howls of protest when the ability to independently delete preposed vowels was lost. Thai may have some complex vowel symbols, but as far as the grapheme clusters go, *Thai* doesn't get more complicated than CVT (consonant, vowel (just one!) and tone). Some of the minority languages in the Thai script might be a bit more complicated.

I do recall SIL's split cursor, which attempted to address the difficulties of navigating through a stack of diacritics. I miss it, even though I never got to grips with all its subtleties.

What I believe is much more the case is that Unicode encourages 'one size fits all'. There are massive *translation* efforts for user interfaces. As to other parts of the text input/output, they are usually separate from the applications. The keyboard is almost totally independent of the application. Fonts are restricted to attempts to provide adequate coverage, but the ideal is that the user provides his own.

I think the LibreOffice search and replace interface says a lot. It has visible support for Japanese - they holler and may well add their own support into the core project - and there are some CTL options which make best sense from the point of view of the Arabic script. The limitations on editing are one of the few places where the UI is under the tight control of the programmers. By and large, they seem to be influenced by a few sources, such as the Unicode technical reports. Refutation awaited.

Now an attitude of 'one size fits all' does get things done. It might be a bit rough, but it's a lot better than nothing.

Richard.

From richard.wordingham at ntlworld.com  Sun Oct 4 19:29:05 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 5 Oct 2015 01:29:05 +0100
Subject: Deleting Lone Surrogates
In-Reply-To: <5611BCDB.6090903@ix.netcom.com>
References: <20151004140201.21c9f941@JRWUBU2> <20151004203802.2189da64@JRWUBU2> <56119A2C.3060907@ix.netcom.com> <20151005001413.74e6dae4@JRWUBU2> <5611BCDB.6090903@ix.netcom.com>
Message-ID: <20151005012905.6fcbb062@JRWUBU2>

On Sun, 4 Oct 2015 16:57:15 -0700 "Asmus Freytag (t)" wrote:

> On 10/4/2015 4:14 PM, Richard Wordingham wrote:
> respect to what to erase or undo.
>>> For sequences that belong to a given language, you can pick the
>>> behavior that makes most sense in them, but for lone surrogates, by
>>> definition you are dealing with broken text that doesn't follow any
>>> conventions.
>> Who's 'you'? Customisation is frequently not available. In fact, I
>> don't recall seeing it on offer.
> The UI developer.
> And there's nothing Unicode can do about lack of customizability.

Actually, there is. I believe suggestions and recommendations in the technical reports are quite influential.

Richard.
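As a sketch of the two deletion granularities debated above, assuming Java's built-in java.text.BreakIterator (which only approximates default extended grapheme clusters; ICU's BreakIterator tracks UAX #29 more closely): an editor could delete a whole cluster per keypress, or expose the code-point-level access Richard asks for. The class and method names are invented for illustration.

import java.text.BreakIterator;

public class BackspaceSketch {
    // Typical editor behavior: delete the whole final grapheme cluster.
    static String deleteLastCluster(String s) {
        if (s.isEmpty()) return s;
        BreakIterator b = BreakIterator.getCharacterInstance();
        b.setText(s);
        int start = b.preceding(s.length());
        return s.substring(0, start);
    }

    // Code-point-level access: remove just the final code point, so a
    // mistyped character inside a cluster can be fixed without retyping
    // the rest of the cluster.
    static String deleteLastCodePoint(String s) {
        if (s.isEmpty()) return s;
        int cp = s.codePointBefore(s.length());
        return s.substring(0, s.length() - Character.charCount(cp));
    }
}

Applied to base + two combining marks, deleteLastCluster removes all three code points at once, while deleteLastCodePoint peels off only the final mark.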
From naz at gassiep.com  Mon Oct 5 02:39:23 2015
From: naz at gassiep.com (Naz Gassiep)
Date: Mon, 5 Oct 2015 18:39:23 +1100
Subject: Proposals for Arabic honorifics
Message-ID: <5612292B.8040208@gassiep.com>

Hi all,

We are considering writing a proposal for Arabic honorifics which are missing from Unicode. There are already a few in there, notably U+FDFA and U+FDFB. There are two existing proposals, L2/14-147 and L2/14-152, which each propose additions. L2/14-147 proposes seventeen new characters and L2/14-152 proposes a further two. There are a few other characters that are not included in these proposals, and I was considering preparing a proposal of my own. I will work with a team of people who are willing to contribute time to this work.

We are considering two options:

1. Prepare an additional proposal for the characters that were missing from the existing spec and also from the two proposals mentioned above.
2. Prepare a collating proposal which rolls the two proposals, as well as the others that we feel are missing, into a single proposal.

Currently, we favour the second option. We would ensure that full descriptions, names, character properties, and detailed examples are provided for each character to substantiate its use in modern plain text. We would also suggest code points in line with the existing proposal L2/14-147.

We don't want to step on the toes of the original submitters, Roozbeh Pournader or Lateef Sagar Shaikh. We wish to be clear that we will draw on their existing proposals to the maximum extent possible to ensure that we do not submit a conflicting proposal, but a superset proposal that incorporates their proposals as well as the additional characters we have identified. We have evaluated these two, and a true superset proposal is possible such that no conflicts between either those two proposals or our own will materialize.

Are there any issues that we may face in preparing and submitting our proposal? Any guidance from this mailing list would be highly valued.

Many thanks,
- Naz.

From duerst at it.aoyama.ac.jp  Mon Oct 5 06:50:14 2015
From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=)
Date: Mon, 5 Oct 2015 20:50:14 +0900
Subject: Deleting Lone Surrogates
In-Reply-To: <56117E4F.9010300@ix.netcom.com>
References: <20151004140201.21c9f941@JRWUBU2> <56117E4F.9010300@ix.netcom.com>
Message-ID: <561263F6.1000308@it.aoyama.ac.jp>

On 2015/10/05 04:30, Asmus Freytag (t) wrote:
> On 10/4/2015 6:02 AM, Richard Wordingham wrote:
>> In the absence of a specific tailoring, is the combination of a lone
>> surrogate and a combining mark a user-perceived character? Does a lone
>> surrogate constitute a user-perceived character?
>
> In an editing interface, a lone surrogate should be a user perceived character,
> as otherwise you won't be able to manually delete it. Markus suggests that it be
> treated like an unassigned code point.

In an editing tool (of which an editing interface is a part), a lone surrogate should just be removed! Apparently, that's what happens in Richard's case, but only eventually.

Regards, Martin.
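A hedged sketch of what Martin's "just remove it" policy could look like, again in Java with an invented helper name. The U+FFFD branch shows the alternative of repairing visibly rather than silently; whether such removal should ever happen silently is debated later in the thread.

public class ScrubSketch {
    // Drop each unpaired surrogate, or replace it with U+FFFD so the
    // repair stays visible to the user instead of happening silently.
    static String scrubLoneSurrogates(String s, boolean markWithFFFD) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                out.append(c).append(s.charAt(++i));  // keep well-formed pairs
            } else if (Character.isSurrogate(c)) {
                if (markWithFFFD) out.append('\uFFFD');  // else: drop it
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}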
From samjnaa at gmail.com Mon Oct 5 07:14:52 2015 From: samjnaa at gmail.com (Shriramana Sharma) Date: Mon, 5 Oct 2015 17:44:52 +0530 Subject: Unicode in passwords In-Reply-To: <7BD7361C-AAFD-40D9-9D14-5B41295EAE6D@lboro.ac.uk> References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <000601d0fbff$42881070$c7983150$@gmail.com> <20151001083322.5440cc2a@JRWUBU2> <7BD7361C-AAFD-40D9-9D14-5B41295EAE6D@lboro.ac.uk> Message-ID: I recently came across this bug report where a filesystem encrypted with a Cyrillic script password could not be decrypted at boot time: https://bugzilla.redhat.com/show_bug.cgi?id=681250 -- Shriramana Sharma ???????????? ???????????? From marc.blanchet at viagenie.ca Mon Oct 5 07:30:58 2015 From: marc.blanchet at viagenie.ca (Marc Blanchet) Date: Mon, 05 Oct 2015 08:30:58 -0400 Subject: Unicode in passwords In-Reply-To: References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <000601d0fbff$42881070$c7983150$@gmail.com> <20151001083322.5440cc2a@JRWUBU2> <7BD7361C-AAFD-40D9-9D14-5B41295EAE6D@lboro.ac.uk> Message-ID: On 5 Oct 2015, at 8:14, Shriramana Sharma wrote: > I recently came across this bug report where a filesystem encrypted > with a Cyrillic script password could not be decrypted at boot time: > > https://bugzilla.redhat.com/show_bug.cgi?id=681250 And? From what I understand, this is related to the fact that the OS has two levels of boot/console/installation scripts and the first level is very basic regarding i18n (i.e. only us-ascii is guaranteed to work). Marc. > > > -- > Shriramana Sharma ???????????? > ???????????? From samjnaa at gmail.com Mon Oct 5 08:42:05 2015 From: samjnaa at gmail.com (Shriramana Sharma) Date: Mon, 5 Oct 2015 19:12:05 +0530 Subject: Unicode in passwords In-Reply-To: References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <000601d0fbff$42881070$c7983150$@gmail.com> <20151001083322.5440cc2a@JRWUBU2> <7BD7361C-AAFD-40D9-9D14-5B41295EAE6D@lboro.ac.uk> Message-ID: On 10/5/15, Marc Blanchet wrote: > On 5 Oct 2015, at 8:14, Shriramana Sharma wrote: > >> https://bugzilla.redhat.com/show_bug.cgi?id=681250 > > And? Well the OP did say: I'm researching potential problems and best practices for password policies that allow non-Latin-1 Unicode characters. The link seemed valid food for the research, as was offered FWIW. -- Shriramana Sharma ???????????? ???????????? From marc.blanchet at viagenie.ca Mon Oct 5 08:45:17 2015 From: marc.blanchet at viagenie.ca (Marc Blanchet) Date: Mon, 05 Oct 2015 09:45:17 -0400 Subject: Unicode in passwords In-Reply-To: References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <000601d0fbff$42881070$c7983150$@gmail.com> <20151001083322.5440cc2a@JRWUBU2> <7BD7361C-AAFD-40D9-9D14-5B41295EAE6D@lboro.ac.uk> Message-ID: On 5 Oct 2015, at 9:42, Shriramana Sharma wrote: > On 10/5/15, Marc Blanchet wrote: >> On 5 Oct 2015, at 8:14, Shriramana Sharma wrote: >> >>> https://bugzilla.redhat.com/show_bug.cgi?id=681250 >> >> And? > > Well the OP did say: > > > I'm researching potential problems and best practices for password > policies that allow non-Latin-1 Unicode characters. > > > The link seemed valid food for the research, as was offered FWIW. Sure, but roughly one could conclude from the bug report that only allowing us-ascii is safe, which may not be what could be "best practices" depending on the point of view. Marc.
> > -- > Shriramana Sharma ???????????? > ???????????? From samjnaa at gmail.com Mon Oct 5 09:47:03 2015 From: samjnaa at gmail.com (Shriramana Sharma) Date: Mon, 5 Oct 2015 20:17:03 +0530 Subject: Unicode in passwords In-Reply-To: References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <000601d0fbff$42881070$c7983150$@gmail.com> <20151001083322.5440cc2a@JRWUBU2> <7BD7361C-AAFD-40D9-9D14-5B41295EAE6D@lboro.ac.uk> Message-ID: I had hoped it would be obvious that my reply was directed not at the "best practices" part of the OP, but at the "potential problems" part of it... In any case, I have nothing further to say on this topic. -- Shriramana Sharma ???????????? ???????????? From verdy_p at wanadoo.fr Mon Oct 5 09:51:25 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 5 Oct 2015 16:51:25 +0200 Subject: Deleting Lone Surrogates In-Reply-To: <561263F6.1000308@it.aoyama.ac.jp> References: <20151004140201.21c9f941@JRWUBU2> <56117E4F.9010300@ix.netcom.com> <561263F6.1000308@it.aoyama.ac.jp> Message-ID: Not silently! Even if this removal is required to go on editing, the user must be notified, as it may occur in unedited parts of the file (and it may be the sign that the document is not fully plain text, so the user should not save the edited file). If this is caused by a quirk in the user input (a defect of the input mode or keyboard layout), there should be a notification. But for a general purpose editor that allows editing files including binary ones (e.g. Emacs), it is best to NOT drop those lone surrogates at all, and effectively treat them in isolation for ALL purposes (the DELETE key should not delete more than this lone surrogate; it may be necessary to adjust the cursor position after the deletion if the editor does not support placing the cursor in the middle of a combining sequence, but a LONE surrogate + a combining character should still be treated as two separate clusters, and the cursor or selection should be placeable between the lone surrogate and the combining mark). Note that file formats that contain binary parts and plain text parts do exist, e.g. media files that contain a final plain text section for metadata or for some XML data signature: it is safe to edit that final part in a text editor, provided that it does not silently change the encoding of the binary part. In summary, I do not like the idea of silently dropping lone surrogates in editors. If the editor needs it because it cannot safely handle binary parts, the notification will say to the user that he should not use that editor and choose something else, or it will allow the user to select another appropriate file encoding to edit the file safely. The user should not save the file blindly, as it will be corrupted silently. Doing otherwise would be a security issue. And this remark extends to all other protocols using plain text input; lone surrogates should not be dropped silently (unless explicitly requested, for example in a maintenance cleanup or repair): if the lone surrogate violates the further processing, the only safe option is to reject the whole text and report the error if text data is required but missing. 2015-10-05 13:50 GMT+02:00 Martin J. Dürst : > On 2015/10/05 04:30, Asmus Freytag (t) wrote: > >> On 10/4/2015 6:02 AM, Richard Wordingham wrote: >> >>> In the absence of a specific tailoring, is the combination of a lone >>> surrogate and a combining mark a user-perceived character?
Does a lone >>> surrogate constitute a user-perceived character? >>> >> >> In an editing interface, a lone surrogate should be a user-perceived >> character, >> as otherwise you won't be able to manually delete it. Markus suggests >> that it be >> treated like an unassigned code point. >> > > In an editing tool (of which an editing interface is a part), a lone > surrogate should just be removed! Apparently, that's what happens in > Richard's case, but only eventually. > > Regards, Martin. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From marc.blanchet at viagenie.ca Mon Oct 5 09:59:38 2015 From: marc.blanchet at viagenie.ca (Marc Blanchet) Date: Mon, 05 Oct 2015 10:59:38 -0400 Subject: Unicode in passwords In-Reply-To: References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <000601d0fbff$42881070$c7983150$@gmail.com> <20151001083322.5440cc2a@JRWUBU2> <7BD7361C-AAFD-40D9-9D14-5B41295EAE6D@lboro.ac.uk> Message-ID: On 5 Oct 2015, at 10:47, Shriramana Sharma wrote: > I had hoped it would be obvious that my reply was directed not at the "best > practices" part of the OP, but at the "potential problems" part of > it... Sure. My comment was also just informative, not targeted at your comment, but at the fact that "best practices" may not be "us-ascii" only if you want to be i18n. Marc. > In any case, I have nothing further to say on this topic. > > -- > Shriramana Sharma ???????????? > ???????????? From bortzmeyer at nic.fr Mon Oct 5 10:12:00 2015 From: bortzmeyer at nic.fr (Stephane Bortzmeyer) Date: Mon, 5 Oct 2015 17:12:00 +0200 Subject: Unicode in passwords In-Reply-To: <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> Message-ID: <20151005151200.GA7379@laperouse.bortzmeyer.org> On Wed, Sep 30, 2015 at 04:15:30PM -0700, Clark S. Cox III wrote a message of 73 lines which said: > You really wouldn't want "Schlüssel" and "Schlüssel" being different > passwords, would you? (assuming that my mail client and/or OS is not > interfering, the first is NFC, while the second is NFD) Hence RFC 7613, mentioned already here by Marc Blanchet, which you really must read if you're interested in Unicode passwords. In that case, the RFC is clear: NFC mandatory (and UTF-8 encoding). 4. Normalization Rule: Unicode Normalization Form C (NFC) MUST be applied to all characters. From doug at ewellic.org Mon Oct 5 10:24:57 2015 From: doug at ewellic.org (Doug Ewell) Date: Mon, 05 Oct 2015 08:24:57 -0700 Subject: Acquiring DIS 10646 Message-ID: <20151005082457.665a7a7059d7ee80bb4d670165c8327d.c32c0ba233.wbe@email03.secureserver.net> I too am puzzled as to what DIS 10646 and C1 control pictures have to do with each other. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From asmus-inc at ix.netcom.com Mon Oct 5 10:57:52 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Mon, 5 Oct 2015 08:57:52 -0700 Subject: Deleting Lone Surrogates In-Reply-To: References: <20151004140201.21c9f941@JRWUBU2> <56117E4F.9010300@ix.netcom.com> <561263F6.1000308@it.aoyama.ac.jp> Message-ID: <56129E00.40909@ix.netcom.com> On 10/5/2015 7:51 AM, Philippe Verdy wrote: > Not silently!
Even if this removal is required to go on editing, the > user must be notified, as it may occur in unedited parts of the > file (and it may be the sign that the document is not fully plain > text, so the user should not save the edited file). > If this is caused by a quirk in the user input (a defect of the input > mode or keyboard layout), there should be a notification. As long as we are discussing, as Richard is, recommendations for implementers, I fully agree with Philippe. Manually editing surrogate corruptions might be something that could be relegated to an "expert mode", but automatic correction without user confirmation ("May we clean up your file?") would indeed be spooky and dangerous. A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenwhistler at att.net Mon Oct 5 12:11:39 2015 From: kenwhistler at att.net (Ken Whistler) Date: Mon, 5 Oct 2015 10:11:39 -0700 Subject: Scope of Unicode Character Properties (was: Re: Deleting Lone Surrogates) In-Reply-To: <20151004203025.605f0ae6@JRWUBU2> References: <20151004140201.21c9f941@JRWUBU2> <20151004203025.605f0ae6@JRWUBU2> Message-ID: <5612AF4B.9040406@att.net> Section 3.5, Properties, of the standard attempts to address this. "Code point properties" are properties of the code points, per se, and clearly do have all code points (U+0000..U+10FFFF) in their scope. An example is the Surrogate code point property, which wouldn't make much sense if it didn't apply to surrogate code points! "Encoded character properties" are properties of the characters themselves -- attributes like Ideographic or Numeric_Value. For completeness, those are given *default* values for all reserved code points (and for noncharacter and PUA code points). In principle, the scope should be all Unicode scalar values: U+0000..U+D7FF, U+E000..U+10FFFF, because it doesn't make much sense to talk about character properties for code points that are ill-formed and which cannot ever actually represent a character. However, in practice, it is simplest to extend the *default* values of encoded character properties to the surrogate code points, so that in the cases where they occur in ill-formed text, APIs and applications have some hope of doing something useful, rather than just reacting exceptionally to featureless singularities embedded in text. Hence, the bullet in the text of the standard: * For each encoded character property there is a mapping from every code point to some value in the set of values associated with that property. There is nothing in the standard, as I read it, that imposes a conformance requirement on any process that would *require* it to interpret an isolated surrogate code point and give it a particular property value. However, it would be reasonable (and permitted) for an API to actually report a default value for a surrogate code point (i.e., treating it more or less like the reserved code point U+50005 that Markus mentioned). Such behavior in a character property API is likely to result in more graceful behavior than simply throwing exceptions. --Ken On 10/4/2015 12:30 PM, Richard Wordingham wrote: > Do all Unicode character properties extend to all codepoints? If not, > how does one tell which do and which don't? ... > Richard.
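Java's built-in Character class is one API that already takes this route: it reports a general category for every code point rather than throwing. A minimal sketch (getType(int) and isDefined(int) are standard java.lang.Character methods):

    public class PropertyDefaults {
        public static void main(String[] args) {
            // A surrogate code point has a perfectly well-defined code point
            // property (gc=Cs); an unassigned code point reports the default.
            System.out.println(Character.getType(0xD800) == Character.SURROGATE);   // true
            System.out.println(Character.getType(0x50005) == Character.UNASSIGNED); // true
            System.out.println(Character.isDefined(0x50005));                       // false
        }
    }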
From kenwhistler at att.net Mon Oct 5 14:32:45 2015 From: kenwhistler at att.net (Ken Whistler) Date: Mon, 5 Oct 2015 12:32:45 -0700 Subject: Why Nothing Ever Goes Away (was: Re: Acquiring DIS 10646) In-Reply-To: <20151005082457.665a7a7059d7ee80bb4d670165c8327d.c32c0ba233.wbe@email03.secureserver.net> References: <20151005082457.665a7a7059d7ee80bb4d670165c8327d.c32c0ba233.wbe@email03.secureserver.net> Message-ID: <5612D05D.7000407@att.net> On 10/5/2015 8:24 AM, Doug Ewell wrote: > I too am puzzled as to what DIS 10646 and C1 control pictures have to do > with each other. > What an *excellent* cue to start a riff on arcane Unicode history! First, let me explain what I think Sean Leonard's concern here is. 1. On 10/4/2015 5:30 AM, Sean wrote: "I proposed adding C1 Control Pictures to Unicode. ... The requirement is that all glyphs for U+0000 - U+00FF be graphically distinct." Ah, but Sean has noticed that of all the representative glyphs we use in the current code charts for C1 control codes, exactly *3* of them share an odd glyph. U+0080, U+0081, and U+0099 use the same dotted box with an "XXX" in it. That creates a conflict with the requirement that Sean has stated for glyphs for *graphic symbols for* control codes, presumably for addition to the 2400 Control Pictures block and some extensions elsewhere, each with a visually distinct representation. 2. The Unicode code charts are (deliberately) vague about U+0080, U+0081, and U+0099. All other C1 control codes have aliases to the ISO 6429 set of control functions, but in ISO 6429, those three control codes don't have any assigned functions (or names). And because the C1 aliases in the Unicode code charts are (deliberately) based on ISO 6429, U+0080, U+0081, and U+0099 are only identified as "<control>", with no alias in the charts, and with an arbitrary "XXX" box glyph. 3. Concerned about this gap, Sean did some due diligence research on the web, and turned up documentation pages such as: http://utopia.knoware.nl/users/eprebel/Communication/CharacterSets/Controls.html Pertinent to this discussion is the section for C1 on that page which (incorrectly) includes "DIS 10646" in the list of "Standards". More to the point, the entries for the 3 C1 code points in question are documented as: 08/00 ... PAD PADding character (only in DIS 10646) 08/01 ... HOP High Octet Preset (only in DIS 10646) ... 09/09 ... SGCI Single Graphic Character Introducer (only in DIS 10646) Aha! Hence the need to track down a copy of DIS 10646 (meaning in actuality, the appropriately numbered WG2 N666, "DIS 10646", dated November 4, 1990). That was actually what became DIS 1, the DIS that failed, the DIS that led to the *second* DIS 10646, which was the basis of the Unicode/10646 merger. But I digress... ;-) 4. O.k., so with that connection out of the way, I can proceed to the topic of this thread: Why Nothing Ever Goes Away. PAD, HOP, and SGCI were arcane, proposed architectural additions to the early drafts of 10646, from the days when 10646 was still slavishly following the ISO 2022 framework, and was avoiding C0 and C1 byte values in all representations, including single-, double-, triple-, and quadruple-byte forms for characters. HOP was one of those half-baked terminal protocol byte compression concoctions. The idea was that since some commonly used blocks of characters would require double-byte representation but would all share the same "high octet", you could send a HOP and then a bunch of low octets down the line. In effect, it was intended as a script switcher.
SGCI was complementary to that. It would let you introduce a sequence of multiple octets for a single character, without having to switch out of your high octet preset mode. PAD I forget the exact details of. Something to do with padding out character representations into fixed length. All of these were firmly rejected in the merger discussions and the failed DIS vote. Actually, they were down in the noise compared to major issues like CJK plane swapping and such, but there clearly was no need for 10646 to invent new control functions like these, and the early drafts of the Unicode Standard had nothing of the sort. So these were gone in DIS 1.2 for 10646. They were *never* published as part of ISO 10646-1:1993 (or any later edition). Nor were they ever published in an ISO control function standard. Nor were they ever published in the Unicode Standard, of course. They were never standard *anything* -- just ill-advised concept functions that later got dropped in the drafts. But wait! If these disappeared from any standard draft way back in 1991(!), why are we still talking about them? Why are they still documented on web pages for C1 control characters in 2015, 24 years later? Funny you should ask! The problem is that they went viral. And that in an age before anybody really knew what "going viral" even meant. ;-) The first problem is that a bunch of mnemonics for characters were published in an RFC. And those mnemonics included characters from early drafts of 10646. The notorious document in question is RFC 1345: Simonsen, K., "Character Mnemonics & Character Sets", June 1992. Go ahead, it is still there: https://tools.ietf.org/rfc/rfc1345.txt And that has entries for the non-existent control codes, which by the time RFC 1345 was published, had *already* been removed from the 10646 drafts. To wit: PA 0080 PADDING CHARACTER (PAD) HO 0081 HIGH OCTET PRESET (HOP) GC 0099 SINGLE GRAPHIC CHARACTER INTRODUCER (SGCI) RFC 1345 was, in turn, referenced by other important IETF documents, including the important RFC 2070, "Internationalization of the Hypertext Markup Language", which defines the syntax for character entity names. Entity names for PAD, HOP, and SGCI then found their way into Java and other implementations. They ended up referenced in tables supporting regular expressions. And so on. Somehow they had become the walking dead control functions. This came back around to the Unicode Standard about the time the U+1F514 BELL and U+0007 alias BELL name collision issue hit the fan. The UTC response to this problem was to augment the formal name aliases to include all widely used control function names and abbreviations, so that testing for name collisions in that name space would prevent any future BELL/BELL issues. See, in particular, the related PRI on this topic for Unicode 6.1.0: http://www.unicode.org/review/pri202/ which explicitly mentions U+0080, U+0081, and U+0099 and their aliases, because of a need for backward compatibility to then-existing usage in Perl 5. The outcome of that PRI was to add a bunch of formal name aliases, *including* ones for PAD, HOP, and SGCI (or SGC). To wit, from NameAliases.txt: ======================================================= # PADDING CHARACTER and HIGH OCTET PRESET represent # architectural concepts initially proposed for early # drafts of ISO/IEC 10646-1. They were never actually # approved or standardized: hence their designation # here as the "figment" type. 
Formal name aliases # (and corresponding abbreviations) for these code # points are included here because these names leaked # out from the draft documents and were published in # at least one RFC whose names for code points was # implemented in Perl regex expressions. 0080;PADDING CHARACTER;figment 0080;PAD;abbreviation 0081;HIGH OCTET PRESET;figment 0081;HOP;abbreviation # SINGLE GRAPHIC CHARACTER INTRODUCER is another # architectural concept from early drafts of ISO/IEC 10646-1 # which was never approved and standardized. 0099;SINGLE GRAPHIC CHARACTER INTRODUCER;figment 0099;SGC;abbreviation ============================================================= Because of stability guarantees, however, NameAliases.txt is a write-once, read-only, unerasable file. For better or for worse, we are now stuck forever with those name aliases for U+0080, U+0081, and U+0099, *even though* the relevant control functions were never, ever actually standardized or used anywhere. Think of them as just part of the arcane mysteries now: odd labels for the three code points, which (nearly) nobody understands. Another of the many Unicode just so stories. :-) --Ken From richard.wordingham at ntlworld.com Mon Oct 5 14:58:48 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 5 Oct 2015 20:58:48 +0100 Subject: Deleting Lone Surrogates In-Reply-To: References: <20151004140201.21c9f941@JRWUBU2> <56117E4F.9010300@ix.netcom.com> <561263F6.1000308@it.aoyama.ac.jp> Message-ID: <20151005205848.1b622a08@JRWUBU2> On Mon, 5 Oct 2015 16:51:25 +0200 Philippe Verdy wrote: > 2015-10-05 13:50 GMT+02:00 Martin J. D?rst : > > > In an editing tool (of which an editing interface is a part of), a > > lone surrogate should just be removed! Apparently, that's what > > happens in Richard's case, but only eventually. > Not silently ! Even if this removal is required to go on editing, > this must be notified to the user as it may occur in unedited parts > of the file (and it may be the sign that the document is not fully > plain text, so the user should not save the edited file) > If this is caused by a quirk in the user input (defect of the input > mode or keyboard layout), there should be a notification. The lone surrogates (as I surmise) in this case are caused by the user input being misinterpreted. The sequence of strings delivered to a program running X receiving the same sequence of keystrokes is U+1148F, U+114C0, U+0008, U+114BF, and I have no reason to doubt that the offending program is receiving the same sequence. My working hypothesis is that this is being simplified to U+1148F, U+D805, U+114BF; the presence of U+D805 is a program error. I can reproduce the problem in a previously empty file. Now, on Windows, old MS keyboards at least deliver supplementary characters in a pair of WM_CHAR messages. If one of these ligatures were corrupted so that only the first of the messages was delivered, it is not obvious to me how a program would readily detect the omission. It would only become obvious when the start of the next *character* was received. Richard. 
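Richard's last point can be made concrete: a consumer of UTF-16 code units (which is what successive WM_CHAR messages carry) cannot classify a high surrogate until the next unit, or the end of input, arrives. A sketch of that deferred check (the class below is hypothetical, not any Windows API):

    final class CodeUnitChecker {
        private int pendingHigh = -1; // a high surrogate waiting for its mate

        void feed(char unit) {
            if (pendingHigh >= 0) {
                if (Character.isLowSurrogate(unit)) {
                    // The pair completes a supplementary character.
                    int cp = Character.toCodePoint((char) pendingHigh, unit);
                    System.out.printf("character U+%04X%n", cp);
                    pendingHigh = -1;
                    return;
                }
                // Only now is the earlier unit known to be a lone surrogate.
                System.out.printf("lone surrogate U+%04X%n", pendingHigh);
                pendingHigh = -1;
            }
            if (Character.isHighSurrogate(unit)) {
                pendingHigh = unit;       // cannot decide yet
            } else if (Character.isLowSurrogate(unit)) {
                System.out.printf("lone surrogate U+%04X%n", (int) unit);
            } else {
                System.out.printf("character U+%04X%n", (int) unit);
            }
        }

        void endOfInput() {               // a trailing high surrogate is also lone
            if (pendingHigh >= 0) System.out.printf("lone surrogate U+%04X%n", pendingHigh);
            pendingHigh = -1;
        }
    }

Feeding it the units of U+1148F, then a bare U+D805, then U+114BF reports the lone U+D805 only when the first unit of U+114BF arrives, which is exactly the delay described above.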
From verdy_p at wanadoo.fr Mon Oct 5 17:37:31 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 6 Oct 2015 00:37:31 +0200 Subject: Why Nothing Ever Goes Away (was: Re: Acquiring DIS 10646) In-Reply-To: <5612D05D.7000407@att.net> References: <20151005082457.665a7a7059d7ee80bb4d670165c8327d.c32c0ba233.wbe@email03.secureserver.net> <5612D05D.7000407@att.net> Message-ID: 2015-10-05 21:32 GMT+02:00 Ken Whistler : > > On 10/5/2015 8:24 AM, Doug Ewell wrote: > >> I too am puzzled as to what DIS 10646 and C1 control pictures have to do >> with each other. >> >> > What an *excellent* cue to start a riff on arcane Unicode history! > > First, let me explain what I think Sean Leonard's concern here is. > > 1. On 10/4/2015 5:30 AM, Sean wrote: "I proposed adding C1 Control > Pictures to > Unicode. ... The requirement is that all glyphs for U+0000 - U+00FF be > graphically distinct." > > Ah, but Sean has noticed that of all the representative glyphs we use > in the current code charts for C1 control codes, exactly *3* of them share > an odd glyph. U+0080, U+0081, and U+0099 use the same dotted box > with an "XXX" in it. That creates a conflict with the requirement that > Sean has stated for glyphs for *graphic symbols for* control codes, > presumably for addition to the 2400 Control Pictures block and some > extensions elsewhere, each with a visually distinct representation. Good remark, but that does not mean that we really need to encode new code points for C1 control pictures. What is really needed is to change their representative glyph in charts: their dotted box should better include "0080", "0081" and "0099" in them rather than "XXX", if those C1 positions don't have any *agreed* ASCII-letters aliases (though their common abbreviations are listed in the English Wikipedia article as "PAD", "HOP", and "SGCI" respectively). https://en.wikipedia.org/wiki/C0_and_C1_control_codes Note this old L2 discussion note for their unspecified aliases by Ken Whistler: http://www.unicode.org/L2/L2011/11281-control-aliases.txt -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Mon Oct 5 17:57:23 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 6 Oct 2015 00:57:23 +0200 Subject: Why Nothing Ever Goes Away (was: Re: Acquiring DIS 10646) In-Reply-To: References: <20151005082457.665a7a7059d7ee80bb4d670165c8327d.c32c0ba233.wbe@email03.secureserver.net> <5612D05D.7000407@att.net> Message-ID: Also the aliases for C1 controls were formally registered in 1983 only for the two ranges U+0084..U+0097 and U+009B..U+009F for ISO 6429. So the abbreviation (and name) aliases given to: - U+0082 (BPH = BREAK PERMITTED HERE), - U+0083 (NBH = NO BREAK HERE), - U+0098 (SOS = START OF STRING) and - U+009A (SCI = SINGLE CHARACTER INTRODUCER) are also debatable (but they may have other sources than just ISO 6429, probably from IBM for its proprietary EBCDIC-based systems). In that case the same sources could have given names/abbreviations to U+0080, U+0081 and U+0099. The problem could be that their late mapping from EBCDIC to some ISO 8859-compatible encoding was still fuzzy before some date, or was also fuzzy within EBCDIC-based encodings themselves across their versions or in their implementations and applications on those systems. Anyway, those aliased names and abbreviations have been published by Unicode and should remain stable now.
2015-10-06 0:37 GMT+02:00 Philippe Verdy : > 2015-10-05 21:32 GMT+02:00 Ken Whistler : > >> >> On 10/5/2015 8:24 AM, Doug Ewell wrote: >> >>> I too am puzzled as to what DIS 10646 and C1 control pictures have to do >>> with each other. >>> >>> >> What an *excellent* cue to start a riff on arcane Unicode history! >> >> First, let me explain what I think Sean Leonard's concern here is. >> >> 1. On 10/4/2015 5:30 AM, Sean wrote: "I proposed adding C1 Control >> Pictures to >> Unicode. ... The requirement is that all glyphs for U+0000 - U+00FF be >> graphically distinct." >> >> Ah, but Sean has noticed that of all the representative glyphs we use >> in the current code charts for C1 control codes, exactly *3* of them share >> an odd glyph. U+0080, U+0081, and U+0099 use the same dotted box >> with an "XXX" in it. That creates a conflict with the requirement that >> Sean has stated for glyphs for *graphic symbols for* control codes, >> presumably for addition to the 2400 Control Pictures block and some >> extensions elsewhere, each with a visually distinct representation. > > > Good remark, but that does not mean that we really need to encode new code > points for C1 control pictures. > > What is really needed is to change their representative glyph in charts: > their dotted box should better include "0080", "0081" and "0099" in them > rather than "XXX", if those C1 positions don't have any *agreed* > ASCII-letters aliases (though their common abbreviations are listed in the > English Wikipedia article as "PAD", "HOP", and "SGCI" respectively). > > https://en.wikipedia.org/wiki/C0_and_C1_control_codes > > Note this old L2 discussion note for their unspecified aliases by Ken > Whistler: > > http://www.unicode.org/L2/L2011/11281-control-aliases.txt > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Mon Oct 5 18:26:16 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 6 Oct 2015 01:26:16 +0200 Subject: Scope of Unicode Character Properties (was: Re: Deleting Lone Surrogates) In-Reply-To: <5612AF4B.9040406@att.net> References: <20151004140201.21c9f941@JRWUBU2> <20151004203025.605f0ae6@JRWUBU2> <5612AF4B.9040406@att.net> Message-ID: 2015-10-05 19:11 GMT+02:00 Ken Whistler : > However, it would be reasonable (and permitted) for an API to actually > report a default value for a surrogate code point (i.e., treating it more > or less like the reserved code point U+50005 that Markus mentioned). > Unassigned (reserved) code points, when followed by an assigned combining mark, would still be treated as starters of a combining sequence by default. This is not (IMHO) desirable for lone surrogates, which should better be handled in isolation, independently of what follows them. My opinion is that they should be treated like new line controls, so that the combining mark after one will also be separated into a defective combining sequence without any starter (e.g. 000A 0302 creates two clusters; this should be the same for D800 0302. D800 will have no defined glyph to render, but the glyph for U+FFFD may be displayed, or just a ".notdef" tofu box). Now for break opportunities, those lone surrogates should not create a newline or paragraph break opportunity, but they may create a word break opportunity to allow their easy separation and selection by a double-click on this tofu in an editor; they may even create a syllable break opportunity before and after them to allow wrapping long lines there.
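One way to prototype that tailoring (to be clear: this is a hypothetical rule, not what UAX #29 currently specifies): refuse to let a combining mark attach to a surrogate, exactly as it cannot attach across a newline. A grossly simplified sketch in Java:

    // Hypothetical, simplified boundary test between two adjacent code
    // points: a nonspacing mark normally attaches to what precedes it,
    // but not to a newline -- and, under this tailoring, not to a lone
    // surrogate either, so D800 + U+0302 stays two separate clusters.
    static boolean boundaryBetween(int a, int b) {
        if (Character.getType(b) != Character.NON_SPACING_MARK) return true;
        return a == '\n' || Character.getType(a) == Character.SURROGATE;
    }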
Those adaptations, however, are not described at all in the annexes about text segmentation. So those surrogates (which are permanently assigned) could have their own code point properties more formally defined. In my opinion, handling them like U+0000 is much better than handling them like U+50005, which should stay reserved and be handled as a standard starter with default combining class 0. Also, those lone surrogates should be Bidi-neutral (imagine they occur in the middle of some Arabic text: they should probably not change the direction of the surrounding text and should not alter the embedding context). -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Mon Oct 5 19:08:26 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 6 Oct 2015 02:08:26 +0200 Subject: Unicode in passwords In-Reply-To: <20151005151200.GA7379@laperouse.bortzmeyer.org> References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <20151005151200.GA7379@laperouse.bortzmeyer.org> Message-ID: NFC is probably not the best choice for passwords. It should probably be NFKC. Look also at the recent proposed update of UAX #31, and consider the special case where an application does not want passwords to be case-significant, but accepts something other than just ASCII letters: it will then be necessary to apply some closure for NFKC. Finally, note that passwords are not necessarily single identifiers (whitespaces and word separators are accepted, but whitespaces should require special handling with trimming (at both ends) and compression of multiple occurrences). It would also be necessary to make sure that acceptable passwords at least begin with an XID_Start character. Maybe all this discussion could be a new section in UAX #31 to take into account the possible presence of whitespaces (for "pass phrases" which are not really "identifiers") in "Medial" positions: define a profile as described in UAX #31 to add whitespaces to "Medial" and remove them from excluded characters, and possibly extend the set of "Start" to more than just XID_Start (e.g. you could use some punctuation like '!' or a mathematical sign like '+', and possibly also accept non-decimal digits that are preserved after NFKC closure).
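For illustration, this kind of profile maps naturally onto ICU4J's Normalizer2, which exposes NFKC plus case folding as a single stable transform. A sketch only (the whitespace handling is the suggestion above, not anything mandated by UAX #31 or RFC 7613, and the class name is invented):

    import com.ibm.icu.text.Normalizer2;

    public final class PassphrasePrep {
        // Sketch only: trim both ends, compress internal whitespace runs,
        // then apply NFKC_Casefold (stable, unlike mapping to "lowercase").
        public static String prepare(String raw) {
            String compressed = raw.trim().replaceAll("\\s+", " ");
            return Normalizer2.getNFKCCasefoldInstance().normalize(compressed);
        }
    }

The case-folding step would of course only be applied when the application really wants passwords to be case-insensitive.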
URL: From verdy_p at wanadoo.fr Mon Oct 5 19:23:45 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 6 Oct 2015 02:23:45 +0200 Subject: Unicode in passwords In-Reply-To: References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <20151005151200.GA7379@laperouse.bortzmeyer.org> Message-ID: Also some people may want to use now emojis within their passwords or pass phrases (they are now very common on most smartphones and layouts for tactile screens or in instant messaging applications used on desktops, using mouse clicks or taps for selecting them). But I would not recommend them for encrypting bootable disks or in BIOS/UEFI boot environments without support for extended input methods and rich graphics to render them on basic text consoles, unless they are part of a national encoding standard and supported natively). For boot environments, you'll be limited by the local hardware support, but if there's such a support (keyboard or font), it may be helpful to include some extra symbols, to block remote accesses without this native support (e.g. on Japanese systems, you could use the extra keys found only on Japanese keyboards and you won't be able to control the system without the appropriate device recognized in the booting environment). 2015-10-06 2:08 GMT+02:00 Philippe Verdy : > NFC is probably not the best choice for passwords. It should probably be > NFKC > > Look also in the recent proposed update for UAX #31, and consider the > special case where an application does not want passwords to be > case-significant, but accepts using something else than just ASCII letters: > it will be then necessry to apply some closure for NFKC. > Finally note that passwords are not necessarily single identifiers > (whitespaces and word separators are accepted, but whitespaces should > require special handling with trimming (at both ends) and compression of > multiple occurences. It would also be necessay to make sure that acceptable > passwords at least begin with an XID_Start character. > > May be all this discussion could be a new section in UAX #31 to take into > account the possible presence of whitespaces (for "pass phrases" which are > not really "identifiers") in "Medial" positions : define a profile as > described in UAX #31 to add whitespaces in "Medial" and remove them from > excluded characters, and possibly extend the set of "Start" to more than > just XID_Start (e.g. you could use some punctuation like '!' or > mathematical sign like '+', and possibly also accept non-decimal digits > that are preserved after NFKC closure) > > > > 2015-10-05 17:12 GMT+02:00 Stephane Bortzmeyer : > >> On Wed, Sep 30, 2015 at 04:15:30PM -0700, >> Clark S. Cox III wrote >> a message of 73 lines which said: >> >> > You really wouldn?t want ?Schl?ssel? and ?Schl?ssel? being different >> > passwords, would you? (assuming that my mail client and/or OS is not >> > interfering, the first is NFC, while the second is NFD) >> >> Hence the RFC 7613, mentioned already here by Marc Blanchet, that you >> must really read if you're interesed in Unicode passwords. >> >> In that case, the RFC is clear: NFC mandatory (and UTF-8 encoding). >> >> 4. Normalization Rule: Unicode Normalization Form C (NFC) MUST be >> applied to all characters. >> >> > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From duerst at it.aoyama.ac.jp Mon Oct 5 22:37:58 2015 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Tue, 6 Oct 2015 12:37:58 +0900 Subject: Unicode in passwords In-Reply-To: References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <000601d0fbff$42881070$c7983150$@gmail.com> Message-ID: <56134216.3040401@it.aoyama.ac.jp> Some additional concerns: - Input methods for Chinese, Japanese,... need visual feedback to check that the correct Han character was selected. That may show (some parts of) the password to bystanders. - Length limitations of 8 bytes are few and far between these days, but they still exist. Even where they are gone, they may have been replaced with "safe" limitations, say e.g. 50 bytes. That may still be pretty restrictive for some languages when using UTF-8. - There may occasionally be different length limitations for different kinds of access with the same password. That can create very difficult situations where the length limitation cuts off part of a UTF-8 byte sequence. - Some interfaces try to estimate the 'quality' of a password on password creation. Short passwords, or passwords with only lower-case Latin letters, may be rejected, others labeled as 'medium safe', and so on. A password with lots of bytes may be labeled as 'excellent' even though it consists of characters all taken from the same small script, and thus has rather low entropy. Of course, there's the effect that at least for a while, the bad guys may think it's too bothersome to try non-ASCII passwords, so that may temporarily make them somewhat safer. Regards, Martin. On 2015/10/01 14:01, Mark Davis ☕️ wrote: > I've heard some concerns, mostly around the UI for people typing in > passwords; that they get frustrated when they have to type their password > on different devices: > > 1. A device may not have keyboard mappings with all the keys for their > language. > 2. The keyboard mappings across devices vary where they put keys, > especially for minority script characters using some pattern of > shift/alt/option/etc.. So the pattern of keys that they use on one may be > different than on another. > 3. People are often 'blind' to the characters being entered: they just > see a dot, for example. If the keyboards for their language are not > standard, then that makes it difficult. > 4. Even if they see, for an instant, the character they type, if the > device doesn't have a font for their language's characters, it may be just > a box. > 5. Even if those are not true, the glyph may not be distinctive enough > if the size is too small. > > > > Mark > > *« Il meglio è l'inimico del bene »* > > On Thu, Oct 1, 2015 at 6:11 AM, Jonathan Rosenne > wrote: > >> For languages such as Java, passwords should be handled as byte arrays >> rather than strings. This may make it difficult to apply normalization. >> >> >> >> Jonathan Rosenne >> >> >> >> *From:* Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of *Clark >> S. Cox III >> *Sent:* Thursday, October 01, 2015 2:16 AM >> *To:* Hans Åberg >> *Cc:* unicode at unicode.org; John O'Conner >> *Subject:* Re: Unicode in passwords >> >> >> >> >> >> On 2015/09/30, at 13:29, Hans Åberg wrote: >> >> >> >> >> >> On 30 Sep 2015, at 18:33, John O'Conner wrote: >> >> Can you recommend any documents to help me understand potential issues (if >> any) for password policies and validation methods that allow characters >> from more "exotic" portions of the Unicode space?
>> >> >> On UNIX computers, one computes a hash (like SHA-256), which is then used >> to authenticate the password up to a high probability. The hash is stored >> in the open, but it is not known how to compute the password from the hash, >> so knowing the hash does not easily allow authentication. >> >> So if the password is >> >> >> >> ? normalized and then ? >> >> >> >> encoded in say UTF-8 and then hashed, it would seem to take care of most >> problems. >> >> >> >> You really wouldn't want "Schlüssel" and "Schlüssel" being different >> passwords, would you? (assuming that my mail client and/or OS is not >> interfering, the first is NFC, while the second is NFD) >> > From duerst at it.aoyama.ac.jp Mon Oct 5 22:39:36 2015 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Tue, 6 Oct 2015 12:39:36 +0900 Subject: Unicode in passwords In-Reply-To: <000601d0fbff$42881070$c7983150$@gmail.com> References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <000601d0fbff$42881070$c7983150$@gmail.com> Message-ID: <56134278.6010508@it.aoyama.ac.jp> On 2015/10/01 13:11, Jonathan Rosenne wrote: > For languages such as Java, passwords should be handled as byte arrays rather than strings. This may make it difficult to apply normalization. Well, they should be received from the user interface as strings, then normalized, then converted to byte arrays using a well-defined single encoding. Somewhat tedious, but hopefully not difficult. Regards, Martin. From yoriyuki.yamagata at aist.go.jp Mon Oct 5 22:57:51 2015 From: yoriyuki.yamagata at aist.go.jp (Yoriyuki Yamagata) Date: Tue, 6 Oct 2015 12:57:51 +0900 Subject: Unicode in passwords In-Reply-To: References: Message-ID: <57223A23-037F-4925-B2B1-EA1F4930E3ED@aist.go.jp> Dear John, FYI, IETF is working on this issue. See Internet Draft https://tools.ietf.org/html/draft-ietf-precis-saslprepbis-17 based on PRECIS framework RFC 7564 https://tools.ietf.org/html/rfc7564 Best, > 2015/10/01 1:33?John O'Conner ????? > > I'm researching potential problems and best practices for password policies that allow non-Latin-1 Unicode characters. My searching of the unicode.org site showed me a general security considerations document (UTR #36) but nothing specific for password policies using Unicode. > > Can you recommend any documents to help me understand potential issues (if any) for password policies and validation methods that allow characters from more "exotic" portions of the Unicode space? > > Best regards, > John O'Conner > ? Yoriyuki Yamagata National Institute of Advanced Science and Technology (AIST), Senior Researcher http://staff.aist.go.jp/yoriyuki.yamagata/en/ From bortzmeyer at nic.fr Tue Oct 6 03:48:14 2015 From: bortzmeyer at nic.fr (Stephane Bortzmeyer) Date: Tue, 6 Oct 2015 10:48:14 +0200 Subject: Unicode in passwords In-Reply-To: <57223A23-037F-4925-B2B1-EA1F4930E3ED@aist.go.jp> References: <57223A23-037F-4925-B2B1-EA1F4930E3ED@aist.go.jp> Message-ID: <20151006084814.GA17135@laperouse.bortzmeyer.org> On Tue, Oct 06, 2015 at 12:57:51PM +0900, Yoriyuki Yamagata wrote a message of 33 lines which said: > FYI, IETF is working on this issue.
See Internet Draft > https://tools.ietf.org/html/draft-ietf-precis-saslprepbis-17 based > on PRECIS framework RFC 7564 https://tools.ietf.org/html/rfc7564 As already mentioned on that list, the draft is no longer a draft; it was published as an RFC, RFC 7613, two months ago. From mark at macchiato.com Tue Oct 6 04:21:42 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Tue, 6 Oct 2015 11:21:42 +0200 Subject: Unicode in passwords In-Reply-To: <20151006084814.GA17135@laperouse.bortzmeyer.org> References: <57223A23-037F-4925-B2B1-EA1F4930E3ED@aist.go.jp> <20151006084814.GA17135@laperouse.bortzmeyer.org> Message-ID: While I think that RFC is useful, it has been interesting just how many of the problems recounted on this list go far beyond it, often having to do with UI issues. It would be useful to have a paper somewhere that organizes all of the problems presented here, and maybe makes a stab at describing techniques for handling them. Mark *« Il meglio è l'inimico del bene »* On Tue, Oct 6, 2015 at 10:48 AM, Stephane Bortzmeyer wrote: > On Tue, Oct 06, 2015 at 12:57:51PM +0900, > Yoriyuki Yamagata wrote > a message of 33 lines which said: > > > FYI, IETF is working on this issue. See Internet Draft > > https://tools.ietf.org/html/draft-ietf-precis-saslprepbis-17 based > > on PRECIS framework RFC 7564 https://tools.ietf.org/html/rfc7564 > > As already mentioned on that list, the draft is no longer a draft; it > was published as an RFC, RFC 7613, two months ago. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jcb+unicode at inf.ed.ac.uk Tue Oct 6 05:25:40 2015 From: jcb+unicode at inf.ed.ac.uk (Julian Bradfield) Date: Tue, 6 Oct 2015 11:25:40 +0100 Subject: Unicode in passwords References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <20151005151200.GA7379@laperouse.bortzmeyer.org> Message-ID: On 2015-10-06, Philippe Verdy wrote: > Finally, note that passwords are not necessarily single identifiers > (whitespaces and word separators are accepted, but whitespaces should > require special handling with trimming (at both ends) and compression of > multiple occurrences). Why would you trim or compress whitespace? Using multiple spaces seems a perfectly legitimate way of making a password harder to guess. -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From lists+unicode at seantek.com Tue Oct 6 06:14:15 2015 From: lists+unicode at seantek.com (Sean Leonard) Date: Tue, 6 Oct 2015 04:14:15 -0700 Subject: Why Nothing Ever Goes Away In-Reply-To: <5612D05D.7000407@att.net> References: <20151005082457.665a7a7059d7ee80bb4d670165c8327d.c32c0ba233.wbe@email03.secureserver.net> <5612D05D.7000407@att.net> Message-ID: <5613AD07.9060401@seantek.com> On 10/5/2015 12:32 PM, Ken Whistler wrote: > > On 10/5/2015 8:24 AM, Doug Ewell wrote: >> I too am puzzled as to what DIS 10646 and C1 control pictures have to do >> with each other. >> > > What an *excellent* cue to start a riff on arcane Unicode history! > > First, let me explain what I think Sean Leonard's concern here is. > > 1. On 10/4/2015 5:30 AM, Sean wrote: "I proposed adding C1 Control > Pictures to > Unicode. ... The requirement is that all glyphs for U+0000 - U+00FF be > graphically distinct." > [...] > 2. The Unicode code charts are (deliberately) vague about U+0080, U+0081, > and U+0099. > [...] > 3.
Concerned about this gap, Sean did some due diligence research on the > web[...] Hence the need to track down a copy of DIS 10646 (meaning in > actuality, the appropriately numbered WG2 N666, "DIS 10646", dated > November 4, 1990). ??????????!!! ??????????!???????????????? ???????????? -Sean From lists+unicode at seantek.com Tue Oct 6 07:24:06 2015 From: lists+unicode at seantek.com (Sean Leonard) Date: Tue, 6 Oct 2015 05:24:06 -0700 Subject: Why Nothing Ever Goes Away In-Reply-To: References: <20151005082457.665a7a7059d7ee80bb4d670165c8327d.c32c0ba233.wbe@email03.secureserver.net> <5612D05D.7000407@att.net> Message-ID: <5613BD66.3080707@seantek.com> > 2. The Unicode code charts are (deliberately) vague about U+0080, U+0081, > and U+0099. All other C1 control codes have aliases to the ISO 6429 > set of control functions, but in ISO 6429, those three control codes > don't > have any assigned functions (or names). On 10/5/2015 3:57 PM, Philippe Verdy wrote: > Also the aliases for C1 controls were formally registered in 1983 only > for the two ranges U+0084..U+0097 and U+009B..U+009F for ISO 6429. If I may, I would appreciate another history lesson: In ISO 2022 / 6429 land, it is apparent that the C1 controls are mainly aliases for ESC 4/0 - 5/15. ( @ through _ ) This might vary depending on what is loaded into the C1 register, but overall, it just seems like saving one byte. Why was C1 invented in the first place? And, why did Unicode deem it necessary to replicate the C1 block at 0x80-0x9F, when all of the control characters (codes) were equally reachable via ESC 4/0 - 5/15? I understand why it is desirable to align U+0000 - U+007F with ASCII, and maybe even U+0000 - U+00FF with Latin-1 (ISO-8859-1). But maybe Windows-1252, MacRoman, and all the other non-ISO-standardized 8-bit encodings got this much right: duplicating control codes is basically a waste of very precious character code real estate. Sean PS I was not able to turn up ISO 6429:1983, but I did find ECMA-48, 4th Ed., December 1986, which has the following text: *** 5.4 Elements of the C1 Set These control functions are represented: - In a 7-bit code by 2-character escape sequences of the form ESC Fe, where ESC is represented by bit combination 01/11 and Fe is represented by a bit combination from 04/00 to 05/15. - In an 8-bit code by bit combinations from 08/00 to 09/15. *** This text is seemingly repeated in many analogous standards ca. ~1974 - ~1992. PPS I happen to have a copy of ANSI X3.41-1974 "American National Standard Code Extension Techniques for Use with the 7-Bit Coded Character Set of [ASCII]". The invention/existence of C1 goes back to this time, as does the use of ESC Fe to invoke C1 characters in a 7-bit code, and 0x80-0x9F to invoke C1 characters in an 8-bit code. (See, in particular, Clauses 5.3.3.1 and 5.3.6). In particular, Clause 7.3.1.2 says: "The use of ESC Fe sequence in an 8-bit environment is contrary to the intention of this standard but, should they occur, their meaning is the same as in the 7-bit environment." I can appreciate why it was desirable to "fold" C1 in an 8-bit environment into a 7-bit environment with ESC Fe. (If, in fact, that was the direction of standardization: invent a new thing and then devise a coding to express the new thing in the old thing.) It is less obvious why Unicode adopted C1, however, when the trend was to jettison the 94-character Tetris block assignments in favor of a wide-open field for character assignment. 
Except for the trend in Unicode to "avoid assigning characters when explicitly asked, unless someone implements them without asking, and the implementation catches on, and then just assign the whole lot of them, even when they overlap with existing assignments, and then invent composite characters, which further compound the possible overlapping combinations". ?? From verdy_p at wanadoo.fr Tue Oct 6 08:04:44 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 6 Oct 2015 15:04:44 +0200 Subject: Unicode in passwords In-Reply-To: <56134278.6010508@it.aoyama.ac.jp> References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <000601d0fbff$42881070$c7983150$@gmail.com> <56134278.6010508@it.aoyama.ac.jp> Message-ID: Note that Java strings DO allow the presence of lone surrogates, as well as non-characters, because Java strings are unrestricted vectors of 16-bit code units (non-BMP characters are handled as pairs of surrogates). Under those conditions, normalizing the Java string will leave those lone surrogates (and non-characters) as is, or will throw an exception, depending on the API used. Java strings do not have any implied encoding (their "char" members are also unrestricted 16-bit code units; they have some basic properties, but only in the BMP, defined in the builtin Character class API: properties for non-BMP characters require using a library to provide them, such as ICU4J). This is essentially the same kind of thing as C/C++ "wide" strings using 16-bit wchar_t, except that: - C/C++ wide strings do not allow the inclusion of U+0000, which is a terminator, unless you use a string class keeping the actual string length (and not just the allocated buffer length, which may be larger). - Java strings, including literals, are immutable, and optionally atomized into a global dictionary, which includes all string literals to share the storage space of multiple instances with equal contents, including across distinct classes from distinct packages. - This is also true for string literals, which are all immutable and atomized, and initialized from the compiled bytecode of classes using a modified version of UTF-8 that preserves all 16-bit code units (including lone surrogates and non-characters like U+FFFF) but also stores U+0000 as <0xC0,0x80>. This modified UTF-8 encoding is also what you get if you use the JNI interface version with 8-bit strings (this internally requires a conversion by JNI, using a temporary buffer); if you use the JNI interface version with 16-bit strings, you work directly with the internal 16-bit Java strings and there's no conversion: you'll also get the lone surrogates and all non-characters, and you are not restricted to only valid UTF-16. - Java strings are commonly used for fast initialization of large immutable binary arrays, because the conversion from Modified-UTF-8 to 16-bit strings does not require running any compiled bytecode (this is not true for other static arrays, which require large code for array literals and are not guaranteed to be immutable: the alternative to this large compiled code is to initialize those large static arrays by I/O from an external stream, such as a file beside the class in the same package, and possibly packed in the same JAR). Java passwords are "strings" but still allow the inclusion of arbitrary 16-bit code units, even if they violate UTF-16 restrictions. You will not get much difference if you use byte arrays, the only change being the difference of size of code units.
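Both behaviors are easy to observe with a small sketch (DataOutputStream.writeUTF is the JDK's documented modified-UTF-8 encoder; note that it prefixes its output with a two-byte length):

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class ModifiedUtf8Demo {
        public static void main(String[] args) throws IOException {
            String s = "A\uD805\u0000"; // a lone high surrogate and a NUL

            // Standard UTF-8: the encoder cannot represent the lone surrogate
            // and substitutes its default replacement byte; U+0000 is a single
            // 0x00 byte.
            System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));

            // Modified UTF-8: the surrogate is kept as the three-byte sequence
            // ED A0 85, and U+0000 becomes the two-byte pair C0 80.
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            new DataOutputStream(bytes).writeUTF(s);
            System.out.println(Arrays.toString(bytes.toByteArray()));
        }
    }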
Between those two representations you are free to convert them with ANY pair of encodings, and not just assuming UTF-8<>UTF-16. However, for security reasons, it's best to avoid string literals for passwords, because they can be enumerated from the global dictionary of atomized strings, or directly by reading the byte code of the compiled class, where they are stored in modified UTF-8 but loaded and used as arbitrary 16-bit strings (but the same is true if you use a byte array literal! you can just parse the initialization byte code to get the list of bytes). If passwords or authorization keys are stored somewhere (as strings or as byte arrays), they should be encrypted into safe storage and not placed in static string literals or byte array initializers (they will BOTH be clear text in the bytecode of the compiled class). In both cases, there is NO normalization applied implicitly or checked/enforced by the API (the only check that occurs is at class loading time for the Modified-UTF-8 encoding of string literals: if it is wrong, the class will not load at all and you'll get an invalid class exception; there's no such check at all for the encoding of byte array initializers, the only checks being the validity of the Java initializer byte code and the bounds of array indexes used by the initializer code). 2015-10-06 5:39 GMT+02:00 Martin J. Dürst : > On 2015/10/01 13:11, Jonathan Rosenne wrote: >> For languages such as Java, passwords should be handled as byte arrays >> rather than strings. This may make it difficult to apply normalization. >> > > Well, they should be received from the user interface as strings, then > normalized, then converted to byte arrays using a well-defined single > encoding. Somewhat tedious, but hopefully not difficult. > > Regards, Martin. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Tue Oct 6 08:13:25 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 6 Oct 2015 15:13:25 +0200 Subject: Unicode in passwords In-Reply-To: References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <20151005151200.GA7379@laperouse.bortzmeyer.org> Message-ID: I don't think it is a good idea for textual passwords to make differences based on the number of spaces. Being plain text, they are likely to be displayed in user interfaces in a way where the user will not see the spaces. Without trimming, users won't see the initial or final space, and the password input method may not display them either (e.g. in an HTML input form, or when using a button to generate passphrases that users must then copy-paste to their password manager or to some private text document). Some password storages will also implicitly trim and compress those strings (e.g. in a fixed-width column of a table in a database). There's also frequently no visual hint when entering or displaying those spaces, and compression occurs implicitly, or pass phrases may be line-wrapped in the middle, where you won't see the number of spaces. 2015-10-06 12:25 GMT+02:00 Julian Bradfield : > On 2015-10-06, Philippe Verdy wrote: > > Finally, note that passwords are not necessarily single identifiers > > (whitespaces and word separators are accepted, but whitespaces should > > require special handling with trimming (at both ends) and compression of > > multiple occurrences). > > Why would you trim or compress whitespace? Using multiple spaces seems a > perfectly legitimate way of making a password harder to guess.
> > -- > The University of Edinburgh is a charitable body, registered in > Scotland, with registration number SC005336. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Tue Oct 6 08:27:36 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 6 Oct 2015 15:27:36 +0200 Subject: Unicode in passwords In-Reply-To: <20151006084814.GA17135@laperouse.bortzmeyer.org> References: <57223A23-037F-4925-B2B1-EA1F4930E3ED@aist.go.jp> <20151006084814.GA17135@laperouse.bortzmeyer.org> Message-ID: And there are severe issues in this RFC for its case mapping profile: it requires converting "uppercase" characters to "lowercase", but these properties are not stable (see for example the history of Cherokee letters, changed from gc=Lo to gc=Lu when lowercase letters were added, with case pairs added at the same time; see also the addition of the capital sharp S for German). That RFC should have used the Unicode "Case Folding" algorithm, which is stable (case-folded strings are NOT necessarily all lowercase, they are just guaranteed to keep a single case variant, and case folding implies the use of compatibility normalization forms, i.e. NFKC or NFKD, to get the correct closure: the standard Unicode normalizations are also stable)! 2015-10-06 10:48 GMT+02:00 Stephane Bortzmeyer : > On Tue, Oct 06, 2015 at 12:57:51PM +0900, > Yoriyuki Yamagata wrote > a message of 33 lines which said: > > > FYI, IETF is working on this issue. See Internet Draft > > https://tools.ietf.org/html/draft-ietf-precis-saslprepbis-17 based > > on PRECIS framework RFC 7564 https://tools.ietf.org/html/rfc7564 > > As already mentioned on that list, the draft is no longer a draft, it > was published as an RFC, RFC 7613, two months ago > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Tue Oct 6 08:57:37 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 6 Oct 2015 15:57:37 +0200 Subject: Why Nothing Ever Goes Away In-Reply-To: <5613BD66.3080707@seantek.com> References: <20151005082457.665a7a7059d7ee80bb4d670165c8327d.c32c0ba233.wbe@email03.secureserver.net> <5612D05D.7000407@att.net> <5613BD66.3080707@seantek.com> Message-ID: 2015-10-06 14:24 GMT+02:00 Sean Leonard : > 2. The Unicode code charts are (deliberately) vague about U+0080, U+0081, >> and U+0099. All other C1 control codes have aliases to the ISO 6429 >> set of control functions, but in ISO 6429, those three control codes don't >> have any assigned functions (or names). >> > > On 10/5/2015 3:57 PM, Philippe Verdy wrote: > >> Also the aliases for C1 controls were formally registered in 1983 only >> for the two ranges U+0084..U+0097 and U+009B..U+009F for ISO 6429. >> > > If I may, I would appreciate another history lesson: > In ISO 2022 / 6429 land, it is apparent that the C1 controls are mainly > aliases for ESC 4/0 - 5/15. ( @ through _ ) This might vary depending on > what is loaded into the C1 register, but overall, it just seems like saving > one byte. > > Why was C1 invented in the first place? > Look for the history of EBCDIC and its adaptation/conversion with ASCII-compatible encodings: round-trip conversion was needed (using only a simple reordering of byte values, with no duplicates). EBCDIC has used many controls that were not part of C0 and were kept in the C1 set.
Ignore the 7-bit compatibility encoding using pairs; they were only needed for ISO 2022, but ISO 6429 defines a profile where those longer sequences are not needed and even forbidden in 8-bit contexts, or in contexts where aliases are undesirable and invalidated, such as security environments. With your thoughts, I would conclude that assigning characters in the G1 set was also a duplicate, because it is reachable with a C0 "shifting" control + a position of the G0 set. In that case ISO 8859-1 or Windows 1252 was also an unneeded duplication! And we would live today in a 7-bit only world. C1 controls have their own identity. The 7-bit encoding using ESC is just a hack to make them fit in 7 bits, and it only works where the ESC control is assumed to play this function according to ISO 2022, ISO 6429, or other similar old 7-bit protocols such as Videotext (which was widely used in France with the free "Minitel" terminal, long before the introduction of the Internet to the general public around 1992-1995). Today Videotext is definitely dead (the old call numbers for this slow service are now definitely defunct, the Minitels are recycled waste; they stopped being distributed and were replaced by applications on PCs connected to the Internet, but now all the old services are directly on the Internet and none of them use 7-bit encodings for their HTML pages, or their mobile applications). France has also definitely abandoned its old French version of ISO 646; there are no longer any printers supporting versions of ISO 646 other than ASCII, but they still support various 8-bit encodings. 7-bit encodings are things of the past (they were only justified at times when communication links were slow and generated lots of transmission errors, and the only implemented mechanism to check them was to use a single parity bit per character). Today we transmit long datagrams and prefer using check codes for the whole (such as CRCs, or error-correcting codes). 8-bit encodings are much easier and faster to process for transmitting not just text but also binary data. Let's forget the 7-bit world definitely. We have also abandoned the old UTF-7 in Unicode! I've not seen it used anywhere except in a few old emails sent at the end of the 90's, because many mail servers were still not 8-bit clean and silently transformed non-ASCII bytes in unpredictable ways or using unspecified encodings, or just silently dropped the high bit, assuming it was just a parity bit: at that time, emails were not sent with SMTP, but with the old UUCP protocol, and could take weeks to be delivered to the final recipient, as there was still no global routing infrastructure and many hops were necessary via non-permanent modem links. My opinion of UTF-7 is that it was just a temporary and experimental solution to help system admins and developers adopt the new UCS, including for their old 7-bit environments. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jcb+unicode at inf.ed.ac.uk Tue Oct 6 09:31:22 2015 From: jcb+unicode at inf.ed.ac.uk (Julian Bradfield) Date: Tue, 6 Oct 2015 15:31:22 +0100 (BST) Subject: Unicode in passwords References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <20151005151200.GA7379@laperouse.bortzmeyer.org> Message-ID: On 2015-10-06, Philippe Verdy wrote: > I don't think it is a good idea for textual passwords to make differences > based on the number of spaces.
Being plain text, they are likely to be > displayed in user interfaces in a way that the user will not see. Without This is true of all passwords. Passwords have to be typed by finger memory, not by looking at them (unless you're the type who puts them on sticky notes, in which case you type by looking at the text on the note). One doesn't normally see the characters, at best a count of characters. > trimming, users won't see an initial or final space, and the password > input method may not display them either (e.g. in an HTML input form, or All browsers I use display spaces in input boxes, and put blobs for hidden fields. Do you have evidence for broken input fields? > when using a button to generate passphrases that users must then copy-paste > to their password manager or to some private text document). Copy-paste works on all my systems, too - do you have evidence of broken copy-paste in this way? > Some password > stores will also implicitly trim and compress those strings (e.g. in a If it compresses it on setting, but doesn't compress it on testing, or vice versa, then that's a bug. If it does the same for setting and testing, it doesn't matter (except to compromise the crack-resistance of the password). > fixed-width column of a table in a database). There's also frequently no > visual hint when entering or displaying those spaces, and compression occurs Evidence? Maybe if you're typing a password into a Word document it's hard to count spaces, but why would you be doing that? -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From asmus-inc at ix.netcom.com Tue Oct 6 10:33:51 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Tue, 6 Oct 2015 08:33:51 -0700 Subject: Why Nothing Ever Goes Away In-Reply-To: <5613BD66.3080707@seantek.com> References: <20151005082457.665a7a7059d7ee80bb4d670165c8327d.c32c0ba233.wbe@email03.secureserver.net> <5612D05D.7000407@att.net> <5613BD66.3080707@seantek.com> Message-ID: <5613E9DF.8020406@ix.netcom.com> An HTML attachment was scrubbed... URL: From doug at ewellic.org Tue Oct 6 11:02:57 2015 From: doug at ewellic.org (Doug Ewell) Date: Tue, 06 Oct 2015 09:02:57 -0700 Subject: Why Nothing Ever Goes Away Message-ID: <20151006090257.665a7a7059d7ee80bb4d670165c8327d.f7c4b8601c.wbe@email03.secureserver.net> Asmus Freytag (t) wrote: > Nobody wanted to follow the IBM code page 437 (then still the most > widely used single byte vendor standard). Although to this day, the UN/LOCODE manual [1] still refers to 437 as "the standard United States character set" and claims that it "conforms to these ISO standards" (8859-1:1987 and 10646-1:1993). [1] http://www.unece.org/fileadmin/DAM/cefact/locode/2015-1_UNLOCODE_SecretariatNotes.pdf > Also, the overloading of 0x80-0xFF by Windows did not happen all at > once, earlier versions had left much of that space open, And it's still not completely filled, in any of the 125x code pages except for the quirky 1256. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From asmus-inc at ix.netcom.com Tue Oct 6 12:14:06 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Tue, 6 Oct 2015 10:14:06 -0700 Subject: Unicode in passwords In-Reply-To: References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <20151005151200.GA7379@laperouse.bortzmeyer.org> Message-ID: <5614015E.3010302@ix.netcom.com> An HTML attachment was scrubbed...
URL: From unicode at lindenbergsoftware.com Tue Oct 6 12:39:08 2015 From: unicode at lindenbergsoftware.com (Norbert Lindenberg) Date: Tue, 6 Oct 2015 10:39:08 -0700 Subject: Unicode in passwords In-Reply-To: References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <000601d0fbff$42881070$c7983150$@gmail.com> <56134278.6010508@it.aoyama.ac.jp> Message-ID: > On Oct 6, 2015, at 6:04 , Philippe Verdy wrote: > > In those conditions, normalizing the Java string will leave those lone surrogates (and non-characters) as is, or will throw an exception, depending on the API used. Java strings do not have any implied encoding (their "char" members are also unrestricted 16-bit code units; they have some basic properties, but only in the BMP, defined in the built-in Character class API: properties for non-BMP characters require using a library to provide them, such as ICU4J). The Java Character class was enhanced in J2SE 5.0 to support supplementary characters. The String class was specified to be based on UTF-16, and string processing throughout the platform was updated to support supplementary characters based on UTF-16. These changes have been available to the public since 2004. For a summary, see http://www.oracle.com/technetwork/articles/java/supplementary-142654.html Norbert From jcb+unicode at inf.ed.ac.uk Tue Oct 6 14:13:12 2015 From: jcb+unicode at inf.ed.ac.uk (Julian Bradfield) Date: Tue, 6 Oct 2015 20:13:12 +0100 (BST) Subject: Unicode in passwords References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <20151005151200.GA7379@laperouse.bortzmeyer.org> <5614015E.3010302@ix.netcom.com> Message-ID: On 2015-10-06, Asmus Freytag (t) wrote: > All browsers I use display spaces in input boxes, and put blobs for > hidden fields. Do you have evidence for broken input fields? > >
> Network keys. That interface seems to consistently give people a > choice to reveal the key.
? That's not broken in the way Philippe was discussing. > Copy-paste works on all my systems, too - do you have evidence of > broken copy-paste in this way? > >
> I've seen input fields where sites don't allow paste on the second > copy (the confirmation copy).
>
> Even for non-password things.
That's not relevantly broken, either - it's a design feature, to make sure you can type the password again (from finger memory!). -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From richard.wordingham at ntlworld.com Tue Oct 6 14:19:27 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 6 Oct 2015 20:19:27 +0100 Subject: Unicode in passwords In-Reply-To: References: <57223A23-037F-4925-B2B1-EA1F4930E3ED@aist.go.jp> <20151006084814.GA17135@laperouse.bortzmeyer.org> Message-ID: <20151006201927.603269d9@JRWUBU2> On Tue, 6 Oct 2015 11:21:42 +0200 Mark Davis ☕️ wrote: > While I think that RFC is useful, it has been interesting just how > many of the problems recounted on this list go far beyond it, often > having to do with UI issues. It would be useful to have a paper > somewhere that organizes all of the problems presented here, and > maybe makes a stab at describing techniques for handling them. Indeed, there are several different scenarios. The most prototypical are: 1) Initial access to a stand-alone computing device, the conventional logging on. In this case, it is usually risky to use anything but printable ASCII. 2) Internet passwords for use in privacy. Basically any non-trivial combination of characters should be acceptable, provided it will not be mangled in transmission. Under the rules of Unicode, this means that the text should be normalised before becoming a mere sequence of bytes. Note that in the second scenario, there is normally an 'administrator' who can put things right. Richard. From richard.wordingham at ntlworld.com Tue Oct 6 14:57:34 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 6 Oct 2015 20:57:34 +0100 Subject: Unicode in passwords In-Reply-To: References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <20151005151200.GA7379@laperouse.bortzmeyer.org> <5614015E.3010302@ix.netcom.com> Message-ID: <20151006205734.038a869f@JRWUBU2> On Tue, 6 Oct 2015 20:13:12 +0100 (BST) Julian Bradfield wrote: > On 2015-10-06, Asmus Freytag (t) wrote: > > All browsers I use display spaces in input boxes, and put blobs for > > hidden fields. Do you have evidence for broken input fields? > > > >
> > Network keys. That interface seems to consistently give people a > > choice to reveal the key.
> > ? That's not broken in the way Philippe was discussing. No, but if you make the password up as you type it, you might not then notice that you accidentally typed a double space. > > Copy-paste works on all my systems, too - do you have evidence of > > broken copy-paste in this way? > > > >
> > I've seen input fields where sites don't allow paste on the > > second copy (the confirmation copy).
> >
> > Even for non-password things.
> > That's not relevantly broken, either - it's a design feature, to make > sure you can type the password again (from finger memory!). It's an interesting issue for a password that one can't type. It's by no means a guarantee, either. I once specified a new password that changed case in the middle, not realising that I had started with caps lock on. Consequently, both copies had the wrong capitalisation. I was using a wireless keyboard, which to conserve battery power doesn't have a caps lock indicator. (In the old days, caps lock would have physically locked, but that's not how keyboard drivers work nowadays.) It took a little while before it occurred to me that I might have had a problem with caps lock. Richard. From richard.wordingham at ntlworld.com Tue Oct 6 15:14:59 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 6 Oct 2015 21:14:59 +0100 Subject: Why Nothing Ever Goes Away In-Reply-To: References: <20151005082457.665a7a7059d7ee80bb4d670165c8327d.c32c0ba233.wbe@email03.secureserver.net> <5612D05D.7000407@att.net> <5613BD66.3080707@seantek.com> Message-ID: <20151006211459.0c9a4399@JRWUBU2> On Tue, 6 Oct 2015 15:57:37 +0200 Philippe Verdy wrote: > My opinion of UTF-7 is that > it was just a temporary and experimental solution to help system > admins and developers adopt the new UCS, including for their old > 7-bit environments. If you have a human controlling the interpretation, UTF-7 was a good way of avoiding data being mangled by interfaces that insisted that unlabelled (indeed, sometimes, unlabellable) 8-bit text was UTF-8 or, conversely, Latin-1 or code page 1252. The old Yahoo groups web interface for senders was pretty much restricted to 8-bit ISO-2022 encodings without it. C1 characters would be converted to Latin-1 on the assumption that they were Windows 1252. Browsers dropping UTF-7 support was a major inconvenience. Richard. From verdy_p at wanadoo.fr Tue Oct 6 15:43:55 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 6 Oct 2015 22:43:55 +0200 Subject: Unicode in passwords In-Reply-To: References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <20151005151200.GA7379@laperouse.bortzmeyer.org> Message-ID: 2015-10-06 16:31 GMT+02:00 Julian Bradfield : > On 2015-10-06, Philippe Verdy wrote: > > I don't think it is a good idea for textual passwords to make differences > > based on the number of spaces. Being plain text, they are likely to be > > displayed in user interfaces in a way that the user will not see. > Without > > This is true of all passwords. Passwords have to be typed by finger > memory, not by looking at them (unless you're the type who puts them > on sticky notes, in which case you type by looking at the text on the > note). One doesn't normally see the characters, at best a count of > characters. > > > trimming, users won't see an initial or final space, and the password > > input method may not display them either (e.g. in an HTML input form, or > > All browsers I use display spaces in input boxes, and put blobs for > hidden fields. Do you have evidence for broken input fields? > I was speaking of OUTPUT fields: you want to display passwords that are stored somewhere (including in a text document stored in some safe place such as an external flash drive). People can't remember many passwords.
Hiding them on screen is fake security; what we need is complex passwords (difficult to memorize, so we need a wallet to store them, though people will also **print** them rather than store them in an electronic format), and many passwords (one for each site or application requiring one). But they also want to be able to type them correctly: long passwords hidden on screen will not help much (hidden passwords in input forms are just there to avoid spying eyes on your screen, but people can still spy on your keystrokes...). If people are concerned by eyes, they'll need to hide their keyboard input (notably on touch screens!) but also their screen, by first making sure there's nobody around to look at what they do. If there's a camera, hiding the password on screen will not help either; it will still be easy to see your keystrokes. Biometric identification is another form of fake security (because it is immutable, whereas passwords can be and should be changed regularly), and it is extremely easy to duplicate a biometric data record (to be more effective, the physical sensor device should be internally secured and its internal data instantly flushed in case of intrusion, and this device should be securely authenticated in addition to performing the biometric check; but the biometric data should not be transmitted, instead it should be used to compute a secure hash from the hidden biometric data and negotiated and checked unique randomized data from the source requesting the access; it should use public key encryption with a pair of public/private key pairs, not symmetric keys, or triple key pairs if using another independent third party: the private keys will never be exchanged or duplicated). But at some point you'll need to reset those keys, and the only tool you'll have will be to use cleartext pass phrases, even if there's physical device identification, encryption with key pairs, and the extremely private biometric data. Unfortunately biometric data is now shared with governmental third parties, and even exchanged internationally (it is present on passports, and biometric passports are now mandatory for anyone taking a plane to/from/via the United States and now in many European countries as well; DNA traces are also very easy to capture. Biometric data is no longer private property; it cannot be used as a secret for access authentication or signatures). There's still nothing to replace pass phrases, and those need to be user friendly for their legitimate owners. -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Tue Oct 6 15:53:00 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 6 Oct 2015 22:53:00 +0200 Subject: Unicode in passwords In-Reply-To: <20151006205734.038a869f@JRWUBU2> References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <20151005151200.GA7379@laperouse.bortzmeyer.org> <5614015E.3010302@ix.netcom.com> <20151006205734.038a869f@JRWUBU2> Message-ID: 2015-10-06 21:57 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > It's an interesting issue for a password that one can't type. It's by > no means a guarantee, either. I once specified a new password that > changed case in the middle, not realising that I had started with caps > lock on. Consequently, both copies had the wrong capitalisation. I > was using a wireless keyboard, which to conserve battery power doesn't > have a caps lock indicator.
(In the old days, caps lock would have > physically locked, but that's not how keyboard drivers work nowadays.) > It took a little while before it occurred to me that I might have had a > problem with caps lock. > This is a demonstration that using case differences to add more combinations in short passwords is a bad design. Likewise, hiding typed input is not a good idea: we need at least a pressable button to reveal/confirm what we are typing. Instead of lettercase combinations limited to ASCII, it is highly preferable to extend the character repertoire to Unicode and accept letters in NFKC form, unified by case folding (NOT conversion to lowercase or uppercase, as that is not stable across Unicode versions). So we should define here the usable set of characters (and define characters that should be ignored and discarded if present on input). This should be a profile in UAX #31 (and we should issue a strong warning against the recent RFC that forgot the issue: its case-insensitive profile based on NFC and conversion to lowercase is definitely broken!) -------------- next part -------------- An HTML attachment was scrubbed... URL: From naz at gassiep.com Tue Oct 6 22:28:54 2015 From: naz at gassiep.com (Naz Gassiep) Date: Wed, 7 Oct 2015 14:28:54 +1100 Subject: Proposals for Arabic honorifics In-Reply-To: <5612292B.8040208@gassiep.com> References: <5612292B.8040208@gassiep.com> Message-ID: <56149176.5060404@gassiep.com> If there are no comments on this specific issue, could someone care to comment on the idea of writing a proposal that extends an existing proposal? Is this considered bad form, or is it OK so long as it doesn't unnecessarily raise conflicting proposals? - Naz. On 5/10/2015 6:39 PM, Naz Gassiep wrote: > Hi all, > We are considering writing a proposal for Arabic honorifics which are > missing from Unicode. There are already a few in there, notably U+FDFA > and U+FDFB. > > There are two existing proposals, L2/14-147 and L2/14-152, which each > propose additions. L2/14-147 proposes seventeen new characters and > L2/14-152 proposes a further two. > > There are a few other characters that are not included in these > proposals, and I was considering preparing a proposal of my own. I > will work with a team of people who are willing to contribute time to > this work. We are considering two options: > > 1. Prepare an additional proposal for the characters that were missing > from the existing spec and also from the two proposals mentioned above. > 2. Prepare a collating proposal which rolls the two proposals as well > as the others that we feel are missing into a single proposal. > > Currently, we favour the second option. We would ensure that full > descriptions, names, character properties, and detailed examples are > provided for each character to substantiate its use in modern plain > text. We would also suggest code points in line with the existing > proposal L2/14-147. > > We don't want to step on the toes of the original submitters, Roozbeh > Pournader or Lateef Sagar Shaikh. We wish to be clear that we will > draw on their existing proposals to the maximum extent possible to > ensure that we do not submit a conflicting proposal, but a superset > proposal that incorporates their proposals as well as the additional > characters we have identified. We have evaluated these two, and a true > superset proposal is possible such that no conflicts between either > those two proposals or our own will materialize.
> > Are there any issues that we may face in preparing and submitting our > proposal? > Any guidance from this mailing list would be highly valued. > Many thanks, > - Naz. From lisam at us.ibm.com Tue Oct 6 23:02:32 2015 From: lisam at us.ibm.com (Lisa Moore) Date: Tue, 6 Oct 2015 21:02:32 -0700 Subject: Proposals for Arabic honorifics In-Reply-To: <56149176.5060404@gassiep.com> References: <5612292B.8040208@gassiep.com> <56149176.5060404@gassiep.com> Message-ID: <201510070402.t9742crx026410@d01av03.pok.ibm.com> Hello Naz, Thank you for discussing your proposal on the unicode list. Not all experts monitor that list. That said, feel free to submit a proposal to "docsubmit at unicode.org". We look forward to seeing your proposal. Lisa From: Naz Gassiep To: unicode at unicode.org Date: 10/06/2015 08:50 PM Subject: Re: Proposals for Arabic honorifics Sent by: "Unicode" If there are no comments on this specific issue, could someone care to comment on the idea of writing a proposal that extends an existing proposal? Is this considered bad form, or is it OK so long as it doesn't unnecessarily raise conflicting proposals? - Naz. On 5/10/2015 6:39 PM, Naz Gassiep wrote: > Hi all, > We are considering writing a proposal for Arabic honorifics which are > missing from Unicode. There are already a few in there, notably U+FDFA > and U+FDFB. > > There are two existing proposals, L2/14-147 and L2/14-152, which each > propose additions. L2/14-147 proposes seventeen new characters and > L2/14-152 proposes a further two. > > There are a few other characters that are not included in these > proposals, and I was considering preparing a proposal of my own. I > will work with a team of people who are willing to contribute time to > this work. We are considering two options: > > 1. Prepare an additional proposal for the characters that were missing > from the existing spec and also from the two proposals mentioned above. > 2. Prepare a collating proposal which rolls the two proposals as well > as the others that we feel are missing into a single proposal. > > Currently, we favour the second option. We would ensure that full > descriptions, names, character properties, and detailed examples are > provided for each character to substantiate its use in modern plain > text. We would also suggest code points in line with the existing > proposal L2/14-147. > > We don't want to step on the toes of the original submitters, Roozbeh > Pournader or Lateef Sagar Shaikh. We wish to be clear that we will > draw on their existing proposals to the maximum extent possible to > ensure that we do not submit a conflicting proposal, but a superset > proposal that incorporates their proposals as well as the additional > characters we have identified. We have evaluated these two, and a true > superset proposal is possible such that no conflicts between either > those two proposals or our own will materialize. > > Are there any issues that we may face in preparing and submitting our > proposal? > Any guidance from this mailing list would be highly valued. > Many thanks, > - Naz. -------------- next part -------------- An HTML attachment was scrubbed...
URL: From jcb+unicode at inf.ed.ac.uk Wed Oct 7 04:59:57 2015 From: jcb+unicode at inf.ed.ac.uk (Julian Bradfield) Date: Wed, 7 Oct 2015 10:59:57 +0100 Subject: Unicode in passwords References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <20151005151200.GA7379@laperouse.bortzmeyer.org> Message-ID: On 2015-10-06, Philippe Verdy wrote: > I was speaking of OUTPUT fields : you want to display passwords that are > stored somewhere (including in a text document stored in some safe place > such as an external flash drive). People can't remember many passwords. Again, output fields (such as in the Firefox password manager), in my experience, display the text that is in them, not a stripped and compressed version. If they don't, it's a bug. If you start using passwords including NBSP and EM-DASH, then it's going to get a bit awkward - but you should know you're doing that, and take measures accordingly. > Hiding them on screen is a fake security, what we need is complex passwords > (difficult to memoize so we need a wallet to store them but people will > also **printing** them and not store them in a electronic format), and many It's questionable whether there is ever a need to print a password, except in the case of an automatically generated hard-copy password reset. My digital will (if I'd produced one) would need about half a dozen passwords, mainly the master password for the password manager, plus some sensitive finance and system admin ones. That's few enough to write down by hand (or type by hand into a text file), with appropriate notes. > passwords (one for each site or application requiring one). But they also > want to be able to type them correctly: long passwords hidden on screen Most of our students seem (when I see them logging in to give presentations) to have long passwords - 20-30 characters - and they don't seem to have a problem. This also illustrates why defaulting to hidden passwords is useful. > Biometric identification is also another fake security (because it is Not sure what this has to do with Unicode in passwords. > immutable, when passwords can be and should be changed regularly) and it is Bruce Schneier is one of the best known and most respected security researchers around today, and here's his advice: So in general: you don't need to regularly change the password to your computer or online financial accounts (including the accounts at retail sites); definitely not for low-security accounts. You should change your corporate login password occasionally, and you need to take a good hard look at your friends, relatives, and paparazzi before deciding how often to change your Facebook password. But if you break up with someone you've shared a computer with, change them all. ( https://www.schneier.com/blog/archives/2010/11/changing_passwo.html ) -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. 
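Across these exchanges, the one concrete recipe that keeps recurring is the one Martin Dürst stated earlier in this digest: receive the password as a string, normalize it, then convert it to bytes with a single well-defined encoding before hashing. A minimal sketch in Java (the class and method names are illustrative; NFC and SHA-256 are example choices, and a production system would use a salted, deliberately slow password-hashing function such as PBKDF2 or bcrypt rather than a bare digest):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.text.Normalizer;

    public class PasswordDigest {
        static byte[] digest(String password) throws NoSuchAlgorithmException {
            // Canonically equivalent inputs (precomposed vs. decomposed accents)
            // normalize to the same string, so they hash identically.
            String normalized = Normalizer.normalize(password, Normalizer.Form.NFC);
            byte[] utf8 = normalized.getBytes(StandardCharsets.UTF_8);
            return MessageDigest.getInstance("SHA-256").digest(utf8);
        }
    }

A case-insensitive variant would additionally apply Unicode case folding before hashing (e.g. ICU4J's UCharacter.foldCase), which is precisely the stability point argued over in the RFC 7613 exchange that follows.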
From bortzmeyer at nic.fr Wed Oct 7 06:16:16 2015 From: bortzmeyer at nic.fr (Stephane Bortzmeyer) Date: Wed, 7 Oct 2015 13:16:16 +0200 Subject: Unicode in passwords In-Reply-To: References: <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <20151005151200.GA7379@laperouse.bortzmeyer.org> <5614015E.3010302@ix.netcom.com> <20151006205734.038a869f@JRWUBU2> Message-ID: <20151007111615.GA25230@nic.fr> On Tue, Oct 06, 2015 at 10:53:00PM +0200, Philippe Verdy wrote a message of 72 lines which said: > it is highly preferable to extend the character repertoire to > Unicode and accept letters in NFKC form and unified by case folding As I said before, "the ship has sailed". RFC 7613 has been published, and uses NFC and case preservation. It is IMHO useless to reopen this discussion. > the recent RFC that forgot the issue : its case-insensitive profile > based on NFC and conversion to lowercase is definitely broken !) What is broken is your analysis. RFC 7613 does not convert passwords to lowercase. Indeed, it says exactly the opposite, which seems to indicate that you did not read it before calling it broken: Case-Mapping Rule: Uppercase and titlecase characters MUST NOT be mapped to their lowercase equivalents. From verdy_p at wanadoo.fr Wed Oct 7 06:46:06 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 7 Oct 2015 13:46:06 +0200 Subject: Unicode in passwords In-Reply-To: <20151007111615.GA25230@nic.fr> References: <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com> <20151005151200.GA7379@laperouse.bortzmeyer.org> <5614015E.3010302@ix.netcom.com> <20151006205734.038a869f@JRWUBU2> <20151007111615.GA25230@nic.fr> Message-ID: 2015-10-07 13:16 GMT+02:00 Stephane Bortzmeyer : > On Tue, Oct 06, 2015 at 10:53:00PM +0200, > Philippe Verdy wrote > a message of 72 lines which said: > > > it is highly preferable to extend the character repertoire to > > Unicode and accept letters in NFKC form and unified by case folding > > As I said before, "the ship has sailed". RFC 7613 has been published, > and uses NFC and case preservation. It is IMHO useless to reopen this > discussion. > Reread the RFC: it discusses the case-insensitive profile using NFC and conversion to lowercase; this is the bug. > > > the recent RFC that forgot the issue : its case-insensitive profile > > based on NFC and conversion to lowercase is definitely broken !) > > What is broken is your analysis. RFC 7613 does not convert passwords > to lowercase. Indeed, it says exactly the opposite, which seems to > indicate that you did not read it before calling it broken: > > Case-Mapping Rule: Uppercase and titlecase characters MUST NOT be > mapped to their lowercase equivalents. > You are reading the other section, for the case-sensitive profile (in SASLprep, section 6.1), which is absolutely not forbidden for user names, and already an established practice for many decades (email addresses, local user names in Windows...), and this very new RFC will not change this practice any time soon. -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Wed Oct 7 11:10:14 2015 From: doug at ewellic.org (Doug Ewell) Date: Wed, 07 Oct 2015 09:10:14 -0700 Subject: Unicode in passwords Message-ID: <20151007091014.665a7a7059d7ee80bb4d670165c8327d.79ab53c881.wbe@email03.secureserver.net> Philippe Verdy wrote: > This is a demonstration that using case differences to add more > combinations in short passwords is a bad design.
But more and more organizations and banks and supermarket rewards programs are demanding it, along with "at least one digit" and "at least one 'special' character" and "at least N characters in length" and "must change every N days" -- regardless of what Bruce Schneier or anyone else says. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From doug at ewellic.org Wed Oct 7 11:11:52 2015 From: doug at ewellic.org (Doug Ewell) Date: Wed, 07 Oct 2015 09:11:52 -0700 Subject: Why Nothing Ever Goes Away Message-ID: <20151007091152.665a7a7059d7ee80bb4d670165c8327d.4dce1a05ee.wbe@email03.secureserver.net> Richard Wordingham wrote: > Browsers dropping UTF-7 support was a major inconvenience. Especially when the real problem with cross-site scripting was *auto-detection* of UTF-7. Requiring users to override the encoding and select UTF-7 manually would have solved most problems. Dropping UTF-7 entirely was not necessary. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From verdy_p at wanadoo.fr Thu Oct 8 11:14:38 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 8 Oct 2015 18:14:38 +0200 Subject: Unicode in passwords In-Reply-To: <20151007091014.665a7a7059d7ee80bb4d670165c8327d.79ab53c881.wbe@email03.secureserver.net> References: <20151007091014.665a7a7059d7ee80bb4d670165c8327d.79ab53c881.wbe@email03.secureserver.net> Message-ID: They demand such passwords only for their web services, which are accessed by web browsers, not for booting devices. Being on the web, the protocols are based on HTML and web browsers. As the web is now Unicode, with UTF-8 in a vast majority of contents, those web services are already UTF-8 ready (it is also a requirement on those web interfaces used by banks to have JavaScript support). So restricting those web passwords to ASCII only is a bad choice: to extend the usable charset, forcing the inclusion of ASCII capitals and punctuation is not sufficient. There is certainly a better way to extend the set to include as well all characters supported by browser input methods for the targeted languages, and that are still easy to type on most devices (this means not adding characters not supported by old versions of Windows or by basic smartphones). With an extended repertoire (not restricted to ASCII, thanks to UTF-8 on the web), password lengths could remain relatively short and easy to type and remember (the alternative using passphrases also requires being able to type words in the local language in its basic orthography; some compatibility normalization, as well as case folding, will be helpful to provide good interoperability across client devices, where typing letters with mixed case is frequently very inconvenient on touch devices, as well as for people with disabilities who type with only one finger). Still, there are many banks whose passwords are limited to only basic decimal digits, and limited to at most 8 of them. As this is not enough, the input forms will also request other numbers that people frequently cannot easily remember. Others will use two-factor authentication using mobile phones and confirmation codes sent by SMS, or will send an additional code in physical letters; they will take footprints of the browser or IMEI code of the smartphone used, with preapproval required before trusting devices, or verify the number of a physical credit card by processing a €0.00 online payment with it, and some pseudo "secret" questions (social security number, identity card/passport/driver licence number...)
but some are very weak and ask for something that is rarely secret, such as the birth date (Facebook initially published it by default to anyone without asking when you created the account; now it is private by default, except for the birthday application enabled by default and notifying all "friends". But too late for those who had created their account years ago, it is now public for eternity even if it can be hidden on the current version of profiles... similar fake secrets are names of family members and pets, as all the info is) 2015-10-07 18:10 GMT+02:00 Doug Ewell : > Philippe Verdy wrote: > > > This is a demonstration that using case differences to add more > > combinations in short passwords is a bad design. > > But more and more organizations and banks and supermarket rewards > programs are demanding it, along with "at least one digit" and "at least > one 'special' character" and "at least N characters in length" and "must > change every N days" -- regardless of what Bruce Schneier or anyone else > says. > > -- > Doug Ewell | http://ewellic.org | Thornton, CO ???? > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Thu Oct 8 15:09:07 2015 From: doug at ewellic.org (Doug Ewell) Date: Thu, 08 Oct 2015 13:09:07 -0700 Subject: Unicode in passwords Message-ID: <20151008130907.665a7a7059d7ee80bb4d670165c8327d.df792f40da.wbe@email03.secureserver.net> Philippe Verdy wrote: > They demand such passwords only for their web services, which are > accessed by web browsers. Not for booting devices. My company enforces all of the password restrictions I listed, as well as "ASCII only," for access both to individual PCs and to the company network. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From jsoconner at gmail.com Fri Oct 9 12:26:02 2015 From: jsoconner at gmail.com (John O'Conner) Date: Fri, 09 Oct 2015 17:26:02 +0000 Subject: Unicode Digest, Vol 22, Issue 9 In-Reply-To: References: Message-ID: As a response to all the issues I've learned from everyone, my immediate recommendation for my company's current policy is to constrain our passwords to printable ASCII now, in order to buy time to learn more about all the issues that you and others have mentioned. I appreciate all the feedback on the topic. Clearly there are issues to consider, and I'll make an effort to gather up all the issues everyone mentioned into a single, consolidated list. On Fri, Oct 9, 2015 at 10:00 AM wrote: > > From: Doug Ewell > To: verdy_p at wanadoo.fr > Cc: Unicode Mailing List > Date: Thu, 08 Oct 2015 13:09:07 -0700 > Subject: RE: Unicode in passwords > Philippe Verdy wrote: > > > They demand such passwords only for their web services, which are > > accessed by web browsers. Not for booting devices. > > My company enforces all of the password restrictions I listed, as well > as "ASCII only," for access both to individual PCs and to the company > network. > > -- > Doug Ewell | http://ewellic.org | Thornton, CO ???? > > > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From lists+unicode at seantek.com Fri Oct 9 13:05:54 2015 From: lists+unicode at seantek.com (Sean Leonard) Date: Fri, 9 Oct 2015 11:05:54 -0700 Subject: Why Nothing Ever Goes Away In-Reply-To: <5613E9DF.8020406@ix.netcom.com> References: <20151005082457.665a7a7059d7ee80bb4d670165c8327d.c32c0ba233.wbe@email03.secureserver.net> <5612D05D.7000407@att.net> <5613BD66.3080707@seantek.com> <5613E9DF.8020406@ix.netcom.com> Message-ID: <56180202.4010905@seantek.com> Satisfactory answers, thank you very much. Going back to doing more research. (Silence does not imply abandoning the C1 Control Pictures project; just a lot to synthesize.) Regarding the three points U+0080, U+0081, and U+0099: the fact that Unicode defers mostly to ISO 6429 and other standards before its time (e.g., ANSI X3.32 / ISO 2047 / ECMA-17) means that it is not particularly urgent that those code points get Unicode names. I also do not find that their lack of definition precludes pictorial representations. In the current U+2400 block, the Standard says: "The diagonal lettering glyphs are only exemplary; alternate representations may be, and often are used in the visible display of control codes" (Section 22.7). I am now in possession of a copy of ANSI X3.32-1973 and ECMA-17:1968 (the latter is available on ECMA's website). I find it worthwhile to point out that the Transmission Controls and Format Effectors were not standardized by the time of ECMA-17:1968, but the symbols are the same nonetheless. ANSI X3.32-1973 has the standardized control names for those characters. Sean On 10/6/2015 6:57 AM, Philippe Verdy wrote: > > 2015-10-06 14:24 GMT+02:00 Sean Leonard >: > > 2. The Unicode code charts are (deliberately) vague about > U+0080, U+0081, > and U+0099. All other C1 control codes have aliases to the ISO > 6429 > set of control functions, but in ISO 6429, those three control > codes don't > have any assigned functions (or names). > > > On 10/5/2015 3:57 PM, Philippe Verdy wrote: > >> Also the aliases for C1 controls were formally registered in > 1983 only for the two ranges U+0084..U+0097 and U+009B..U+009F > for ISO 6429. > > > If I may, I would appreciate another history lesson: > In ISO 2022 / 6429 land, it is apparent that the C1 controls are > mainly aliases for ESC 4/0 - 5/15. ( @ through _ ) This might vary > depending on what is loaded into the C1 register, but overall, it > just seems like saving one byte. > > Why was C1 invented in the first place? > > > Look for the history of EBCDIC and its adaptation/conversion with > ASCII-compatible encodings: round-trip conversion was needed (using > only a simple reordering of byte values, with no duplicates). EBCDIC > has used many controls that were not part of C0 and were kept in the > C1 set. Ignore the 7-bit compatibility encoding using pairs, they were > only needed for ISO 2022, but ISO 6429 defines a profile where those > longer sequences are not needed and even forbidden in 8-bit contexts > or in contexts where aliases are undesirable and invalidated, such as > security environments. > > With your thoughts, I would conclude that assigning characters in the > G1 set was also a duplicate, because it is reachable with a C0 > "shifting" control + a position of the G0 set. In that case ISO 8859-1 > or Windows 1252 was also an unneeded duplication! And we would live > today in a 7-bit only world. >
The 7-bit encoding using ESC is > just a hack to make them fit in 7-bit and it only works where the ESC > control is assumed to play this function according to ISO 2022, ISO > 6429, or other similar old 7-bit protocols such as Videotext (which > was widely used in France with the free "Minitel" terminal, long > before the introduction of the Internet to the general public around > 1992-1995). > > Today Videotext is definitely dead (the old call numbers for this slow > service are now definitely defunct, the Minitels are recycled wastes, > they stopped being distributed and replaced by applications on PC > connected to the Internet, but now all the old services are directly > on the internet and none of them use 7-bit encodings for their HTML > pages, or their mobile applications). France has also definitely > abandoned its old French version of ISO 646, there are no longer any > printer supporting versions of ISO 646 other than ASCII, but they > still support various 8-bit encodings. > > 7-bit encodings are things of the past (they were only justified at > times where communication links were slow and generated lots of > transmission errors, and the only implemented mecanism to check them > was to use a single parity bit per character. Today we transmit long > datagrams and prefer using checks codes for the whole (such as CRC, or > autocorrecting codes). 8-bit encodings are much easier and faster to > process for transmitting not just text but also binary data. > > Let's forget the 7-bit world definitely. We have also abandonned the > old UTF-7 in Unicode ! I've not seen it used anywhere except in a few > old emails sent at end of the 90's, because many mail servers were > still not 8-bit clean and silently transformed non-ASCII bytes in > unpredictable ways or using unspecified encodings, or just siltently > dropped the high bit, assuming it was just a parity bit : at that > time, emails were not sent with SMTP, but with the old UUCP protocol > and could take weeks to be delivered to the final recipient, as there > was still no global routing infrastructure and many hops were > necessary via non-permanent modem links. My opinion of UTF-7 is that > it was just a temporary and experimental solution to help system > admins and developers adopt the new UCS, including for their old 7-bit > environments. On 10/6/2015 8:33 AM, Asmus Freytag (t) wrote: > On 10/6/2015 5:24 AM, Sean Leonard wrote: >> And, why did Unicode deem it necessary to replicate the C1 block at >> 0x80-0x9F, when all of the control characters (codes) were equally >> reachable via ESC 4/0 - 5/15? I understand why it is desirable to >> align U+0000 - U+007F with ASCII, and maybe even U+0000 - U+00FF with >> Latin-1 (ISO-8859-1). But maybe Windows-1252, MacRoman, and all the >> other non-ISO-standardized 8-bit encodings got this much right: >> duplicating control codes is basically a waste of very precious >> character code real estate > > Because Unicode aligns with ISO 8859-1, so that transcoding from that > was a simple zero-fill to 16 bits. > > 8859-1 was the most widely used single byte (full 8-bit) ISO standard > at the time, and making that transition easy was beneficial, both > practically and politically. > > Vendor standards all disagreed on the upper range, and it would not > have been feasible to single out any of them. Nobody wanted to follow > the IBM code page 437 (then still the most widely used single byte > vendor standard). 
> > > Note, that by "then" I refer to dates earlier than the dates of the > final drafts, because may of those decisions date back to earlier > periods where the drafts were first developed.Also, the overloading of > 0x80-0xFF by Windows did not happen all at once, earlier versions had > left much of that space open, but then people realized that as long as > you were still limited to 8 bits, throwing away 32 codes was an issue. > > Now, for Unicode, 32 out of 64K values (initially) or 1114112 (now), > don't matter, so being "clean" didn't cost much. (Note that even for > UTF-8, there's no special benefit of a value being inside that second > range of 128 codes. > > Finally, even if the range had not been dedicated to C1, the 32 codes > would have had to be given space, because the translation into ESC > sequences is not universal, so, in transcoding data you needed to have > a way to retain the difference between the raw code and the ESC > sequence, or your round-trip would not be lossless. > > A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From lists+unicode at seantek.com Fri Oct 9 13:32:38 2015 From: lists+unicode at seantek.com (Sean Leonard) Date: Fri, 9 Oct 2015 11:32:38 -0700 Subject: Pictorial Representations of BS and DEL Message-ID: <56180846.3030504@seantek.com> Hello: As we continue to riff on the history of character encodings, I am searching for the most accurate standards-based pictorial representations of BS (U+0008) and DEL (U+007F) in Unicode. ECMA-17:1968 and ANSI X3.32-1973 depict U+0008 as an arrow pointing from the bottom-right to the top-left, slightly arced upwards. They depict U+007F as a filled box symbol comprised of five diagonal slashes oriented from bottom-left to top-right, with no border. All of the other control pictures (from those standards) have specific code point assignments in Unicode. Whether those glyphs are used for U+2400 et. seq. is, of course, up to the font designer. But it's nice to know they are there as fallbacks. What are the most accurate pictorial representations in the existing Unicode Standard for BS and DEL? Frankly there are so many arrows that it's hard to make heads or tails (pun intended) out of which one is the best. However, in all the arrows I looked for, I did not see one that was a sufficiently close match. There is another standard governing these sorts of things, namely ISO 9995. I would not be surprised if it has something to say about Backspace, as the Backspace keytop is standardized to look like: Backspace <--- Note that there are still many left-pointing arrows in the Unicode standard, so which Unicode left-pointing arrow is the closest one to the one typically printed on a keytop? Regarding DEL: ? U+25A8 SQUARE WITH UPPER RIGHT TO LOWER LEFT FILL is close, but it has a black box border. ? U+2425 SYMBOL FOR DELETE FORM TWO is depicted as three slashes in the middle, not five slashes, and is from ISO 9995-7. It is a symbol for "undoable delete". I presume that the omission of the fourth and fifth slashes is intentional. ? U+2302 HOUSE is the corresponding grapheme in Code Page 437, and so many people would probably be familiar with using this to depict U+007F. But we are trying to bury Code Page 437. (Note: this is relevant to the C1 control character CCH U+0094, which is intended to eliminate ambiguity about the meaning of BS. Arguably ? 
U+232B ERASE TO THE LEFT is the most appropriate for CCH, but it could also be used for BS, and that is the problem, because BS is more nebulous but far, far more ubiquitous than CCH.) Sean From bugraaydin1999 at gmail.com Sat Oct 10 07:59:14 2015 From: bugraaydin1999 at gmail.com (patapatachakapon .) Date: Sat, 10 Oct 2015 15:59:14 +0300 Subject: Rights to the Emoji Message-ID: Hello, I work for a small company in Turkey. We would like to import/sell products that have pictures of Emoji on them (such as keychains, cups, etc.) here in Turkey. The Emoji we would like to use on our products are the ones that are titled Native on the chart that I've attached to this email. I would like to know whether or not it's required to buy the rights to these Emoji. Are Emoji copyrighted, or can they be used by anyone for design purposes? Thanks so much in advance! -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: emoji.jpg Type: image/jpeg Size: 84967 bytes Desc: not available URL: From magnus at bodin.org Sat Oct 10 12:25:59 2015 From: magnus at bodin.org (=?UTF-8?Q?Magnus_Bodin_=E2=98=80?=) Date: Sat, 10 Oct 2015 19:25:59 +0200 Subject: Rights to the Emoji In-Reply-To: References: Message-ID: This might shed some light: http://words.steveklabnik.com/emoji-licensing On Sat, Oct 10, 2015 at 2:59 PM, patapatachakapon . < bugraaydin1999 at gmail.com> wrote: > Hello, > > I work for a small company in Turkey. We would like to import/sell > products that have pictures of Emoji on them (such as keychains, cups, etc.) > here in Turkey. The Emoji we would like to use on our products are the > ones that are titled Native on the chart that I've attached to this email. > I would like to know whether or not it's required to buy the rights to these > Emoji. Are Emoji copyrighted, or can they be used by anyone for design > purposes? > > Thanks so much in advance! > -------------- next part -------------- An HTML attachment was scrubbed... URL: From magnus at bodin.org Sat Oct 10 12:28:55 2015 From: magnus at bodin.org (=?UTF-8?Q?Magnus_Bodin_=E2=98=80?=) Date: Sat, 10 Oct 2015 19:28:55 +0200 Subject: Rights to the Emoji In-Reply-To: References: Message-ID: Here is an emoji that is CC-licensed. https://signalvnoise.com/posts/3395-neckbeard Let us know when you sell neckbeard pillows. On Sat, Oct 10, 2015 at 7:25 PM, Magnus Bodin ☀ wrote: > This might shed some light: > > http://words.steveklabnik.com/emoji-licensing > > On Sat, Oct 10, 2015 at 2:59 PM, patapatachakapon . < > bugraaydin1999 at gmail.com> wrote: > >> Hello, >> >> I work for a small company in Turkey. We would like to import/sell >> products that have pictures of Emoji on them (such as keychains, cups, etc.) >> here in Turkey. The Emoji we would like to use on our products are the >> ones that are titled Native on the chart that I've attached to this email. >> I would like to know whether or not it's required to buy the rights to these >> Emoji. Are Emoji copyrighted, or can they be used by anyone for design >> purposes? >> >> Thanks so much in advance! >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Sat Oct 10 05:14:50 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Sat, 10 Oct 2015 11:14:50 +0100 (BST) Subject: How can my research become implemented in a standardized manner?
In-Reply-To: <33365178.33329.1444395933996.JavaMail.defaultUser@defaultHost> References: <33365178.33329.1444395933996.JavaMail.defaultUser@defaultHost> Message-ID: <6041047.15822.1444472090118.JavaMail.defaultUser@defaultHost> Please note that I am on moderated post, so if this post does get sent to the Unicode mailing list it will be because the moderator has kindly agreed to it being circulated. I have recently made significant progress with my research in communication through the language barrier. The capabilities are greatly improved. On 7 October 2015 I submitted a document, hoping that it would become included in the Unicode Document Register. I have been informed that a group of people have examined the document and determined that it is out of scope for UTC. I am not seeking to question that decision. As an independent researcher, not representing an organization, nor in fact employed by any organization at all, I am trying to get the system standardized as an international standard. I feel that trying to produce first a widely-used system using a Private Use Area encoding is not a realistic practical goal, and even if it were practical, the result would be lots of legacy data. I feel that to become successful the system needs standardization and implementation to go forward together. So what to do? More generally, how are the format and the encoding of tagspaces to be carried out in the future? The document is available on the web at the present time in two places. There is a file available for download as an attachment in a forum post of 8 October 2015 in the High-Logic Gallery forum. Adding a direct link to the post is not at present possible using the particular email system that I am using. There is direct access in my family webspace. www.users.globalnet.co.uk/~ngo/two_tagspaces.pdf In addition I have deposited the document at the British Library. William Overington 10 October 2015 -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Sun Oct 11 16:20:34 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 11 Oct 2015 22:20:34 +0100 Subject: Counting Codepoints Message-ID: <20151011222034.2a1348ae@JRWUBU2> Is the number of codepoints in a UTF-16 string well defined? For example, which of the following two statements are true? (a) The ill-formed three code-unit Unicode 16-bit string <0xDC00, 0xD800, 0xDC20> contains two codepoints, U+DC00 and U+10020. (b) The ill-formed three code-unit Unicode 16-bit string <0xDC00, 0xD800, 0xDC20> contains three codepoints, U+DC00, U+D800 and U+DC20. Statement (a) is probably more useful, but I couldn't find anything to rule that statement (b) is false. Richard. From c933103 at gmail.com Sun Oct 11 18:03:18 2015 From: c933103 at gmail.com (gfb hjjhjh) Date: Mon, 12 Oct 2015 07:03:18 +0800 Subject: How can my research become implemented in a standardized manner? In-Reply-To: <6041047.15822.1444472090118.JavaMail.defaultUser@defaultHost> References: <33365178.33329.1444395933996.JavaMail.defaultUser@defaultHost> <6041047.15822.1444472090118.JavaMail.defaultUser@defaultHost> Message-ID: I believe using markup languages would be a better approach than getting some new characters. 2015/10/12 4:27 "William_J_G Overington" : > Please note that I am on moderated post, so if this post does get sent to > the Unicode mailing list it will be because the moderator has kindly agreed > to it being circulated.
> > I have recently made significant progress with my research in > communication through the language barrier. The capabilities are > greatly improved. > > On 7 October 2015 I submitted a document, hoping that it would become > included in the Unicode Document Register. > > I have been informed that a group of people have examined the document and > determined that it is out of scope for UTC. > > I am not seeking to question that decision. > > As an independent researcher, not representing an organization, nor in > fact employed by any organization at all, I am trying to get the system > standardized as an international standard. > > I feel that trying to produce first a widely-used system using a Private > Use Area encoding is not a realistic practical goal and even if it were > practical the result would be lots of legacy data. I feel that to become > successful the system needs standardization and implementation to go > forward together. > > So what to do? > > More generally, how are the format and the encoding of tagspaces to be > carried out in the future? > > The document is available on the web at the present time in two places. > > There is a file available for download as an attachment in a forum post of > 8 October 2015 in the High-Logic Gallery forum. > > Adding a direct link to the post is not at present possible using the > particular email system that I am using. > > There is direct access in my family webspace. > > www.users.globalnet.co.uk/~ngo/two_tagspaces.pdf > > In addition I have deposited the document at the British Library. > > William Overington > > 10 October 2015 > > > -------------- next part -------------- An HTML attachment was scrubbed... URL:

From verdy_p at wanadoo.fr Sun Oct 11 18:08:23 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 12 Oct 2015 01:08:23 +0200
Subject: Counting Codepoints
In-Reply-To: <20151011222034.2a1348ae@JRWUBU2>
References: <20151011222034.2a1348ae@JRWUBU2>
Message-ID:

Both statements are false. The ill-formed sequence <0xDC00, 0xD800, 0xDC20> is invalid as UTF-16, because it contains one code unit that is invalid in UTF-16 (the unpaired surrogate 0xDC00), followed by a single code point (U+10020). The three surrogate code points U+DC00, U+D800 and U+DC20 are NOT encoded (as they are not representable in valid UTF-16).

The number of codepoints in a **valid** UTF-16 string is perfectly well defined. If the encoded string is not valid UTF-16, then the number of codepoints in it is NOT defined (whether the invalid code units are dropped or replaced, and the number of replacement codepoints, can vary depending on the implementation; an implementation can also consider the whole string invalid and return no code points at all, or stop returning code points after the first error encountered and drop all the rest, or substitute all the rest with a single replacement character). Only the number of 16-bit code units is defined (this number does not depend on UTF-16 validity).

2015-10-11 23:20 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > Is the number of codepoints in a UTF-16 string well defined? > > For example, which of the following two statements are true? > > (a) The ill-formed three code-unit Unicode 16-bit string <0xDC00, > 0xD800, 0xDC20> contains two codepoints, U+DC00 and U+10020.
> > Statement (a) is probably more useful, but I couldn't find anything to > rule that statement (b) is false. > > Richard. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Sun Oct 11 18:32:28 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 12 Oct 2015 01:32:28 +0200 Subject: How can my research become implemented in a standardized manner? In-Reply-To: References: <33365178.33329.1444395933996.JavaMail.defaultUser@defaultHost> <6041047.15822.1444472090118.JavaMail.defaultUser@defaultHost> Message-ID: In fact this is not just inventing new characters, all this personal research is about inventing a new human language as well ! This cannot be done alone without people interested in communicatiung in that language. This is also more than new characters for the orthography, it is also about creating a grammar, and defining usages. All this work will not succeed without first developping a glossary (later a dictionnary, not necessarily bilingual) and educational supports, and opening it to discussions and evolutions. Consider the hard work that was done for creating Esperanto (even if it did not require inventing a new script, some characters were invented using uncommon combinations of existing Latin letters and diacritics), this is a very long way before the communciation can become really useful and starts dissimenating and being used to create real text with it. What is strange with that language for now is that it is only meant to be read, but it has no phonology at all. This is then very far from a human language (whose primary support has always been oral first before being written). Without the oral form, the language will not succeed * Consider Esperanto, it also has an oral form, more or less based on Polish and German phonologies, even if there are variable accents, but this is more or less stabilized by using a formal phonology, simplifying the actual phonetics for minimal mutual understanding by educated differenciation of groups of related phonems). * Consider Emojis: they basically represent basic nouns in Japanese or English, and they are more or less translatable. They also include a few some "adjectives" (e.g. skin color), and a basic syntax for them. There are some forms of compound nouns (e.g. FAMILY or COUPLE) linked by ZWJ rather than hyphens but their complete meaning is based on their component nouns (MAN, WOMAN, BOY, GIRL). They are successful because they adequately represent common nouns or expressions in many languages with more or less equal meaning, so they are easily read orally. 2015-10-12 1:03 GMT+02:00 gfb hjjhjh : > I believe using markup languages would be a better approach than getting > some new character. > 2015/10/12 4:27 "William_J_G Overington" : > > Please note that I am on moderated post, so if this post does get sent to >> the Unicode mailing list it will be because the moderator has kindly agreed >> to it being circulated. >> >> I have recently made significant progress with my research in >> communication through the language barrier. The capabilities are >> greatly improved. >> >> On 7 October 2015 I submitted a document, hoping that it would become >> included in the Unicode Document Register. >> >> I have been informed that a group of people have examined the document >> and determined that it is out of scope for UTC. >> >> I am not seeking to question that decision. 
>> >> As an independent researcher, not representing an organization, nor in >> fact employed by any organization at all, I am trying to get the system >> standardized as an international standard. >> >> I feel that trying to produce first a widely-used system using a Private >> Use Area encoding is not a realistic practical goal and even if it were >> practical the result would be lots of legacy data. I feel that to become >> successful the system needs standardization and implementation to go >> forward together. >> >> So what to do? >> >> More generally, how are the format and the encoding of tagspaces to be >> carried out in the future? >> >> The document is available on the web at the present time in two places. >> >> There is a file available for download as an attachment in a forum post >> of 8 October 2015 in the High-Logic Gallery forum. >> >> Adding a direct link to the post is not at present possible using the >> particular email system that I am using. >> >> There is direct access in my family webspace. >> >> www.users.globalnet.co.uk/~ngo/two_tagspaces.pdf >> >> In addition I have deposited the document at the British Library. >> >> William Overington >> >> 10 October 2015 >> >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From shervinafshar at gmail.com Sun Oct 11 19:51:05 2015 From: shervinafshar at gmail.com (Shervin Afshar) Date: Sun, 11 Oct 2015 17:51:05 -0700 Subject: Rights to the Emoji In-Reply-To: References: Message-ID: Those listed in the column titled "Native" come from the operating system (in your case, Mac OS X) and/or browser you are viewing that page on. One can assume that the right to those belong to the entity who develops those software. A safer approach for you would be to use symbols from Emoji One[1]; if you can attribute that project on your products, you can use them for free; if you can not do that, they require that you contact them for a custom paid license [2]. Also, with the paid license you are helping a project publishing content under Creative Common license. [1]: http://emojione.com/ [2]: http://emojione.com/faq#faq5 ? Shervin On Sat, Oct 10, 2015 at 5:59 AM, patapatachakapon . < bugraaydin1999 at gmail.com> wrote: > Hello, > > I work for a small company in Turkey. We would like to import/sell > products that have pictures of Emoji on them (such as keychains, cups etc.) > , here in Turkey. The Emoji we would like to use on our products are the > ones that are titled Native on the chart that I've attached to this email. > I would like to know whether or not it's required to buy the rights these > Emoji. Are Emoji copyrighted, or can they be used by anyone for design > purposes? > > Thanks so much in advance! > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenwhistler at att.net Sun Oct 11 23:36:49 2015 From: kenwhistler at att.net (Ken Whistler) Date: Sun, 11 Oct 2015 21:36:49 -0700 Subject: Counting Codepoints In-Reply-To: <20151011222034.2a1348ae@JRWUBU2> References: <20151011222034.2a1348ae@JRWUBU2> Message-ID: <561B38E1.5070007@att.net> On 10/11/2015 2:20 PM, Richard Wordingham wrote: > Is the number of codepoints in a UTF-16 string well defined? > > For example, which of the following two statements are true? > > (a) The ill-formed three code-unit Unicode 16-bit string <0xDC00, > 0xD800, 0xDC20> contains two codepoints, U+DC00 and U+10020. 
> > (b) The ill-formed three code-unit Unicode 16-bit string <0xDC00, > 0xD800, 0xDC20> contains three codepoints, U+DC00, U+D800 and U+DC20. > > Statement (a) is probably more useful, but I couldn't find anything to > rule that statement (b) is false. I think the correct answer is probably: (c) The ill-formed three code unit Unicode 16-bit string <0xDC00, 0xD800, 0xDC20> contains one code point, U+10020 and one uninterpreted (and uninterpretable) high surrogate code unit 0xDC00. In other words, I don't think it is useful or helpful to map isolated, uninterpretable surrogate code units *to* surrogate code points. Surrogate code points are an artifact of the code architecture. They are code points in the code space which *cannot* be represented in UTF-16, by definition. Any discussion about properties for surrogate code points is a matter of designing graceful API fallback for instances which have to deal with ill-formed strings and do *something*. I don't think that should extend to treating isolated surrogate code units as having interpretable status, *as if* they were valid code points represented in the string. It might be easier to get a handle on this if folks were to ask, instead how many code points are in the ill-formed Unicode 8-bit string <0x61, 0xF4, 0x90, 0x90, 0x90, 0x61>. 6 code units, yes, but how many code points? I'd say two code points and 4 uninterpretable, ill-formed UTF-8 code units, rather than any other possible answer. Basically, you get the same kind of answer if the ill-formed string were, instead <0x61, 0xED, 0xA0, 0x80, 0x61>. Two code points and 3 uninterpretable, ill-formed UTF-8 code units. That is a better answer than trying to map 0xED 0xA0 0x80 to U+D800 and then saying, oh, that is a surrogate code *point*. --Ken From duerst at it.aoyama.ac.jp Sun Oct 11 23:58:35 2015 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Mon, 12 Oct 2015 13:58:35 +0900 Subject: Rights to the Emoji In-Reply-To: References: Message-ID: <561B3DFB.2070605@it.aoyama.ac.jp> You can also design your own version of the emoji you want to use. [I'm not a lawyer, but as far as I understand,] what's protected is the individual design, not the idea of a "donut" or "frowning face" emoji as such. Regards, Martin. On 2015/10/12 09:51, Shervin Afshar wrote: > Those listed in the column titled "Native" come from the operating system > (in your case, Mac OS X) and/or browser you are viewing that page on. One > can assume that the right to those belong to the entity who develops those > software. > > A safer approach for you would be to use symbols from Emoji One[1]; if you > can attribute that project on your products, you can use them for free; if > you can not do that, they require that you contact them for a custom paid > license [2]. > > Also, with the paid license you are helping a project publishing content > under Creative Common license. > > [1]: http://emojione.com/ > [2]: http://emojione.com/faq#faq5 > > ? Shervin > > On Sat, Oct 10, 2015 at 5:59 AM, patapatachakapon . < > bugraaydin1999 at gmail.com> wrote: > >> Hello, >> >> I work for a small company in Turkey. We would like to import/sell >> products that have pictures of Emoji on them (such as keychains, cups etc.) >> , here in Turkey. The Emoji we would like to use on our products are the >> ones that are titled Native on the chart that I've attached to this email. >> I would like to know whether or not it's required to buy the rights these >> Emoji. 
Are Emoji copyrighted, or can they be used by anyone for design >> purposes? >> >> Thanks so much in advance! >> > From mark at macchiato.com Mon Oct 12 00:46:51 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Mon, 12 Oct 2015 07:46:51 +0200 Subject: Rights to the Emoji In-Reply-To: References: Message-ID: The twitter images are open sourced, I believe. {phone} On Oct 12, 2015 02:56, "Shervin Afshar" wrote: > Those listed in the column titled "Native" come from the operating system > (in your case, Mac OS X) and/or browser you are viewing that page on. One > can assume that the right to those belong to the entity who develops those > software. > > A safer approach for you would be to use symbols from Emoji One[1]; if you > can attribute that project on your products, you can use them for free; if > you can not do that, they require that you contact them for a custom paid > license [2]. > > Also, with the paid license you are helping a project publishing content > under Creative Common license. > > [1]: http://emojione.com/ > [2]: http://emojione.com/faq#faq5 > > ? Shervin > > On Sat, Oct 10, 2015 at 5:59 AM, patapatachakapon . < > bugraaydin1999 at gmail.com> wrote: > >> Hello, >> >> I work for a small company in Turkey. We would like to import/sell >> products that have pictures of Emoji on them (such as keychains, cups etc.) >> , here in Turkey. The Emoji we would like to use on our products are the >> ones that are titled Native on the chart that I've attached to this email. >> I would like to know whether or not it's required to buy the rights these >> Emoji. Are Emoji copyrighted, or can they be used by anyone for design >> purposes? >> >> Thanks so much in advance! >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From c933103 at gmail.com Mon Oct 12 06:45:37 2015 From: c933103 at gmail.com (gfb hjjhjh) Date: Mon, 12 Oct 2015 19:45:37 +0800 Subject: How can my research become implemented in a standardized manner? In-Reply-To: <10696749.12128.1444639121696.JavaMail.defaultUser@defaultHost> References: <33365178.33329.1444395933996.JavaMail.defaultUser@defaultHost> <6041047.15822.1444472090118.JavaMail.defaultUser@defaultHost> <10696749.12128.1444639121696.JavaMail.defaultUser@defaultHost> Message-ID: This proposal is, in my opinion, similar to another discussion about giving unicode character to food allergy symbol that happened few months ago on this mailing list, which both idea want to use unicode characters to overcome language barrier, just that that proposal were about those icon while this one is about written text.If my memory is now working perfectly(or you can read through the archive for the list), back then people mentioned that to get something new into Unicode you need to first make it into an international standard and make it being used. Therefore to get a character into unicode traction is needed, and if you don't want PUA then using markup languages is something that pop out of my mind that can help develop those tractions. International standard does not necessarily mean ISO standard. And for encoding that also depend on other sources, the current system for flag would be dependent on update of operating system or font file in operating system to reflect the change in software, and that is if those system/font developers really update their files as soon as source files being changed. 
2015/10/12 16:38 "William_J_G Overington" : > > I believe using markup languages would be a better approach than getting > some new character. > > Thank you for posting. > > That would make an interesting discussion, yet is off-topic for this > thread. > > The topic for this thread is about the encoding process, not about the > merits or otherwise of the particular encoding proposal. > > The flags tagspace was encoded by reference to an existing ISO standard. > > http://www.unicode.org/L2/L2015/15145r-add-regional-ind.pdf > > Yet if the tagspace for a new proposal needs to be defined and there is no > existing ISO standard to which reference can be made, how is that tagspace > to become defined, by what process, by which committee, already existing or > new? > > Also, if the complete encoding depends on both of the encoding of a base > character into Unicode and of the encoding of a tagspace, so that both > items can be applied together by an end user, what is the infrastructure > mechanism to be so the complete encoding can take place? > > William Overington > > 12 October 2015 > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Mon Oct 12 07:42:57 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Mon, 12 Oct 2015 14:42:57 +0200 Subject: Counting Codepoints In-Reply-To: <561B38E1.5070007@att.net> References: <20151011222034.2a1348ae@JRWUBU2> <561B38E1.5070007@att.net> Message-ID: I agree with Ken on "Any discussion about properties for surrogate code points is a matter of designing graceful API fallback for instances which have to deal with ill-formed strings and do *something*.", and here's be my advice based on that. You want the code point count to reflect the same count that you would get if you were to "sanitize" the string by fixing the isolated surrogates when converting to valid UTF-16 from a a 16-bit Unicode String. Sanitizing *never* should involve deletion (for security reasons). The best practice is to replace them by FFFD, according to the guidelines in TUS Chapter 3. Constraints on Conversion Processes And you want it to reflect the same code point count that you would get in common APIs that traverse 16-bit Unicode String. And I don't know of any code point iterators that just *skip* the isolates; they are typically returned as single code points. If these are not all aligned, then all heck breaks loose: you are letting yourself in for code breakage and/or security problems. So the corresponding code point count would just return a count of 1 for an isolated surrogate. UTF-8 is gummier. I'd return according to whatever the standard practice in the programming environment for "sanitizing" output is. That could be the "maximal subpart" approach in TUS Ch. 3, or it could be an alternative approach: consistency with the approach in use is the most important feature. Mark *? Il meglio ? l?inimico del bene ?* On Mon, Oct 12, 2015 at 6:36 AM, Ken Whistler wrote: > > > On 10/11/2015 2:20 PM, Richard Wordingham wrote: > >> Is the number of codepoints in a UTF-16 string well defined? >> >> For example, which of the following two statements are true? >> >> (a) The ill-formed three code-unit Unicode 16-bit string <0xDC00, >> 0xD800, 0xDC20> contains two codepoints, U+DC00 and U+10020. >> >> (b) The ill-formed three code-unit Unicode 16-bit string <0xDC00, >> 0xD800, 0xDC20> contains three codepoints, U+DC00, U+D800 and U+DC20. 
>> >> Statement (a) is probably more useful, but I couldn't find anything to >> rule that statement (b) is false. >> > > I think the correct answer is probably: > > (c) The ill-formed three code unit Unicode 16-bit string > <0xDC00, 0xD800, 0xDC20> contains one code point, U+10020 and > one uninterpreted (and uninterpretable) high surrogate > code unit 0xDC00. > > In other words, I don't think it is useful or helpful to map isolated, > uninterpretable surrogate code units *to* surrogate code points. > Surrogate code points are an artifact of the code architecture. They > are code points in the code space which *cannot* be represented > in UTF-16, by definition. > > Any discussion about properties for surrogate code points is a > matter of designing graceful API fallback for instances which > have to deal with ill-formed strings and do *something*. I don't > think that should extend to treating isolated surrogate code > units as having interpretable status, *as if* they were valid > code points represented in the string. > > It might be easier to get a handle on this if folks were to ask, instead > how many code points are in the ill-formed Unicode 8-bit > string <0x61, 0xF4, 0x90, 0x90, 0x90, 0x61>. 6 code units, yes, > but how many code points? I'd say two code points and > 4 uninterpretable, ill-formed UTF-8 code units, rather than > any other possible answer. > > Basically, you get the same kind of answer if the ill-formed string > were, instead <0x61, 0xED, 0xA0, 0x80, 0x61>. Two code points > and 3 uninterpretable, ill-formed UTF-8 code units. That is a > better answer than trying to map 0xED 0xA0 0x80 to U+D800 > and then saying, oh, that is a surrogate code *point*. > > --Ken > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shervinafshar at gmail.com Mon Oct 12 09:07:13 2015 From: shervinafshar at gmail.com (Shervin Afshar) Date: Mon, 12 Oct 2015 07:07:13 -0700 Subject: Rights to the Emoji In-Reply-To: References: Message-ID: Twemoji are Open Source, but published under CC-BY and that license requires attribution which might be challenging in this specific use case. On Oct 11, 2015 10:46 PM, "Mark Davis ??" wrote: > The twitter images are open sourced, I believe. > > {phone} > On Oct 12, 2015 02:56, "Shervin Afshar" wrote: > >> Those listed in the column titled "Native" come from the operating system >> (in your case, Mac OS X) and/or browser you are viewing that page on. One >> can assume that the right to those belong to the entity who develops those >> software. >> >> A safer approach for you would be to use symbols from Emoji One[1]; if >> you can attribute that project on your products, you can use them for free; >> if you can not do that, they require that you contact them for a custom >> paid license [2]. >> >> Also, with the paid license you are helping a project publishing content >> under Creative Common license. >> >> [1]: http://emojione.com/ >> [2]: http://emojione.com/faq#faq5 >> >> ? Shervin >> >> On Sat, Oct 10, 2015 at 5:59 AM, patapatachakapon . < >> bugraaydin1999 at gmail.com> wrote: >> >>> Hello, >>> >>> I work for a small company in Turkey. We would like to import/sell >>> products that have pictures of Emoji on them (such as keychains, cups etc.) >>> , here in Turkey. The Emoji we would like to use on our products are the >>> ones that are titled Native on the chart that I've attached to this email. >>> I would like to know whether or not it's required to buy the rights these >>> Emoji. 
Are Emoji copyrighted, or can they be used by anyone for design >>> purposes? >>> >>> Thanks so much in advance! >>> >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From nikiselken at gmail.com Mon Oct 12 09:17:55 2015 From: nikiselken at gmail.com (Nicole Selken) Date: Mon, 12 Oct 2015 10:17:55 -0400 Subject: Rights to the Emoji In-Reply-To: References: Message-ID: I would contact Apple about it. Many Ads on TV etc... are using this Emoji set. So there must be a way to get access, or they do not care. Thanks, Niki Selken Working on: www.nikiselken.com On Mon, Oct 12, 2015 at 10:07 AM, Shervin Afshar wrote: > Twemoji are Open Source, but published under CC-BY and that license > requires attribution which might be challenging in this specific use case. > On Oct 11, 2015 10:46 PM, "Mark Davis ??" wrote: > >> The twitter images are open sourced, I believe. >> >> {phone} >> On Oct 12, 2015 02:56, "Shervin Afshar" wrote: >> >>> Those listed in the column titled "Native" come from the operating >>> system (in your case, Mac OS X) and/or browser you are viewing that page >>> on. One can assume that the right to those belong to the entity who >>> develops those software. >>> >>> A safer approach for you would be to use symbols from Emoji One[1]; if >>> you can attribute that project on your products, you can use them for free; >>> if you can not do that, they require that you contact them for a custom >>> paid license [2]. >>> >>> Also, with the paid license you are helping a project publishing content >>> under Creative Common license. >>> >>> [1]: http://emojione.com/ >>> [2]: http://emojione.com/faq#faq5 >>> >>> ? Shervin >>> >>> On Sat, Oct 10, 2015 at 5:59 AM, patapatachakapon . < >>> bugraaydin1999 at gmail.com> wrote: >>> >>>> Hello, >>>> >>>> I work for a small company in Turkey. We would like to import/sell >>>> products that have pictures of Emoji on them (such as keychains, cups etc.) >>>> , here in Turkey. The Emoji we would like to use on our products are the >>>> ones that are titled Native on the chart that I've attached to this email. >>>> I would like to know whether or not it's required to buy the rights these >>>> Emoji. Are Emoji copyrighted, or can they be used by anyone for design >>>> purposes? >>>> >>>> Thanks so much in advance! >>>> >>> >>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Mon Oct 12 03:38:41 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Mon, 12 Oct 2015 09:38:41 +0100 (BST) Subject: How can my research become implemented in a standardized manner? In-Reply-To: References: <33365178.33329.1444395933996.JavaMail.defaultUser@defaultHost> <6041047.15822.1444472090118.JavaMail.defaultUser@defaultHost> Message-ID: <10696749.12128.1444639121696.JavaMail.defaultUser@defaultHost> > I believe using markup languages would be a better approach than getting some new character. Thank you for posting. That would make an interesting discussion, yet is off-topic for this thread. The topic for this thread is about the encoding process, not about the merits or otherwise of the particular encoding proposal. The flags tagspace was encoded by reference to an existing ISO standard. 
http://www.unicode.org/L2/L2015/15145r-add-regional-ind.pdf Yet if the tagspace for a new proposal needs to be defined and there is no existing ISO standard to which reference can be made, how is that tagspace to become defined, by what process, by which committee, already existing or new? Also, if the complete encoding depends on both of the encoding of a base character into Unicode and of the encoding of a tagspace, so that both items can be applied together by an end user, what is the infrastructure mechanism to be so the complete encoding can take place? William Overington 12 October 2015 -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Mon Oct 12 03:48:31 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Mon, 12 Oct 2015 09:48:31 +0100 (BST) Subject: How can my research become implemented in a standardized manner? In-Reply-To: References: <33365178.33329.1444395933996.JavaMail.defaultUser@defaultHost> <6041047.15822.1444472090118.JavaMail.defaultUser@defaultHost> Message-ID: <19969314.13377.1444639711548.JavaMail.defaultUser@defaultHost> Bonjour Philippe Thank you for posting. > In fact this is not just inventing new characters, all this personal research is about inventing a new human language as well ! Actually it is not. An end user would only need to use his or her own language using cascading menus. Everything else would be automated by software. However, whilst that would make an interesting discussion and I have answered, it is off-topic for this thread. The topic for this thread is about the encoding process, not about the merits or otherwise of the particular encoding proposal. The flags tagspace was encoded by reference to an existing ISO standard. http://www.unicode.org/L2/L2015/15145r-add-regional-ind.pdf Yet if the tagspace for a new proposal needs to be defined and there is no existing ISO standard to which reference can be made, how is that tagspace to become defined, by what process, by which committee, already existing or new? Also, if the complete encoding depends on both of the encoding of a base character into Unicode and of the encoding of a tagspace, so that both items can be applied together by an end user, what is the infrastructure mechanism to be so the complete encoding can take place? William Overington 12 October 2015 -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Mon Oct 12 10:29:13 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 12 Oct 2015 17:29:13 +0200 Subject: Counting Codepoints In-Reply-To: References: <20151011222034.2a1348ae@JRWUBU2> <561B38E1.5070007@att.net> Message-ID: 2015-10-12 14:42 GMT+02:00 Mark Davis ?? : > If these are not all aligned, then all heck breaks loose: you are letting > yourself in for code breakage and/or security problems. > > So the corresponding code point count would just return a count of 1 for > an isolated surrogate. > But the behavior in this case is absolutely not defined, and applications are free to do what they want when they encounter them. There's not even any warranty that any further (correctly encoded) code point will be returned, even if a replacement character like U+FFFE is returned, it could replace all the rest. So the count of 1 is possible for the first isolated surrogate but all the rest count count as 0 as well, or all the further characters could be replaced by U+FFFE independantly of what they initially represented. 
This would also be a "sanitized" result.

TUS gives freedom of choice to applications. There is absolutely no guarantee that all possible "sanitized" results will be the same for all applications, and TUS does not even mandate which replacement character to use (not necessarily U+FFFE; it could just as well be an ASCII '?' or a C0 control, when the result is further processed by an application converting it to some legacy 7-bit or 8-bit charset).

My opinion is that the only really safe result is to not return any count of code points but instead throw an error. Counting code points with a function returning an integer is only valid if the UTF-16 input is actually a valid representation of code points: an application using that integer could allocate a processing buffer of that size and expect to read exactly that number of code points into it, leaving some positions in that buffer uninitialized otherwise; or it could assume that the input was left untouched and then get an unexpected mismatch of a digital signature.

If your code-point-counting function returns an integer and counts each lone surrogate as 1, it assumes that exactly one code point will be returned for each lone surrogate, and it should document that clearly, meaning that the result is only valid if it matches the behaviour of the actual input scanner. In that case the function will never fail and throw an exception. But between two implementations the result of the scanner could still be different, because the replacement character is not specified. If that "sanitized" result string is then used to generate a URI, the URI is also unpredictable and will vary between implementations, as will its effective length. If it is used to generate an identifier granting some new access, such as a user name, several new user names could be generated from the same input.

So in all cases using replacements will also create security problems. This will not happen if you don't return any result but throw an exception (the counting function should document this exception so that it is not unexpectedly thrown and left unhandled, causing the program to abort prematurely in an unsafe state, including losing other data or leaving a transaction elsewhere in an incoherent state).

For all programs taking some standard UTF input, the input scanner or processing functions MUST be prepared to handle the encoding error exception, which is a result to be expected just as much as the return of a value or the execution of some code! Sanitization is possible, but it is not described in the standard, and there are several conflicting ways of doing it; it should be a separate subprocess, documented separately.

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From verdy_p at wanadoo.fr Mon Oct 12 10:33:07 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 12 Oct 2015 17:33:07 +0200
Subject: Counting Codepoints
In-Reply-To:
References: <20151011222034.2a1348ae@JRWUBU2> <561B38E1.5070007@att.net>
Message-ID:

Replace U+FFFE by U+FFFD in my message (but there are applications that also prefer using non-characters for those replacements; this is an additional alternative, as U+FFFE has a valid representation as well in UTF-16). U+FFFD is not the only possible replacement, even if it is the recommended one (by a "best practice", which is not a "requirement" for conformance purposes).
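To make the two competing policies in this thread concrete — counting each lone surrogate as one (replaceable) code point, as Mark suggests, versus refusing to return a count at all, as argued above — here is a minimal Java sketch. Both method names are illustrative assumptions, not from any standard API:

static int countReplacing(String s) {
    int count = 0;
    for (int i = 0; i < s.length(); ) {
        int cp = s.codePointAt(i);     // a lone surrogate is returned as itself
        i += Character.charCount(cp);  // advances by 1 or 2 code units
        count++;                       // each lone surrogate counts as one replacement
    }
    return count;
}

static int countStrict(String s) {
    int count = 0;
    for (int i = 0; i < s.length(); ) {
        int cp = s.codePointAt(i);
        if (cp >= 0xD800 && cp <= 0xDFFF)  // unpaired surrogate: not valid UTF-16
            throw new IllegalArgumentException("ill-formed UTF-16 at index " + i);
        i += Character.charCount(cp);
        count++;
    }
    return count;
}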
2015-10-12 17:29 GMT+02:00 Philippe Verdy : > 2015-10-12 14:42 GMT+02:00 Mark Davis ?? : > >> If these are not all aligned, then all heck breaks loose: you are letting >> yourself in for code breakage and/or security problems. >> >> So the corresponding code point count would just return a count of 1 for >> an isolated surrogate. >> > > But the behavior in this case is absolutely not defined, and applications > are free to do what they want when they encounter them. There's not even > any warranty that any further (correctly encoded) code point will be > returned, even if a replacement character like U+FFFE is returned, it could > replace all the rest. > > So the count of 1 is possible for the first isolated surrogate but all the > rest count count as 0 as well, or all the further characters could be > replaced by U+FFFE independantly of what they initially represented. This > would also be a "sanitized" result. > > TUS gives freedom of choice in application. There's absolutely no warranty > that all possible "sanitized" results will be the same for all > applications, and TUS does not even mandate which replacement character to > use (not necessarily U+FFFE, it could as well be an ASCII '?' character or > a C0 or control, when further processed to an application > converting the result to some legacy 7-bit or 8-bit charset). > > My opinion is that the only really safe result is to not return any count > of code points but instead throw an error (counting code points and with a > function returning an integer is only valid if the UTF-16 input is actually > a valid representation of code points, you cannot return a single integer > as the application using that integer could expect to allocate some > processing buffer, and then get this exact number of code points when > reading the data into some processing buffer, and could leave initialized > some positions in that buffer, or the application could assume that the > input was left untouched and could then get an unexpected mismatch of > digital signature). > > If your function counting codepoints and returning an integer counts those > lone surrogates as 1, it assumes that exactly one codepoint will be > returned for each lone surrogate, and it should document that clearly, > meaning that the result is only valid if this matches the results of the > actual input scanner. In that case that function will never fail and throw > an exception. But between two implementations the result of the scanner > could still be different because the replacement character is not > specified. If that result "sanitized" string is then used to generate an > URI, the URI is also unpredictable and will vary between implementations, > as well as its effective length. If it is used to generate an identifier > granting some new access, such as a user name, several new user names > could be generated from the same input. > > So in all cases using replacements will also create security problems. > This will not happen if you don't return any result but throw an exception > (that counting function should document this exception so that it is not > unexpectedly thrown and left unhandled, causing the program to abort > prematurely in an unsafe state including loosing other data or transaction > elsewhere in an incoherent state). 
> > For all programs taking some standard UTF input, the input scanner or > processing functions MUST be prepared to handle the encoding error > exception, which is an result expected equally to the return of a value or > the execution of some code ! Sanitization is possible, but not described in > the standard, and there are several conflict ways of doing it, it should be > a separate subprocess documented separately. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Mon Oct 12 14:38:18 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 12 Oct 2015 20:38:18 +0100 Subject: Counting Codepoints In-Reply-To: <561B38E1.5070007@att.net> References: <20151011222034.2a1348ae@JRWUBU2> <561B38E1.5070007@att.net> Message-ID: <20151012203818.7fe468d3@JRWUBU2> On Sun, 11 Oct 2015 21:36:49 -0700 Ken Whistler wrote: > I think the correct answer is probably: > > (c) The ill-formed three code unit Unicode 16-bit string > <0xDC00, 0xD800, 0xDC20> contains one code point, U+10020 and > one uninterpreted (and uninterpretable) high surrogate > code unit 0xDC00. > > In other words, I don't think it is useful or helpful to map isolated, > uninterpretable surrogate code units *to* surrogate code points. > Surrogate code points are an artifact of the code architecture. They > are code points in the code space which *cannot* be represented > in UTF-16, by definition. > > Any discussion about properties for surrogate code points is a > matter of designing graceful API fallback for instances which > have to deal with ill-formed strings and do *something*. I don't > think that should extend to treating isolated surrogate code > units as having interpretable status, *as if* they were valid > code points represented in the string. Graceful fallback is exactly where the issue arises. Throwing an exception is not a useful answer to the question of how many code points a 'Unicode string' (not a 'UTF-16 string') contains. The question can arise when one is following an instruction to advance x codepoints; the usual presumption is that the preferred response is to advance exactly x scalar values and not advance over anything else. > It might be easier to get a handle on this if folks were to ask, > instead how many code points are in the ill-formed Unicode 8-bit > string <0x61, 0xF4, 0x90, 0x90, 0x90, 0x61>. 6 code units, yes, > but how many code points? I'd say two code points and > 4 uninterpretable, ill-formed UTF-8 code units, rather than > any other possible answer. In this case I'd say three 'somethings', and define 'something' accordingly. There are different ideas as to what a 'something' should be. Having a clear definition matters when moving backwards and forwards through a Unicode 8-bit string. > Basically, you get the same kind of answer if the ill-formed string > were, instead <0x61, 0xED, 0xA0, 0x80, 0x61>. Two code points > and 3 uninterpretable, ill-formed UTF-8 code units. That is a > better answer than trying to map 0xED 0xA0 0x80 to U+D800 > and then saying, oh, that is a surrogate code *point*. A simple scenario is a filter that takes in a single byte (or EOF) at a time and returns a scalar value, 'no character yet', 'corrupt' or 'end of text'. It is a significant complication for it to have to emit sequences of values indicating uninterpretable bytes. I've found it much easier to treat bad sequences of UTF-8 code units that are bad by reason of their length and indicated scalar value as a single entity. 
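As a concreteness check, here is a minimal Java sketch of the byte-at-a-time filter described above, under that 'single entity' policy. The class, the constants, and the convention that a byte terminating a bad sequence must be re-fed by the caller are all illustrative assumptions, not part of any standard API:

final class Utf8Filter {
    static final int NEED_MORE = -1, CORRUPT = -2, END = -3;
    private int pending;    // scalar value accumulated so far
    private int remaining;  // continuation bytes still expected
    private int min;        // smallest scalar value legal for this sequence length

    // Feed the next byte (0..255), or -1 for EOF. Returns a scalar value,
    // NEED_MORE ('no character yet'), CORRUPT, or END ('end of text').
    int feed(int b) {
        if (b < 0) {                       // EOF
            int r = (remaining == 0) ? END : CORRUPT;
            remaining = 0;
            return r;
        }
        if (remaining > 0) {               // inside a multi-byte sequence
            if ((b & 0xC0) != 0x80) {      // not a continuation byte: the whole
                remaining = 0;             // truncated sequence is one entity;
                return CORRUPT;            // the caller must re-feed b afterwards
            }
            pending = (pending << 6) | (b & 0x3F);
            if (--remaining > 0) return NEED_MORE;
            // Reject non-shortest forms, surrogates, and values above U+10FFFF,
            // reporting the complete bad sequence as a single entity.
            if (pending < min || (0xD800 <= pending && pending <= 0xDFFF)
                    || pending > 0x10FFFF) return CORRUPT;
            return pending;
        }
        if (b < 0x80) return b;            // ASCII
        if (b <= 0xBF) return CORRUPT;     // lone continuation byte
        if (b <= 0xDF) { pending = b & 0x1F; remaining = 1; min = 0x80;    return NEED_MORE; }
        if (b <= 0xEF) { pending = b & 0x0F; remaining = 2; min = 0x800;   return NEED_MORE; }
        if (b <= 0xF7) { pending = b & 0x07; remaining = 3; min = 0x10000; return NEED_MORE; }
        return CORRUPT;                    // F8..FF can start nothing valid
    }
}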
Treating bad sequences as a single entity simplifies moving forwards and backwards through strings to just detecting non-continuation bytes and limiting traversal through runs of continuation bytes. Otherwise, one must also check the following continuation byte for a valid range. For example, if one starts at position 5 in your first example, just before the second 0x61, one faces the following logic when moving back one codepoint:

1) Provisionally back up to position 1, just before 0xF4.
2) Confirm that one has skipped no more than 3 continuation bytes.
3) Confirm that at least 3 continuation bytes follow the 0xF4.
4) Examine the first continuation byte, 0x90, and realise that it is not a legal value there.
5) Change to moving back one byte, arriving at position 4, just before the last 0x90.

It gets even more complicated if one follows the "maximal subpart" approach of TUS Ch. 3.

By contrast, one can even report the bad sequences in a 21-bit extension of Unicode. For example, one could use bits 20:16 to encode the problem, e.g.:

0-16 => Valid scalar value (excludes 0xD800 to 0xDFFF)

1) Numbers that look like scalar values:

1.1) Value not a scalar value:
17 => 11xxxx (start F4 9y)
18 => 12xxxx (start F4 Ay)
19 => 13xxxx (start F4 By)
20 => Surrogate codepoint (start ED Ay or ED By) (2^11 seqq.)

1.2) Non-shortest form:
21 => 4 bytes long (start F0 8y) (image of BMP)
22 => 3 bytes long (start E0 8y or E0 9y) (2^11 seqq.)
23 => 2 bytes long (start C0 or C1) (image of ASCII)*

2) Uninterpretable sequences:
24 => Declared length 4 but actually 3 long (5 * 2^12 seqq.)
25 => Declared length 4 but actually 2 long (5 * 2^6 seqq.)
26 => Declared length 3 but actually 2 long (2^10 seqq.)
27 => Non-ASCII lone bytes (2^7 seqq.)*

* Not necessarily composed of UTF-8 code units.

In this scheme, <0x61, 0xF4, 0x90, 0x90, 0x90, 0x61> would be analysed as <U+0061, V+110410, U+0061>, and the application could decide what to do with V+110410. It'd probably just be replaced by U+FFFD.

Richard.

From richard.wordingham at ntlworld.com Mon Oct 12 15:23:09 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 12 Oct 2015 21:23:09 +0100
Subject: Counting Codepoints
In-Reply-To:
References: <20151011222034.2a1348ae@JRWUBU2> <561B38E1.5070007@att.net>
Message-ID: <20151012212309.42dd707c@JRWUBU2>

On Mon, 12 Oct 2015 17:29:13 +0200 Philippe Verdy wrote: > But between two implementations > the result of the scanner could still be different because the > replacement character is not specified. If that result "sanitized" > string is then used to generate an URI, the URI is also unpredictable > and will vary between implementations, as well as its effective > length. If it is used to generate an identifier granting some new > access, such as a user name, several new user names could be > generated from the same input.

TUS 8.0 Section 3 Requirement C10 has the following wise words in its final paragraph:

"However, such repair of mangled data is a special case, and it must not be used in circumstances where it would cause security problems."

Richard.
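As a concrete footnote to that warning: once replacement is applied, two distinct ill-formed inputs can collapse to the same repaired string, which is exactly what makes replacement unsafe for identifier generation. This small illustration relies only on the JDK String constructor's standard replacement behaviour:

byte[] a = {0x61, (byte) 0xC0, 0x61};   // 'a', one ill-formed byte, 'a'
byte[] b = {0x61, (byte) 0xC1, 0x61};   // a different ill-formed input
String sa = new String(a, java.nio.charset.StandardCharsets.UTF_8);   // "a\uFFFDa"
String sb = new String(b, java.nio.charset.StandardCharsets.UTF_8);   // "a\uFFFDa"
System.out.println(sa.equals(sb));      // true: one user name from two inputs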
From verdy_p at wanadoo.fr Mon Oct 12 17:49:29 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 13 Oct 2015 00:49:29 +0200 Subject: Counting Codepoints In-Reply-To: <20151012203818.7fe468d3@JRWUBU2> References: <20151011222034.2a1348ae@JRWUBU2> <561B38E1.5070007@att.net> <20151012203818.7fe468d3@JRWUBU2> Message-ID: 2015-10-12 21:38 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > On Sun, 11 Oct 2015 21:36:49 -0700 > Ken Whistler wrote: > > > I think the correct answer is probably: > > > > (c) The ill-formed three code unit Unicode 16-bit string > > <0xDC00, 0xD800, 0xDC20> contains one code point, U+10020 and > > one uninterpreted (and uninterpretable) high surrogate > > code unit 0xDC00. > > > > In other words, I don't think it is useful or helpful to map isolated, > > uninterpretable surrogate code units *to* surrogate code points. > > Surrogate code points are an artifact of the code architecture. They > > are code points in the code space which *cannot* be represented > > in UTF-16, by definition. > > > > Any discussion about properties for surrogate code points is a > > matter of designing graceful API fallback for instances which > > have to deal with ill-formed strings and do *something*. I don't > > think that should extend to treating isolated surrogate code > > units as having interpretable status, *as if* they were valid > > code points represented in the string. > > Graceful fallback is exactly where the issue arises. Throwing an > exception is not a useful answer to the question of how many code > points a 'Unicode string' (not a 'UTF-16 string') contains. > It really is a **useful** answer because there's actually no correct answer, unless you assume some (not clearly defined) sanitization process (removal or part of the text means you give an answer about a different text, substitution is also not clearly defined, you could remove everything after the first error encountered). If you get an invalid UTF-16 string, and caught an exception, this is a sign that it is not UTF-16, and very frequently something else. The application may want to retry with another encoding, possibly using heuristic guessers, but the heuristic will only give a *probable answer*. If this probable answer is still UTF-16, the application may or may not want to alter the input text and instruct the function to perform a specific "sanitization", but this process is NOT defined in the UTF-16 specification itself, the result will be a local-only decision, which may not match what other systems will do (other systemls may fallback to an encoding that produces no error at all such as ISO8859-1 or a default encoding of the system such as CP437. But as this wil frequently produce "mojibake", it is best to notice it, log that for later manual processing (if needed) and discard that text completely as invalid (the standard behavior for UTF-16 for conforming applications). Any sanitization will be errorprone as it will always be an heuristic, users should have some visible notification that the input was invalid, and the "correction" should not be automated unless the users really ask for it and the application offers a choice of options. The minimum being that the application should offer a visual inspection to the user for each option. But we are then completely out of scope of the UTF-16 standard itself. -------------- next part -------------- An HTML attachment was scrubbed... 
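Philippe's retry-with-another-encoding pattern can be expressed directly with the JDK's charset API: decode strictly first, and only on failure fall back to a guessed legacy charset. A hedged sketch — the choice of ISO-8859-1 as the fallback is an arbitrary assumption, as the message above notes:

import java.nio.ByteBuffer;
import java.nio.charset.*;

static String decodeWithFallback(byte[] bytes) {
    CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
    try {
        return strict.decode(ByteBuffer.wrap(bytes)).toString();
    } catch (CharacterCodingException e) {
        // Not valid UTF-8: log it for later manual inspection, then
        // retry with a legacy charset that cannot fail to decode.
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}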
URL: From petercon at microsoft.com Mon Oct 12 18:00:38 2015 From: petercon at microsoft.com (Peter Constable) Date: Mon, 12 Oct 2015 23:00:38 +0000 Subject: Rights to the Emoji In-Reply-To: <561B3DFB.2070605@it.aoyama.ac.jp> References: <561B3DFB.2070605@it.aoyama.ac.jp> Message-ID: Exactly: specific designs are subject to license terms determined by the original designer, which are liberal in some cases and not in others. But the concept of a such-and-such emoji and it's encoded representation are not an issue. Peter -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Martin J. D?rst Sent: Sunday, October 11, 2015 9:59 PM To: patapatachakapon . Cc: Shervin Afshar ; unicode at unicode.org Subject: Re: Rights to the Emoji You can also design your own version of the emoji you want to use. [I'm not a lawyer, but as far as I understand,] what's protected is the individual design, not the idea of a "donut" or "frowning face" emoji as such. Regards, Martin. On 2015/10/12 09:51, Shervin Afshar wrote: > Those listed in the column titled "Native" come from the operating > system (in your case, Mac OS X) and/or browser you are viewing that > page on. One can assume that the right to those belong to the entity > who develops those software. > > A safer approach for you would be to use symbols from Emoji One[1]; if > you can attribute that project on your products, you can use them for > free; if you can not do that, they require that you contact them for a > custom paid license [2]. > > Also, with the paid license you are helping a project publishing > content under Creative Common license. > > [1]: http://emojione.com/ > [2]: http://emojione.com/faq#faq5 > > ? Shervin > > On Sat, Oct 10, 2015 at 5:59 AM, patapatachakapon . < > bugraaydin1999 at gmail.com> wrote: > >> Hello, >> >> I work for a small company in Turkey. We would like to import/sell >> products that have pictures of Emoji on them (such as keychains, cups >> etc.) , here in Turkey. The Emoji we would like to use on our >> products are the ones that are titled Native on the chart that I've attached to this email. >> I would like to know whether or not it's required to buy the rights >> these Emoji. Are Emoji copyrighted, or can they be used by anyone for >> design purposes? >> >> Thanks so much in advance! >> > From prosfilaes at gmail.com Mon Oct 12 18:35:32 2015 From: prosfilaes at gmail.com (David Starner) Date: Mon, 12 Oct 2015 23:35:32 +0000 Subject: Counting Codepoints In-Reply-To: <20151012212309.42dd707c@JRWUBU2> References: <20151011222034.2a1348ae@JRWUBU2> <561B38E1.5070007@att.net> <20151012212309.42dd707c@JRWUBU2> Message-ID: Any system that exposes Unicode strings (not UTF-16 strings) cannot have two surrogates merge when two strings are appended. There's nothing in the Unicode standard that says that should happen for a string in an arbitrary format, and it's unreasonable behavior for a string. Thus a Unicode string simply can't be in UTF-16 format internally with unpaired surrogates; a Unicode string in a programmer opaque format must do something with broken data on input. On 1:27pm, Mon, Oct 12, 2015 Richard Wordingham < richard.wordingham at ntlworld.com> wrote: > On Mon, 12 Oct 2015 17:29:13 +0200 > Philippe Verdy wrote: > > > But between two implementations > > the result of the scanner could still be different because the > > replacement character is not specified. 
If that result "sanitized" > > string is then used to generate an URI, the URI is also unpredictable > > and will vary between implementations, as well as its effective > > length. If it is used to generate an identifier granting some new > > access, such as a user name, several new user names could be > > generated from the same input. > > TUS 8.0 Section 3 Requirement C10 has the following, wise words in its > final paragraph: > > "However, such repair of mangled data is a special case, and it must > not be used in circumstances where it would cause security problems." > > Richard. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Tue Oct 13 01:36:30 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 13 Oct 2015 07:36:30 +0100 Subject: Counting Codepoints In-Reply-To: References: <20151011222034.2a1348ae@JRWUBU2> <561B38E1.5070007@att.net> <20151012203818.7fe468d3@JRWUBU2> Message-ID: <20151013073630.7af12df6@JRWUBU2> On Tue, 13 Oct 2015 00:49:29 +0200 Philippe Verdy wrote: > 2015-10-12 21:38 GMT+02:00 Richard Wordingham < > richard.wordingham at ntlworld.com>: > > Graceful fallback is exactly where the issue arises. Throwing an > > exception is not a useful answer to the question of how many code > > points a 'Unicode string' (not a 'UTF-16 string') contains. > If you get an invalid UTF-16 string, and caught an exception, this is > a sign that it is not UTF-16, and very frequently something else. The > application may want to retry with another encoding, possibly using > heuristic guessers, but the heuristic will only give a *probable > answer*. On Mon, 12 Oct 2015 23:35:32 +0000 David Starner wrote: > Thus a Unicode string simply can't be in UTF-16 format > internally with unpaired surrogates; a Unicode string in a programmer > opaque format must do something with broken data on input. You're assuming that the source of the non-conformance is external to the program. In the case that has caused me to ask about lone surrogates, they were actually caused by a faulty character deletion function within the program itself. Despite this fault, the program remains usable - it's little worse than a word processor that insists on autocorrupting 'GHz' and 'MHz' to 'Ghz' and 'Mhz'. I presume you are expecting input of fractional characters to be buffered until there is a whole character to add to a string. For example, a MSKLC keyboard will deliver a supplementary character in two WM_CHAR messages, one for the high surrogate and one for the low surrogate. Returning to the original questions, it would seem that there is not a unique answer to the question of how many codepoints a Unicode 16-bit string contains. Rather the question must be the unwieldy one of how many scalar values and lone surrogates it contains in total. Richard. From verdy_p at wanadoo.fr Tue Oct 13 05:17:43 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 13 Oct 2015 12:17:43 +0200 Subject: Counting Codepoints In-Reply-To: <20151013073630.7af12df6@JRWUBU2> References: <20151011222034.2a1348ae@JRWUBU2> <561B38E1.5070007@att.net> <20151012203818.7fe468d3@JRWUBU2> <20151013073630.7af12df6@JRWUBU2> Message-ID: 2015-10-13 8:36 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > For > example, a MSKLC keyboard will deliver a supplementary character in > two WM_CHAR messages, one for the high surrogate and one for the low > surrogate. 
> I have not tested the actual behavior in 64-bit versions of Windows : is the message field of the WM_CHAR returned by the 64-bit version of the API still requires returning two messages and not a single one if that field has been extended to 64-bit ? In that case, no surrogates would be returned, but directly the supplementary character. But may be this has not changed so that the predefined Windows type for wide characters remains 16-bit (otherwise even in the 32-bit version of the API, a single message would have been enough with a 32-bit message data field): the "Unicode" version of the API's assume everywhere a 16-bit encoding of strings and the event message most probably uses the same size of code units. The actual behavior is also tricky as the basic layouts built with MSKLC will have its character data translated "transparently" to other "OEM" encodings according to the current input code page of the console (using one of the codepage mapping tables installed separately): the transcoder will also need to translate the 16-bit Unicode input from WM_CHAR messages into the 8-bit input stream used by the console, and this translation will need to read both surrogates at once before sending any output. Also I don't think this is specific to MSKLC drivers. A driver (not just keyboard layouts that actually contain no code but just a data structure, but also input methods using their own message loop to process and filter input events and delivering their own translated messages) built with any other tool will use the same message format. Any way, those Windows drivers cannot actually know how the editing application will finally process the two surrogates : if the application does not detect surrogates properly and chose to discard one but not the other, the driver is not at fault and it is a bug of the application. Those MSKLC drivers actually have no view on the input buffer, they process the input on the flow (but may be the a more advanced input driver with its own message processing loop could send its own messages to query the application about what is in its buffer, or to instruct it to perform some custom substring replacements/editing and update its caret position or selection). So in my view, this is not a bug of the layout drivers themselves and not even a bug of the Windows core API. The editing application (or the common interface component) has to be prepared to process both surrogates as one character, or discard lone surrogates it could see (after alerting the user with some beep message), or submit some custom replacement. It is this application or component that will need to manage its input buffer correctly. If that buffer uses 16-bit code units, deleting one position in the buffer (for example when pressing Backspace or delete) without looking at what is deleted, or performing text selection in the middle of a surrogates pair (and then blindly replacing that selection) will generate those lone surrogates in the input buffer. 
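The buffer-side fix implied by that last paragraph is small: handle Backspace by deleting a whole code point rather than a single 16-bit code unit. A minimal Java sketch (the method name is illustrative):

static void backspace(StringBuilder buf) {
    int len = buf.length();
    if (len == 0) return;
    int start = len - 1;
    if (len >= 2 && Character.isLowSurrogate(buf.charAt(len - 1))
            && Character.isHighSurrogate(buf.charAt(len - 2))) {
        start = len - 2;   // a paired surrogate: remove both halves together
    }
    buf.delete(start, len);
}

The same check, run in the forward direction, protects Delete and the endpoints of a selection.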
The same considerations also apply to Linux input drivers and GUI components, which use 8-bit encodings including UTF-8 (this is more difficult because the Linux kernel is blind to the encoding, which is defined only in the user's input locale environment): the same havoc can happen if the editing application breaks in the middle of a multibyte UTF-8 sequence, and applications must also be ready to accept arbitrary byte sequences, including ones that are not valid UTF-8 (how an application actually handles the offending bytes remains application-dependent). The same question then arises: how many code points are in the 8-bit string if it is not valid UTF-8? There will not be a unique answer, because how applications filter those errors will vary.

You'd also have the same problem with console apps using the 8-bit BIOS/DOS input emulation API, or with terminal applications listening for input from a network socket sending 8-bit data streams: the emulation protocol also needs to filter that input and detect errors when the input does not validate in the expected encoding, but how the protocol recovers after the error remains protocol-dependent, and it is not certain that the terminal emulator notifies the user of input errors; the protocol may as well interrupt the communication with an EOF event and close the channel.

In other words: as soon as there is a single UTF validation error in some input, you cannot assert anything about the content of the input as a whole.

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From mark at macchiato.com Tue Oct 13 07:08:28 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Tue, 13 Oct 2015 14:08:28 +0200 Subject: Counting Codepoints In-Reply-To: <20151013073630.7af12df6@JRWUBU2> References: <20151011222034.2a1348ae@JRWUBU2> <561B38E1.5070007@att.net> <20151012203818.7fe468d3@JRWUBU2> <20151013073630.7af12df6@JRWUBU2> Message-ID:

On Tue, Oct 13, 2015 at 8:36 AM, Richard Wordingham < richard.wordingham at ntlworld.com> wrote:

> Rather the question must be the unwieldy one of how
> many scalar values and lone surrogates it contains in total.

That may be the question in theory; in practice no programming language is going to support APIs like that. So the question is whether your original question was purely theoretical, or was about some particular language/environment.

If the latter, then looking at the behavior of related functions in that environment, like traversing a string, and counting in a way that is most consistent with their behavior, is the least likely to cause problems.

For example, Java is pretty consistent; each of the following returns 2 as the count.

    String test = "\uDC00\uD800\uDC20";
    int count = test.codePointCount(0, test.length());
    System.out.println("codePointCount:\t" + count);

    count = 0;
    int cp;
    for (int i = 0; i < test.length(); i += Character.charCount(cp)) {
        cp = test.codePointAt(i);
        count++;
    }
    System.out.println("Java 7 iteration:\t" + count);

    count = 0;
    for (int cp2 : test.codePoints().toArray()) {
        count++;
    }
    System.out.println("Java 8 iteration:\t" + count);

    // for the last, could just call:
    // count = (int) test.codePoints().count();

The isolated surrogate code unit is consistently treated as the corresponding surrogate code point, which is what anyone would reasonably expect.

Mark

-------------- next part -------------- An HTML attachment was scrubbed...
URL:

From verdy_p at wanadoo.fr Tue Oct 13 09:16:47 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 13 Oct 2015 16:16:47 +0200 Subject: Counting Codepoints In-Reply-To: References: <20151011222034.2a1348ae@JRWUBU2> <561B38E1.5070007@att.net> <20151012203818.7fe468d3@JRWUBU2> <20151013073630.7af12df6@JRWUBU2> Message-ID:

This works in Java because Java also treats surrogates as characters, even though it has additional APIs to test strings for their actual encoded length in Unicode. But outside strings, characters are just integers matching their code point value, and are not restricted to valid Unicode characters (strings, likewise, are not restricted by UTF-16 validation). Java strings are not UTF-16 strings; they are just streams of unsigned 16-bit code units, with arbitrary values in arbitrary order (so strings that are ill-formed for Unicode are still valid Java strings). When UTF-16 validity is required, your examples with loops would have to test for the presence of lone surrogates in the returned code points. Such detection is needed for implementing some protocols, e.g. to parse HTML pages and check the encoding (or guess it), after which the input stream would be parsed with another encoding, counting code points differently. For I/O, the 16-bit "char" type is actually not used; I/O is performed with signed "byte"s, which are decoded using a specific encoding that can return errors or exceptions when decoding into strings (the reverse operation can also fail).

2015-10-13 14:08 GMT+02:00 Mark Davis ☕️ :

> On Tue, Oct 13, 2015 at 8:36 AM, Richard Wordingham <
> richard.wordingham at ntlworld.com> wrote:
>
>> Rather the question must be the unwieldy one of how
>> many scalar values and lone surrogates it contains in total.
>
> That may be the question in theory; in practice no programming language
> is going to support APIs like that. So the question is whether your
> original question was purely theoretical, or was about some particular
> language/environment.
>
> If the latter, then looking at the behavior of related functions in that
> environment, like traversing a string, and counting in a way that is most
> consistent with their behavior, is the least likely to cause problems.
>
> For example, Java is pretty consistent; each of the following returns 2 as
> the count.
>
>     String test = "\uDC00\uD800\uDC20";
>     int count = test.codePointCount(0, test.length());
>     System.out.println("codePointCount:\t" + count);
>
>     count = 0;
>     int cp;
>     for (int i = 0; i < test.length(); i += Character.charCount(cp)) {
>         cp = test.codePointAt(i);
>         count++;
>     }
>     System.out.println("Java 7 iteration:\t" + count);
>
>     count = 0;
>     for (int cp2 : test.codePoints().toArray()) {
>         count++;
>     }
>     System.out.println("Java 8 iteration:\t" + count);
>
>     // for the last, could just call: count = (int) test.codePoints().count();
>
> The isolated surrogate code unit is consistently treated as the
> corresponding surrogate code point, which is what anyone would
> reasonably expect.
>
> Mark

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From doug at ewellic.org Tue Oct 13 09:46:10 2015 From: doug at ewellic.org (Doug Ewell) Date: Tue, 13 Oct 2015 07:46:10 -0700 Subject: Counting Codepoints Message-ID: <20151013074610.665a7a7059d7ee80bb4d670165c8327d.89f65761b7.wbe@email03.secureserver.net>

Richard Wordingham wrote:

> You're assuming that the source of the non-conformance is external to
> the program.
> In the case that has caused me to ask about lone
> surrogates, they were actually caused by a faulty character deletion
> function within the program itself.

I've been bemused by all this discussion about how unpaired surrogates are supposed to behave, and this comment just cleared everything up for me. We're talking about a bug.

Very well, then, the answer is that the bug should be fixed.

-- Doug Ewell | http://ewellic.org | Thornton, CO ????

From prosfilaes at gmail.com Tue Oct 13 10:23:36 2015 From: prosfilaes at gmail.com (David Starner) Date: Tue, 13 Oct 2015 15:23:36 +0000 Subject: Counting Codepoints In-Reply-To: <20151013073630.7af12df6@JRWUBU2> References: <20151011222034.2a1348ae@JRWUBU2> <561B38E1.5070007@att.net> <20151012203818.7fe468d3@JRWUBU2> <20151013073630.7af12df6@JRWUBU2> Message-ID:

On Mon, Oct 12, 2015 at 11:42 PM Richard Wordingham < richard.wordingham at ntlworld.com> wrote:

> On Mon, 12 Oct 2015 23:35:32 +0000
> David Starner wrote:
>
> > Thus a Unicode string simply can't be in UTF-16 format
> > internally with unpaired surrogates; a Unicode string in a programmer
> > opaque format must do something with broken data on input.
>
> You're assuming that the source of the non-conformance is external to
> the program. In the case that has caused me to ask about lone
> surrogates, they were actually caused by a faulty character deletion
> function within the program itself. Despite this fault, the program
> remains usable - it's little worse than a word processor that insists on
> autocorrupting 'GHz' and 'MHz' to 'Ghz' and 'Mhz'.
>
> I presume you are expecting input of fractional characters to be
> buffered until there is a whole character to add to a string. For
> example, a MSKLC keyboard will deliver a supplementary character in
> two WM_CHAR messages, one for the high surrogate and one for the low
> surrogate.

A UTF-16 string could delete one surrogate, or add a fractional character. A Unicode string (not a "UTF-16 string"), which could be stored internally in, say, a Python-like format which is Latin-1, UCS-2, or UTF-32, with conversions made as needed and the differences hidden from the user, can't. If you let the code delete one surrogate or add one surrogate, if you interpret surrogates at all, it's a UTF-16 string; as often in computing, that gives the programmer more power and control at the cost of being harder to use and easier to break.

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From daniel.buenzli at erratique.ch Tue Oct 13 10:09:16 2015 From: daniel.buenzli at erratique.ch (=?utf-8?Q?Daniel_B=C3=BCnzli?=) Date: Tue, 13 Oct 2015 16:09:16 +0100 Subject: Counting Codepoints In-Reply-To: <20151013074610.665a7a7059d7ee80bb4d670165c8327d.89f65761b7.wbe@email03.secureserver.net> References: <20151013074610.665a7a7059d7ee80bb4d670165c8327d.89f65761b7.wbe@email03.secureserver.net> Message-ID:

On Tuesday, 13 October 2015 at 15:46, Doug Ewell wrote:

> I've been bemused by all this discussion about how unpaired surrogates
> are supposed to behave

I don't understand why people still insist on programming with Unicode at the encoding level rather than at the scalar value level. Deal with encoding errors and sanitize your inputs at the IO boundary of your program and then simply work with scalar values internally.
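A minimal sketch of that discipline, in Java to match the examples elsewhere in this thread (the replace-with-U+FFFD policy shown is one possible choice; rejecting the input outright is the other):

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    final class Boundary {
        // Decode raw bytes once, at the IO boundary. Everything past this
        // point sees only scalar values: a UTF-8 decoder can never produce
        // lone surrogates, and malformed bytes become U+FFFD here.
        static String sanitize(byte[] raw) {
            CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPLACE)
                    .onUnmappableCharacter(CodingErrorAction.REPLACE);
            try {
                return dec.decode(ByteBuffer.wrap(raw)).toString();
            } catch (CharacterCodingException e) {
                throw new AssertionError(e); // unreachable with REPLACE
            }
        }
    }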
Daniel

From richard.wordingham at ntlworld.com Tue Oct 13 13:44:39 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 13 Oct 2015 19:44:39 +0100 Subject: Counting Codepoints In-Reply-To: References: <20151011222034.2a1348ae@JRWUBU2> <561B38E1.5070007@att.net> <20151012203818.7fe468d3@JRWUBU2> <20151013073630.7af12df6@JRWUBU2> Message-ID: <20151013194439.35417d5b@JRWUBU2>

On Tue, 13 Oct 2015 14:08:28 +0200 Mark Davis ☕️ wrote:

> On Tue, Oct 13, 2015 at 8:36 AM, Richard Wordingham <
> richard.wordingham at ntlworld.com> wrote:
> > Rather the question must be the unwieldy one of how
> > many scalar values and lone surrogates it contains in total.

> That may be the question in theory; in practice no programming
> language is going to support APIs like that.

And then exhibits such an API in Java!

> // for the last, could just call: count = (int) test.codePoints().count();

The challenge is rather one of expressing the task. Perhaps: "What is the sum of the number of scalar values and the number of lone surrogates in this Unicode 16-bit string?" Maybe even: "What is the sum of the numbers of non-surrogate codepoints, surrogate pairs and lone surrogates in this Unicode 16-bit string?"

It's slightly less unwieldy in the context where I actually want the expression - "Go back for a grand total of x non-surrogate codepoints, surrogate pairs or lone surrogates."

Richard.

From richard.wordingham at ntlworld.com Tue Oct 13 13:53:29 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 13 Oct 2015 19:53:29 +0100 Subject: Counting Codepoints In-Reply-To: References: <20151011222034.2a1348ae@JRWUBU2> <561B38E1.5070007@att.net> <20151012203818.7fe468d3@JRWUBU2> <20151013073630.7af12df6@JRWUBU2> Message-ID: <20151013195329.1647a3e8@JRWUBU2>

On Tue, 13 Oct 2015 15:23:36 +0000 David Starner wrote:

> A UTF-16 string could delete one surrogate, or add a fractional
> character. A Unicode string (not a "UTF-16 string"), which could be
> stored internally in, say, a Python-like format which is Latin-1,
> UCS-2, or UTF-32, conversions made as needed and differences hidden
> from the user, can't.

Confusingly, the Unicode definitions are the other way round. A UTF-16 string is a string of UTF-16 code units in which all surrogate code units are paired. Any string of 16-bit code units is a Unicode 16-bit string.

Richard.

From richard.wordingham at ntlworld.com Tue Oct 13 14:04:49 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 13 Oct 2015 20:04:49 +0100 Subject: Counting Codepoints In-Reply-To: References: <20151011222034.2a1348ae@JRWUBU2> <561B38E1.5070007@att.net> <20151012203818.7fe468d3@JRWUBU2> <20151013073630.7af12df6@JRWUBU2> Message-ID: <20151013200449.1cc419eb@JRWUBU2>

On Tue, 13 Oct 2015 12:17:43 +0200 Philippe Verdy wrote:

> 2015-10-13 8:36 GMT+02:00 Richard Wordingham <
> richard.wordingham at ntlworld.com>:
> > For
> > example, a MSKLC keyboard will deliver a supplementary character in
> > two WM_CHAR messages, one for the high surrogate and one for the low
> > surrogate.

> I have not tested the actual behavior in 64-bit versions of Windows:
> does the message field of WM_CHAR in the 64-bit version of the API
> still require two messages rather than a single one, if that field
> has been extended to 64 bits?

In Unicode applications, WM_CHAR still delivers one UTF-16 code unit. I suspect it delivers just one byte in multibyte 'ANSI' encodings.
There is a WM_UNICHAR message that delivers whole Unicode characters, but reportedly Microsoft does not use it. > The actual behavior is also tricky as the basic layouts built with > MSKLC will have its character data translated "transparently" to > other "OEM" encodings according to the current input code page of the > console (using one of the codepage mapping tables installed > separately): the transcoder will also need to translate the 16-bit > Unicode input from WM_CHAR messages into the 8-bit input stream used > by the console, and this translation will need to read both > surrogates at once before sending any output. This only applies to 'ANSI' applications. I am not aware of any ANSI codepages that contain supplementary characters. For a Unicode application, no translation from Unicode occurs. Richard. From richard.wordingham at ntlworld.com Tue Oct 13 17:37:06 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 13 Oct 2015 23:37:06 +0100 Subject: Why Work at Encoding Level? In-Reply-To: References: <20151013074610.665a7a7059d7ee80bb4d670165c8327d.89f65761b7.wbe@email03.secureserver.net> Message-ID: <20151013233706.63771fc4@JRWUBU2> On Tue, 13 Oct 2015 16:09:16 +0100 Daniel B?nzli wrote (under topic heading 'Counting Codepoints') > I don't understand why people still insist on programming with > Unicode at the encoding level rather than at the scalar value level. > Deal with encoding errors and sanitize your inputs at the IO boundary > of your program and then simply work with scalar values internally. If you are referring to indexing, I suspect the issue is performance. UTF-32 feels wasteful, and if the underlying character text is UTF-8 or UTF-16 we need an auxiliary array to convert character number to byte offset if we are to have O(1) time for access. This auxiliary array can be compressed chunk by chunk, but the larger the chunk, the greater the maximum access time. The way it could work is a bit strange, because this auxiliary array is redundant. For example, you could use it to record the location of every 4th or every 5th codepoint so as to store UTF-8 offset variation in 4 bits, or every 15th codepoint for UTF-16. Access could proceed by looking up the index for the relevant chunk, then adding up nibbles to find the relevant recorded location within the chunk, and then use the basic character storage itself to finally reach the intermediate points. (I doubt this is an original idea, but I couldn't find it expressed anywhere. It probably performs horribly for short strings.) Perhaps you are merely suggesting that people work with a character iterator, or in C refrain from doing integer arithmetic on pointers into strings. Richard. From daniel.buenzli at erratique.ch Tue Oct 13 18:28:26 2015 From: daniel.buenzli at erratique.ch (=?utf-8?Q?Daniel_B=C3=BCnzli?=) Date: Wed, 14 Oct 2015 00:28:26 +0100 Subject: Why Work at Encoding Level? In-Reply-To: <20151013233706.63771fc4@JRWUBU2> References: <20151013074610.665a7a7059d7ee80bb4d670165c8327d.89f65761b7.wbe@email03.secureserver.net> <20151013233706.63771fc4@JRWUBU2> Message-ID: <84D764A7A83B4D73A9C09C7D619E2922@erratique.ch> Le mardi, 13 octobre 2015 ? 23:37, Richard Wordingham a ?crit : > If you are referring to indexing, I suspect the issue is performance. > UTF-32 feels wasteful, and if the underlying character text is UTF-8 or > UTF-16 we need an auxiliary array to convert character number to byte > offset if we are to have O(1) time for access. 
If UTF-32 feels wasteful, there are various smart ways of providing direct indexing at a reasonable cost, if you are in a language that has minimal support for datatype definition and abstraction. Also, I personally find indexing to be rarely useful in string processing, so it may not be the operation you want to optimize for. Having iterator-like functions as you suggest, and a datatype to represent substrings, often seems a better fit than doing indexing arithmetic.

Note that the Swift programming language seems to have gone even further than I would have: their notion of character is a grapheme cluster tested for equality using canonical equivalence, and that's what they index in their strings, see [1]. I don't know how well that works in practice, as I have personally never used it; but it feels like the ultimate Unicode string model you want to provide to the zero-knowledge Unicode programmer (at least for alphabetic scripts).

Best,

Daniel

[1] https://developer.apple.com/library/ios/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html

From verdy_p at wanadoo.fr Tue Oct 13 18:41:36 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 14 Oct 2015 01:41:36 +0200 Subject: Why Work at Encoding Level? In-Reply-To: <20151013233706.63771fc4@JRWUBU2> References: <20151013074610.665a7a7059d7ee80bb4d670165c8327d.89f65761b7.wbe@email03.secureserver.net> <20151013233706.63771fc4@JRWUBU2> Message-ID:

Speed is not much linked to in-memory buffer sizes (memory is cheap and comfortable now), and parsing in-memory encodings is extremely fast. The actual limitation is in I/O (network, or storage on disk), and at this level you work with network datagrams/packets, or disk buffers, or memory pages for paging, which use buffers of static size (so the memory allocation cost can be avoided, as the buffer is reusable). Given that, you can easily create default buffers as small as about 4 KB and convert from any encoding to another with a static auxiliary buffer that is also small (16 KB for the worst cases), managing at little cost the transitions that may occur in the middle of an encoding sequence. Working with buffers considerably reduces the number of I/O operations performed, and you can still compress by chunk (just make sure your auxiliary buffer has enough spare bytes at the end for the worst case, to avoid performing two I/O operations or compressing two chunks, including a degenerate one). Even data compression is fast now and helps reduce the I/O: the cost of compression in memory is small compared to the cost of I/O, so much so that the Windows kernel can now use generic data compression for memory page paging, to improve the global performance of the system when the global memory page pool is full, or for disk virtualization purposes.

The UTF-8 encoding is extremely simple and very fast to implement, and in most cases it saves a lot compared to storing UTF-32 (including for large collections of text elements in memory). So using iterators is the way to go: it is simple to program, easy to optimize, and you completely forget that UTF-8 is used in the backing store.

2015-10-14 0:37 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>:

> On Tue, 13 Oct 2015 16:09:16 +0100
> Daniel Bünzli wrote (under topic heading
> 'Counting Codepoints')
>
> > I don't understand why people still insist on programming with
> > Unicode at the encoding level rather than at the scalar value level.
> > Deal with encoding errors and sanitize your inputs at the IO boundary
> > of your program and then simply work with scalar values internally.
>
> If you are referring to indexing, I suspect the issue is performance.
> UTF-32 feels wasteful, and if the underlying character text is UTF-8 or
> UTF-16 we need an auxiliary array to convert character number to byte
> offset if we are to have O(1) time for access.
>
> This auxiliary array can be compressed chunk by chunk, but the larger
> the chunk, the greater the maximum access time. The way it could work
> is a bit strange, because this auxiliary array is redundant. For
> example, you could use it to record the location of every 4th or every
> 5th codepoint so as to store UTF-8 offset variation in 4 bits, or every
> 15th codepoint for UTF-16. Access could proceed by looking up the
> index for the relevant chunk, then adding up nibbles to find the
> relevant recorded location within the chunk, and then use the basic
> character storage itself to finally reach the intermediate points.
>
> (I doubt this is an original idea, but I couldn't find it expressed
> anywhere. It probably performs horribly for short strings.)
>
> Perhaps you are merely suggesting that people work with a character
> iterator, or in C refrain from doing integer arithmetic on pointers
> into strings.
>
> Richard.

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From moyogo at gmail.com Wed Oct 14 11:04:20 2015 From: moyogo at gmail.com (Denis Jacquerye) Date: Wed, 14 Oct 2015 16:04:20 +0000 Subject: Latin glottal stop in ID in NWT, Canada Message-ID:

This October the CBC has an article about having a Dene character in ID in Canada. At the moment the NWT does not allow special characters in names, but this might change after a report by the NWT languages commissioner. The article uses the unicase ʔ U+0294 LATIN LETTER GLOTTAL STOP in the name Sahaiʔa.

"N.W.T. ID should allow Dene symbols, says languages commissioner" http://www.cbc.ca/news/canada/north/n-w-t-id-should-allow-dene-symbols-says-languages-commissioner-1.3269222

Here is what N.W.T.'s language commissioner, Shannon Gullberg, is quoted as saying: "By not allowing for names that contain Dene fonts, diacritical marks and symbols, she says the Vital Statistics Act is violating the spirit and intent of the Official Languages Act."

This is a follow-up article on what was reported in March. The CBC's March article and a Maclean's article were using the unicase ʔ U+0294 as well.

"Chipewyan baby name not allowed on N.W.T. birth certificate" http://www.cbc.ca/news/canada/north/chipewyan-baby-name-not-allowed-on-n-w-t-birth-certificate-1.2984173

Where Dene languages expert Brent Kaulback is quoted as saying: "Dene fonts are now unicode fonts. They can be loaded onto any computer, and if they're typed into any computer, any other computer can read those fonts as well."

"What's in a name? A Chipewyan's battle over her native tongue" http://www.macleans.ca/society/life/all-in-the-family-name/

The Toronto Star and Metro News Toronto had articles using the uppercase Ɂ U+0241 LATIN CAPITAL LETTER GLOTTAL STOP in the name SahaiɁa. This probably should have been the unicase ʔ U+0294 or the lowercase ɂ U+0242 LATIN SMALL LETTER GLOTTAL STOP (Chipewyan sources I found use one or the other character for the lowercase letters).

"Aboriginal mom fights officialdom over spelling of daughter's name: SahaiɁa"
https://www.thestar.com/news/canada/2015/03/06/nwt-wont-recognize-infants-aboriginal-name.html

"Fighting for SahaiɁa: Canada's first peoples deserve the right to use their own names" http://www.metronews.ca/views/2015/03/10/fighting-for-sahai%25c9%2582a-canadas-first-peoples-deserve-the-right-to-use-their-own-names.html

Searching on the web, only a couple of pages (that are now offline) use the lowercase ɂ U+0242 LATIN SMALL LETTER GLOTTAL STOP in Sahaiɂa.

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From charupdate at orange.fr Thu Oct 15 03:06:45 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Thu, 15 Oct 2015 10:06:45 +0200 (CEST) Subject: Latin glottal stop in ID in NWT, Canada In-Reply-To: References: Message-ID: <693838558.2868.1444896405426.JavaMail.www@wwinf1n33>

On Wed, 14 Oct 2015 16:04:20 +0000, Denis Jacquerye wrote:

> The article uses the unicase ʔ U+0294 LATIN LETTER GLOTTAL STOP in the name Sahaiʔa.
> [...]
> The CBC's March article and a Maclean's article were using the unicase ʔ U+0294 as well
> [...]
> The Toronto Star and Metro News Toronto had articles using the uppercase Ɂ U+0241 LATIN CAPITAL LETTER GLOTTAL STOP in the name SahaiɁa. This probably should have been the unicase ʔ U+0294 or the lowercase ɂ U+0242 LATIN SMALL LETTER GLOTTAL STOP ([...]).
> [...]
> Searching on the web, only a couple of pages (that are now offline) use the lowercase ɂ U+0242 LATIN SMALL LETTER GLOTTAL STOP in Sahaiɂa.

This raises the problem of yet another ambiguity, this one due to originally diverging usages of a casing vs. a non-casing glottal stop. Latin being a casing script, the glottal stop should arguably be the casing one only. Since this is available, making an effort to unify the usage may be desirable.
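For instance, the cased pair is already wired into the standard case mappings; a small Java check (assuming a JDK whose character data is Unicode 5.0 or later) makes the situation concrete:

    public class GlottalCasing {
        public static void main(String[] args) {
            // U+0241 and U+0242 have been a standard case pair since Unicode 5.0.
            System.out.println(Character.toLowerCase('\u0241') == '\u0242'); // true
            System.out.println(Character.toUpperCase('\u0242') == '\u0241'); // true
            // The unicase U+0294 has no case mappings: it uppercases to itself.
            System.out.println(Character.toUpperCase('\u0294') == '\u0294'); // true
        }
    }

So the casing machinery in software already supports the bicameral convention; the remaining divergence is purely one of community usage.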
> > Here, this results in ensuring that Sahai?a and all other people with a > glottal stop in their name will escape trouble with even more officialdom. > > Best hopes, > > Marcel From dzo at bisharat.net Thu Oct 15 19:22:08 2015 From: dzo at bisharat.net (Don Osborn) Date: Thu, 15 Oct 2015 20:22:08 -0400 Subject: Non-standard 8-bit fonts still in use Message-ID: <56204330.6010106@bisharat.net> I was surprised to learn of continued reference to and presumably use of 8-bit fonts modified two decades ago for the extended Latin alphabets of Malian languages, and wondered if anyone has similar observations in other countries. Or if there have been any recent studies of adoption of Unicode fonts in the place of local 8-bit fonts for extended Latin (or non-Latin) in local language computing. At various times in the past I have encountered the idea that local languages with extended alphabets in Africa require special fonts (that region being my main geographic area of experience with multilingual computing), but assumed that this notion was fading away. See my recent blog post for a quick and by no means complete discussion about this topic, which of course has to do with more than just the fonts themselves: http://niamey.blogspot.com/2015/10/the-secret-life-of-bambara-arial.html TIA for any feedback. Don Osborn From moyogo at gmail.com Fri Oct 16 00:47:40 2015 From: moyogo at gmail.com (Denis Jacquerye) Date: Fri, 16 Oct 2015 05:47:40 +0000 Subject: Latin glottal stop in ID in NWT, Canada In-Reply-To: References: <693838558.2868.1444896405426.JavaMail.www@wwinf1n33> Message-ID: On Thu, 15 Oct 2015 at 23:55 Leo Broukhis wrote: > Along the same lines, should I be able to change my last name > officially to ?pyx?c? (NB all letters are codepoints with names > starting with "LATIN"). > If these are characters used in an official language of your territorial authority, that would make sense. But even if they are not, it is a good question. -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Fri Oct 16 02:18:06 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Fri, 16 Oct 2015 09:18:06 +0200 (CEST) Subject: Latin glottal stop in ID in NWT, Canada In-Reply-To: References: <693838558.2868.1444896405426.JavaMail.www@wwinf1n33> Message-ID: <306808615.2290.1444979886727.JavaMail.www@wwinf1h15> On Thu, 15 Oct 2015 15:46:46 -0700, Leo Broukhis wrote: > Along the same lines, should I be able to change my last name > officially to ?pyx?c? (NB all letters are codepoints with names > starting with "LATIN"). Your question is hard for me to answer, but I believe that basically you are allowed to submit a name change request for any last name you would like to bear, including any orthography. Ultimately the decision making belongs to your government. Best wishes, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Fri Oct 16 02:32:12 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Fri, 16 Oct 2015 09:32:12 +0200 (CEST) Subject: Latin glottal stop in ID in NWT, Canada In-Reply-To: <693838558.2868.1444896405426.JavaMail.www@wwinf1n33> References: <693838558.2868.1444896405426.JavaMail.www@wwinf1n33> Message-ID: <1572692223.2568.1444980732177.JavaMail.www@wwinf1h15> Amidst the (wise) silence on the precise subject of this thread, I?m good to point out that the use of uppercase glottal stop in home country newspapers is certainly for spectacularity?s and legibility?s sake. 
Would it be a good idea to contact the editors, pointing to the Unicode Mailing List, and forward their advice to the List? For a more accurate bit of glottal stop encoding history than in my yesterday?s mail: While the uncased original 0294 LATIN LETTER GLOTTAL STOP is a part of Unicode since the dawn (1.1), uppercase 0241 LATIN CAPITAL LETTER GLOTTAL STOP joined up for 4.1 in support of NWT communities and made U+0294 its lowercase, but this fortunately regained autonomy one year and version later when lowercase 0242 LATIN SMALL LETTER GLOTTAL STOP was born, thanks to Canada (SCC) and Ireland (NSAI) Standards bodies.[1] This was still right before the deadline of the reference subset at the creation of a widely used font shipped with Windows (so there should be no problem on font side). To date, about half of Canadian Aboriginal languages (e.g. in NWT) use cased glottal stop, while the other (e.g. in SK) use monocameral. One of the latter uses digit seven instead. ?7? for ??? is no problem on road signs, while I?m not sure whether the same applies in text processing. I don?t believe neither that in the other languages, this translation to ASCII would be less offence than the actually enforced replacement of glottal stop in ID with a hyphen-minus. I?wonder whether NWT officialdom didn?t propose to put an apostrophe for the glottal stop till they get the missing software updates :) After all, these IPA and then Latin Extended letters are thought to be basically an enlarged curly apostrophe. The curl isn?t even required, as the same sound looks like a styled ASCII apostrophe when it occurs in a number of warmer countries (A78B LATIN CAPITAL LETTER SALTILLO, A78C LATIN SMALL LETTER SALTILLO). The thread about how to call non-ASCII characters on the whole, was a very good idea. Would anybody please send a link to NWT authorities? I think it would be fine to support the courageous mum in the lawsuit! In case this link will be useful, here it is: http://www.unicode.org/mail-arch/unicode-ml/y2015-m09/0125.html In case of trouble typing glottal stops, the best solution is probably to change for a fully Unicode supporting keyboard layout. This typically has a Compose key, which can be implemented on [AltGr]+[Space]. (Put the no-break spaces into the Shift and Shift+AltGr shift states.) Here are sequences for glottal stop: {Compose}{'}{7} ? ? {Compose}{'}{T} ? ? {Compose}{'}{t} ? ? ([T] is used because it is not far from [7] and ?T with acute? doesn?t exist. To remember, read ?gloTTal sTop?, and note a slight resemblance between '7' and 'T'.) We hope that the full range of first names will be successfully implemented, so that any person bearing a name with glottal stop and his/her relatives will never encounter any trouble again. Best wishes, Marcel [1] http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2962.pdf -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmus-inc at ix.netcom.com Fri Oct 16 02:38:45 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Fri, 16 Oct 2015 00:38:45 -0700 Subject: Latin glottal stop in ID in NWT, Canada In-Reply-To: <306808615.2290.1444979886727.JavaMail.www@wwinf1h15> References: <693838558.2868.1444896405426.JavaMail.www@wwinf1n33> <306808615.2290.1444979886727.JavaMail.www@wwinf1h15> Message-ID: <5620A985.8080902@ix.netcom.com> An HTML attachment was scrubbed... 
URL: From wjgo_10009 at btinternet.com Fri Oct 16 12:10:17 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Fri, 16 Oct 2015 18:10:17 +0100 (BST) Subject: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?) In-Reply-To: <6041047.15822.1444472090118.JavaMail.defaultUser@defaultHost> References: <33365178.33329.1444395933996.JavaMail.defaultUser@defaultHost> <6041047.15822.1444472090118.JavaMail.defaultUser@defaultHost> Message-ID: <26364438.57814.1445015417239.JavaMail.defaultUser@defaultHost> What is the scope of Unicode please? Can it ever change? If it can change, who makes the decision? For example, does it need an ISO decision at a level higher than the WG2 committee or can the WG2 committee do it if it so pleases? How can a person apply for the scope of Unicode to become changed please? I have been considering how to make progress with trying for my research to become implemented in a standardized manner. I have been informed that a group of people have examined the document that I submitted and determined that it is out of scope for UTC. As implementation of the research in a standardized manner, if it ever takes place, will need of necessity for two base characters to become encoded into Unicode, then I feel that I need to submit a new document that is either itself in scope for UTC; or requests changing the scope of Unicode, though maybe such a document would itself not be regarded as being in scope. The thing is, I was not informed as to why my document was determined to be out of scope. If I knew why, then maybe I could write a document that is in scope. I am not expecting the Unicode Technical Committee to encode the two characters straight away. It would be good just to get a document into the Unicode Document Register. The best that I could achieve would be for the UTC to agree to keep the matter in escrow so that if I can persuade ISO to encode a plain paper list of words and code numbers and a plain paper list of preset sentences and code numbers, then UTC would encode the two base characters into plane 14 at that time so that the two plain paper lists could each be applied to produce a tagspace accessed by a plane 14 character. If UTC were to decide that, or something approaching that, I would then be able to approach ISO with first draft proposals for the two plain paper lists, not comprehensive, more placeholder applications so as to try to get things started, saying to ISO that encoding into Unicode would become possible. The two placeholder plain paper first drafts could then be gradually altered, maybe completely altered, and augmented so as to be able to get things started, just like Unicode started small and has been extended over time. So at the moment my first task is to try to produce a document that will be determined to be in scope so that it goes into the Unicode Document Register and each member of the Unicode Technical Committee has a chance to express an opinion at the meeting. So, what is the scope of Unicode please? William Overington 16 October 2015 ----Original message---- >From : wjgo_10009 at btinternet.com Date : 10/10/2015 - 11:14 (GMTST) To : unicode at unicode.org Subject : How can my research become implemented in a standardized manner? Please note that I am on moderated post, so if this post does get sent to the Unicode mailing list it will be because the moderator has kindly agreed to it being circulated. 
I have recently made significant progress with my research in communication through the language barrier. The capabilities are greatly improved. On 7 October 2015 I submitted a document, hoping that it would become included in the Unicode Document Register. I have been informed that a group of people have examined the document and determined that it is out of scope for UTC. I am not seeking to question that decision. As an independent researcher, not representing an organization, nor in fact employed by any organization at all, I am trying to get the system standardized as an international standard. I feel that trying to produce first a widely-used system using a Private Use Area encoding is not a realistic practical goal and even if it were practical the result would be lots of legacy data. I feel that to become successful the system needs standardization and implementation to go forward together. So what to do? More generally, how are the format and the encoding of tagspaces to be carried out in the future? The document is available on the web at the present time in two places. There is a file available for download as an attachment in a forum post of 8 October 2015 in the High-Logic Gallery forum. Adding a direct link to the post is not at present possible using the particular email system that I am using. There is direct access in my family webspace. www.users.globalnet.co.uk/~ngo/two_tagspaces.pdf In addition I have deposited the document at the British Library. William Overington 10 October 2015 -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Sat Oct 17 03:20:13 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 17 Oct 2015 09:20:13 +0100 Subject: Non-standard 8-bit fonts still in use In-Reply-To: <56204330.6010106@bisharat.net> References: <56204330.6010106@bisharat.net> Message-ID: <20151017092013.7614f94a@JRWUBU2> On Thu, 15 Oct 2015 20:22:08 -0400 Don Osborn wrote: > I was surprised to learn of continued reference to and presumably use > of 8-bit fonts modified two decades ago for the extended Latin > alphabets of Malian languages, and wondered if anyone has similar > observations in other countries. Or if there have been any recent > studies of adoption of Unicode fonts in the place of local 8-bit > fonts for extended Latin (or non-Latin) in local language computing. Non-Unicode fonts have been particularly resilient in Indic scripts, though I'm not sure what the current state of play is. I'm not sure that they are particularly '8-bit', but rather, they re-use the more accessible codes. Although these font schemes generally have the disadvantage that plain text is not supported, in the Indic world they do have advantages over Unicode: 1) What you type is what you get. Indic rearrangement irritates a lot of people. Several Tai scripts have successfully resisted it, but Indians have been suppressed by the influence of ISCII. 2) They avoid the dependence on a language-specific shaping engine. Microsoft's USE may now eliminate this advantage. 3) Text is accessible for editing. Windows provides no cursor positioning within grapheme clusters, and the one response has been to prevent editing of grapheme clusters. As a slight compensation, the idea that backward deletion should delete the preceding encoded character has a lot of implementation support. I understand that in Cambodia, Unicode was established by government edict. Richard. 
From charupdate at orange.fr Sat Oct 17 05:28:16 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Sat, 17 Oct 2015 12:28:16 +0200 (CEST) Subject: Latin glottal stop in ID in NWT, Canada In-Reply-To: References: Message-ID: <1770734131.3869.1445077696319.JavaMail.www@wwinf1d37> On Wed, 14 Oct 2015 16:04:20 +0000, Denis Jacquerye wrote: > The Toronto Star, Metro News Toronto had articles using the uppercase ? U+0241 LATIN CAPITAL LETTER GLOTTAL STOP in the name Sahai?a. This probably should have been the unicase ? U+0294 or the lowercase ? U+0241 LATIN SMALL LETTER GLOTTAL STOP (Chipewyan sources I found use one or the other character for the lowercase letters). I?believe it?s not too much to insist that here is no inconsistency of usage inside a given community. Based on the lowercase glottal stop encoding proposal,[1] we know that at least with respect to glottal stop casing, Chipewyan is the language of two distinct communities which are geographically separate. In North-West Territories, Chipewyan uses the bicameral glottal stop, and in Saskatchewan, Chipewyan uses the unicameral glottal stop. Based upon this, I?suppose that the cited Chipewyan sources originate from Saskatchewan, and that they happen to contain only instances of glottal stop in lowercase positions. By this occasion I?apologize for having written about unification of usage. Best regards, Marcel [1] http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2962.pdf -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Sat Oct 17 05:46:37 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Sat, 17 Oct 2015 12:46:37 +0200 (CEST) Subject: Latin glottal stop in ID in NWT, Canada In-Reply-To: <1770734131.3869.1445077696319.JavaMail.www@wwinf1d37> References: <1770734131.3869.1445077696319.JavaMail.www@wwinf1d37> Message-ID: <717059844.8332.1445078797606.JavaMail.www@wwinf1p23> Please disregard my previous faulty e-mail. I don't have much time to spend on issues that I'm not directly concerned with, so sadly I'm very stressed. Here is the accurate one: On Wed, 14 Oct 2015 16:04:20 +0000, Denis Jacquerye wrote: > The Toronto Star, Metro News Toronto had articles using the uppercase ? U+0241 LATIN CAPITAL LETTER GLOTTAL STOP in the name Sahai?a. This probably should have been the unicase ? U+0294 or the lowercase ? U+0241 LATIN SMALL LETTER GLOTTAL STOP (Chipewyan sources I found use one or the other character for the lowercase letters). I? believe it?s not too much to insist that here is no inconsistency of usage inside a given community. Based on the lowercase glottal stop encoding proposal,[1] we know that at least with respect to glottal stop casing, Chipewyan is the language of two distinct communities which are geographically separate. In North-West Territories, Chipewyan uses the bicameral glottal stop, and in Saskatchewan, Chipewyan uses the unicameral glottal stop. Based upon this, I ?suppose that those among the cited Chipewyan sources which use unicase glottal stop, originate from Saskatchewan, and that those using lowercase glottal stop, originate from NWT, and that both happen to contain only instances of glottal stop in lowercase positions. By this occasion I? apologize for having written about unification of usage. Best regards, Marcel [1] http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2962.pdf -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From charupdate at orange.fr Sat Oct 17 06:28:21 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Sat, 17 Oct 2015 13:28:21 +0200 (CEST) Subject: Tirhuta (linked to: Re: Non-standard 8-bit fonts still in use) In-Reply-To: <20151017092013.7614f94a@JRWUBU2> References: <56204330.6010106@bisharat.net> <20151017092013.7614f94a@JRWUBU2> Message-ID: <1935235321.6153.1445081301207.JavaMail.www@wwinf1k18> On Sat, 17 Oct 2015 09:20:13 +0100, Richard Wordingham wrote: > On Thu, 15 Oct 2015 20:22:08 -0400 > Don Osborn wrote: > > > I was surprised to learn of continued reference to and presumably use > > of 8-bit fonts modified two decades ago for the extended Latin > > alphabets of Malian languages, and wondered if anyone has similar > > observations in other countries. Or if there have been any recent > > studies of adoption of Unicode fonts in the place of local 8-bit > > fonts for extended Latin (or non-Latin) in local language computing. > > Non-Unicode fonts have been particularly resilient in Indic scripts, > though I'm not sure what the current state of play is. I'm not sure > that they are particularly '8-bit', but rather, they re-use the more > accessible codes. > > Although these font schemes generally have the disadvantage that plain > text is not supported, in the Indic world they do have advantages over > Unicode: > > 1) What you type is what you get. Indic rearrangement irritates a lot > of people. Several Tai scripts have successfully resisted it, but > Indians have been suppressed by the influence of ISCII. Does this mean that OpenType fonts are "overscripted" and that glyph reordering and glyph substitution are not appreciated? If so, the best seems to me to convert legacy fonts to Unicode conformant fonts without scripting them. Or to provide kind of a *stable input* option that disables the advanced behaviour. Marcel ? [First in thread: http://www.unicode.org/mail-arch/unicode-ml/y2015-m09/0155.html] [Previous in thread: http://www.unicode.org/mail-arch/unicode-ml/y2015-m09/0156.html] ? -------------- next part -------------- An HTML attachment was scrubbed... URL: From eik at iki.fi Sat Oct 17 08:18:27 2015 From: eik at iki.fi (Erkki I Kolehmainen) Date: Sat, 17 Oct 2015 16:18:27 +0300 Subject: Latin glottal stop in ID in NWT, Canada In-Reply-To: <717059844.8332.1445078797606.JavaMail.www@wwinf1p23> References: <1770734131.3869.1445077696319.JavaMail.www@wwinf1d37> <717059844.8332.1445078797606.JavaMail.www@wwinf1p23> Message-ID: <000001d108de$4f1bd100$ed537300$@fi> Dear Mr. Schneider, Nobody forces you to spend any time on issues that you are not directly concerned with or even those that you are. Thus, please, spare us at least from contributions that even by your own admittance have been prepared in a haste without much thought. Sincerely, Erkki I. Kolehmainen L?hett?j?: Unicode [mailto:unicode-bounces at unicode.org] Puolesta Marcel Schneider L?hetetty: 17. lokakuuta 2015 13:47 Vastaanottaja: Denis Jacquerye Kopio: Unicode Discussion Aihe: Re: Latin glottal stop in ID in NWT, Canada Please disregard my previous faulty e-mail. I don't have much time to spend on issues that I'm not directly concerned with, so sadly I'm very stressed. Here is the accurate one: On Wed, 14 Oct 2015 16:04:20 +0000, Denis Jacquerye < moyogo at gmail.com> wrote: > The Toronto Star, Metro News Toronto had articles using the uppercase ? U+0241 LATIN CAPITAL LETTER GLOTTAL STOP in the name Sahai?a. This probably should have been the unicase ? U+0294 or the lowercase ? 
U+0241 LATIN SMALL LETTER GLOTTAL STOP (Chipewyan sources I found use one or the other character for the lowercase letters). I? believe it?s not too much to insist that here is no inconsistency of usage inside a given community. Based on the lowercase glottal stop encoding proposal,[1] we know that at least with respect to glottal stop casing, Chipewyan is the language of two distinct communities which are geographically separate. In North-West Territories, Chipewyan uses the bicameral glottal stop, and in Saskatchewan, Chipewyan uses the unicameral glottal stop. Based upon this, I ?suppose that those among the cited Chipewyan sources which use unicase glottal stop, originate from Saskatchewan, and that those using lowercase glottal stop, originate from NWT, and that both happen to contain only instances of glottal stop in lowercase positions. By this occasion I? apologize for having written about unification of usage. Best regards, Marcel [1] http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2962.pdf -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Sun Oct 18 19:45:14 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 19 Oct 2015 01:45:14 +0100 Subject: Why Work at Encoding Level? In-Reply-To: <84D764A7A83B4D73A9C09C7D619E2922@erratique.ch> References: <20151013074610.665a7a7059d7ee80bb4d670165c8327d.89f65761b7.wbe@email03.secureserver.net> <20151013233706.63771fc4@JRWUBU2> <84D764A7A83B4D73A9C09C7D619E2922@erratique.ch> Message-ID: <20151019014514.0c392f7c@JRWUBU2> On Wed, 14 Oct 2015 00:28:26 +0100 Daniel B?nzli wrote: > If UTF-32 feels wasteful there are various smart ways of providing > direct indexing at a reasonable cost if you are in a language that > has minimal support for datatype definition and abstraction. I can't find a good one that's been published. The Elias-Fano encoding for UTF-8 indexing works out at 3 to 5 bits per character even without extending to achieve 'constant time' access, the limiting extremes being English and Ugaritic. (Most SMP scripts use a lot of ASCII.) For genuine UTF-8 text I can happily get the memory requirement down to 1.031 bits per character. I exploit the fact that one can easily advance character by character through a UTF-8 string, but limit myself to 5 advances. The 0.031 part of the factor comes in for strings longer than a thousand characters, and could be reduced to 0.002 with some extra processing. There's a lot of redundancy in the positions. > Note that the Swift programming language seems to have gone even > further than I would have: their notion of character is a grapheme > cluster tested for equality using canonical equivalence and that's > what they index in their strings, see [1]. Don't know how well that > works in practice as I personally never used it; but it feels like > the ultimate Unicode string model you want to provide to the > zero-knowledge Unicode programmer (at least for alphabetic scripts). It doesn't quite work. For Thai at least, deleting backwards should delete just a combining mark rather than the whole grapheme cluster. I couldn't find any provision for this in Swift. There is also the question (irrelevant for Thai) of whether this deletion should be done in NFC or NFD. 
Deleting backwards deleting only a combining mark also makes sense for the International Phonetic Alphabet, as well as for the Thai script used alphabetically (as often done for Pali) and for the Lao script - the modern Lao writing system is formally an alphabet. Richard. From doug at ewellic.org Mon Oct 19 12:07:31 2015 From: doug at ewellic.org (Doug Ewell) Date: Mon, 19 Oct 2015 10:07:31 -0700 Subject: Why Work at Encoding =?UTF-8?Q?Level=3F?= Message-ID: <20151019100731.665a7a7059d7ee80bb4d670165c8327d.1842de28ba.wbe@email03.secureserver.net> This discussion was originally about how to handle unpaired surrogates, as if that were a normal use case. Regardless of what encoding model is used to handle characters under the hood, and regardless of how the Delete key should work with actual characters or clusters, there is never any excuse for software to create unpaired surrogates, or any other sort of invalid code unit sequences. That is like having an image editor that deletes every 128th byte from a JPEG file, and then worrying about how to display the file. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From richard.wordingham at ntlworld.com Mon Oct 19 13:53:03 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 19 Oct 2015 19:53:03 +0100 Subject: Why Work at Encoding Level? In-Reply-To: <20151019100731.665a7a7059d7ee80bb4d670165c8327d.1842de28ba.wbe@email03.secureserver.net> References: <20151019100731.665a7a7059d7ee80bb4d670165c8327d.1842de28ba.wbe@email03.secureserver.net> Message-ID: <20151019195303.53e8ee83@JRWUBU2> On Mon, 19 Oct 2015 10:07:31 -0700 "Doug Ewell" wrote: > This discussion was originally about how to handle unpaired > surrogates, as if that were a normal use case. And the subject line was changed when the topic changed to traversing strings. > Regardless of what encoding model is used to handle characters under > the hood, and regardless of how the Delete key should work with actual > characters or clusters, there is never any excuse for software to > create unpaired surrogates, or any other sort of invalid code unit > sequences. How about, 'The specification says that one must pass the number of _characters_ in the string.'? Even worse, some specifications talk of 'Unicode characters' when they mean UTF-16 code units. The word 'codepoint' is even worse, as a supplementary plane codepoint is represented by two BMP codepoints. ICU (but perhaps it's actually Java) seems to have a culture of tolerating lone surrogates, and rules for handling lone surrogates are strewn across the Unicode standards and annexes. It was the once the case that basic Unicode support in regular expressions required a regular expression engine to be able to search for specified lone surrogates - a real show-stopper for an engine working in UTF-8. The Unicode collation algorithm conformance test once tested that implementations of collation collated lone surrogates correctly. Raising an exception was an automatic test failure! By contrast, no-one's proposed collation rules for broken bits of UTF-8 characters or non-minimal length forms. > That is like having an image editor that deletes every > 128th byte from a JPEG file, and then worrying about how to display > the file. 1. Of course, telemetry streams may very well contain damaged JPEG images! 2. The problem bad handling of supplementary characters seems to be associated with UTF-16 is that the damage is rarely as obvious as every 128th code unit. 
By contrast, bad UTF-8 handling usually comes to light as soon as the text processing moves beyond ASCII. Richard. From verdy_p at wanadoo.fr Mon Oct 19 14:35:16 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 19 Oct 2015 21:35:16 +0200 Subject: Why Work at Encoding Level? In-Reply-To: <20151019195303.53e8ee83@JRWUBU2> References: <20151019100731.665a7a7059d7ee80bb4d670165c8327d.1842de28ba.wbe@email03.secureserver.net> <20151019195303.53e8ee83@JRWUBU2> Message-ID: 2015-10-19 20:53 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > On Mon, 19 Oct 2015 10:07:31 -0700 > "Doug Ewell" wrote: > > > This discussion was originally about how to handle unpaired > > surrogates, as if that were a normal use case. > > And the subject line was changed when the topic changed to traversing > strings. > > > Regardless of what encoding model is used to handle characters under > > the hood, and regardless of how the Delete key should work with actual > > characters or clusters, there is never any excuse for software to > > create unpaired surrogates, or any other sort of invalid code unit > > sequences. > > The word > 'codepoint' is even worse, as a supplementary plane codepoint is > represented by two BMP codepoints. > No ! The "supplementary code points" (or "supplementary characters" when they are assigned to characters) are represented in UTF-16 as two **code units**, NOT as two "code points" (even if their binary value are related). The code points in range U+D800..U+DF00 are NEVER characters they are juste permanently reserved in order to unassign them to any character, so these code points are assigned, but not to characters (otherwise these characters would not be representable as valid UTF-16). These code points also do not have any scalar value, and there are not valid scalar values in range 0xD800..0xDFFF (the valid scalar values are in two ranges of integers, separated by this hole). So please don't mix "code points" and "code units" ! -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Mon Oct 19 15:32:07 2015 From: doug at ewellic.org (Doug Ewell) Date: Mon, 19 Oct 2015 13:32:07 -0700 Subject: Unpaired surrogates (was: Re: Why Work at Encoding =?UTF-8?Q?Level=3F=29?= Message-ID: <20151019133207.665a7a7059d7ee80bb4d670165c8327d.aec337c813.wbe@email03.secureserver.net> Richard Wordingham wrote: >> This discussion was originally about how to handle unpaired >> surrogates, as if that were a normal use case. > > And the subject line was changed when the topic changed to > traversing strings. Granted. I've changed it again to reflect this specific issue. > How about, 'The specification says that one must pass the number of > _characters_ in the string.'? Even worse, some specifications talk of > 'Unicode characters' when they mean UTF-16 code units. The word > 'codepoint' is even worse, as a supplementary plane codepoint is > represented by two BMP codepoints. None of this lets any implementer or implementation off the hook. TUS is very clear that an unpaired surrogate is not to be interpreted in any way, and particularly not to be treated as an abstract character. See, for example, C1 and D75. > ICU (but perhaps it's actually Java) seems to have a culture of > tolerating lone surrogates, and rules for handling lone surrogates are > strewn across the Unicode standards and annexes. I suspect you have an example. 
I'd be curious what any of them has to say that does not equate to "this is an anomalous situation and represents broken and ill-formed text." Applications that treat unpaired surrogates as well-formed text do not change the rules; they are in violation of the rules.

> It was once the case that basic Unicode support in regular expressions required a regular expression engine to be able to search for specified lone surrogates - a real show-stopper for an engine working in UTF-8. The Unicode collation algorithm conformance test once tested that implementations of collation collated lone surrogates correctly. Raising an exception was an automatic test failure! By contrast, no-one has proposed collation rules for broken bits of UTF-8 characters or non-minimal length forms.

Are these tests still included, or did someone notice that they were in conflict with the standard and remove them?

> > That is like having an image editor that deletes every 128th byte from a JPEG file, and then worrying about how to display the file.
>
> 1. Of course, telemetry streams may very well contain damaged JPEG images!

Of course. But are they conformant to the JPEG standard? Is there a standard way to repair and display them?

> 2. The problem with bad handling of supplementary characters, which seems to be associated with UTF-16, is that the damage is rarely as obvious as every 128th code unit. By contrast, bad UTF-8 handling usually comes to light as soon as the text processing moves beyond ASCII.

Of course. I could have said "deletes random bytes from a JPEG file." An unpaired surrogate can be detected either immediately, or immediately after the next code unit. In neither case is it to be interpreted as anything other than invalid text.

Philippe Verdy wrote:

> No! The "supplementary code points" (or "supplementary characters" when they are assigned to characters) are represented in UTF-16 as two **code units**, NOT as two "code points" (even if their binary values are related).

Surrogate values are not abstract characters, but they are code points (D10). Note that Surrogate is one of the seven types of code points (D10a).

--
Doug Ewell | http://ewellic.org | Thornton, CO ????

From richard.wordingham at ntlworld.com  Mon Oct 19 15:34:01 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 19 Oct 2015 21:34:01 +0100
Subject: Why Work at Encoding Level?
In-Reply-To:
References: <20151019100731.665a7a7059d7ee80bb4d670165c8327d.1842de28ba.wbe@email03.secureserver.net> <20151019195303.53e8ee83@JRWUBU2>
Message-ID: <20151019213401.5246bdcb@JRWUBU2>

On Mon, 19 Oct 2015 21:35:16 +0200 Philippe Verdy wrote:

> 2015-10-19 20:53 GMT+02:00 Richard Wordingham <richard.wordingham at ntlworld.com>:
>
> > The word 'codepoint' is even worse, as a supplementary plane codepoint is represented by two BMP codepoints.
>
> No! The "supplementary code points" (or "supplementary characters" when they are assigned to characters) are represented in UTF-16 as two **code units**, NOT as two "code points" (even if their binary values are related).

A code point is 'any value in the Unicode codespace' (TUS Section 3.4 D10). The 'Unicode codespace' is a range of integers from 0 to 0x10FFFF (TUS Section 3.4 D9). This works fine so long as one thinks of a 'code point' as just a number. The problem is that people rarely use the term 'scalar value'.

Richard.
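Richard's distinction is easy to demonstrate in Java, the language this thread keeps returning to. A minimal sketch, using only standard java.lang APIs (the class name is invented for illustration):

    public class CodePointsVsUnits {
        public static void main(String[] args) {
            // U+1F600 is one code point (and one scalar value), but two UTF-16 code units.
            String s = "a\uD83D\uDE00";
            System.out.println(s.length());                        // 3 code units
            System.out.println(s.codePointCount(0, s.length()));   // 2 code points
            System.out.println(Character.charCount(0x1F600));      // 2 units for one code point

            // U+D800 is a code point of type Surrogate (gc=Cs), but not a scalar value,
            // so no well-formed UTF can represent it in isolation.
            System.out.println(Character.isValidCodePoint(0xD800));               // true
            System.out.println(Character.getType(0xD800) == Character.SURROGATE); // true
        }
    }

Note that a Java String, being a Unicode 16-bit string rather than guaranteed UTF-16, can still contain a lone surrogate; it is the encoding forms, not the string type, that exclude it.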
From verdy_p at wanadoo.fr  Mon Oct 19 16:17:46 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 19 Oct 2015 23:17:46 +0200
Subject: Unpaired surrogates (was: Re: Why Work at Encoding Level?)
In-Reply-To: <20151019133207.665a7a7059d7ee80bb4d670165c8327d.aec337c813.wbe@email03.secureserver.net>
References: <20151019133207.665a7a7059d7ee80bb4d670165c8327d.aec337c813.wbe@email03.secureserver.net>
Message-ID:

2015-10-19 22:32 GMT+02:00 Doug Ewell:

> Philippe Verdy wrote:
>
> > No! The "supplementary code points" (or "supplementary characters" when they are assigned to characters) are represented in UTF-16 as two **code units**, NOT as two "code points" (even if their binary values are related).
>
> Surrogate values are not abstract characters,

I did NOT write that.

> but they are code points

That's what I wrote; you are merely restating it.

> (D10). Note that Surrogate is one of the seven types of code points (D10a).

I have not denied this. I denied Richard's affirmation that a single (supplementary) code point could be represented as two (surrogate) code points; it was wrong only in its last word ("points" where it should have been "units").

From markus.icu at gmail.com  Mon Oct 19 16:29:29 2015
From: markus.icu at gmail.com (Markus Scherer)
Date: Mon, 19 Oct 2015 14:29:29 -0700
Subject: Unpaired surrogates (was: Re: Why Work at Encoding Level?)
In-Reply-To: <20151019133207.665a7a7059d7ee80bb4d670165c8327d.aec337c813.wbe@email03.secureserver.net>
References: <20151019133207.665a7a7059d7ee80bb4d670165c8327d.aec337c813.wbe@email03.secureserver.net>
Message-ID:

On Mon, Oct 19, 2015 at 1:32 PM, Doug Ewell wrote:

> > ICU (but perhaps it's actually Java) seems to have a culture of tolerating lone surrogates, and rules for handling lone surrogates are strewn across the Unicode standards and annexes.
>
> I suspect you have an example.

I have examples from ICU processing of 16-bit Unicode strings (which are not usually required to be well-formed UTF-16 strings):

- "Count code points" counts an unpaired surrogate as 1.
- "Move forward/backward by n code points" counts an unpaired surrogate as 1.
- "Lower-/title-/upper-case the string" passes through an unpaired surrogate as-is, like any code point that does not have case mappings.
- "Get property x of code point y" returns the property value according to the UCD; for example, gc(surrogate)=Cs.
- Collating a string that contains an unpaired surrogate: ICU currently uses the second approach from UCA section 7.1.1.

See http://userguide.icu-project.org/strings#TOC-ICU:-16-bit-Unicode-strings

However, "convert from UTF-16 to UTF-8" and such treats an unpaired surrogate as an error.

> > The Unicode collation algorithm conformance test once tested that implementations of collation collated lone surrogates correctly. Raising an exception was an automatic test failure! By contrast, no-one has proposed collation rules for broken bits of UTF-8 characters or non-minimal length forms.
>
> Are these tests still included, or did someone notice that they were in conflict with the standard and remove them?

We updated http://www.unicode.org/Public/UCA/latest/CollationTest.html to say: "These files contain test cases that include ill-formed strings, with surrogate code points. Implementations that do not weight surrogate code points the same way as reserved code points may filter out such lines in the test cases, before testing for conformance."

Best regards,
markus
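Markus's split between lenient string operations and strict conversion can be reproduced without ICU. A sketch against the standard java.nio.charset API (the class name is invented; the replacement behavior shown is what the JDK's UTF-8 charset does by default):

    import java.nio.CharBuffer;
    import java.nio.charset.*;

    public class LoneSurrogateConversion {
        public static void main(String[] args) {
            String lone = "a\uD800b"; // a 16-bit Unicode string with an unpaired high surrogate

            // String operations count and pass through the lone surrogate:
            System.out.println(lone.codePointCount(0, lone.length())); // 3: it counts as 1
            System.out.println(lone.toUpperCase().charAt(1) == '\uD800'); // true: passed through

            // Lenient, lossy conversion: String.getBytes uses the REPLACE action,
            // so the lone surrogate becomes the charset's replacement byte ('?'):
            System.out.println(lone.getBytes(StandardCharsets.UTF_8).length); // 3

            // Strict conversion: REPORT turns the same input into an exception.
            CharsetEncoder strict = StandardCharsets.UTF_8.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                strict.encode(CharBuffer.wrap(lone));
            } catch (CharacterCodingException e) {
                System.out.println("rejected: " + e); // MalformedInputException
            }
        }
    }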
From richard.wordingham at ntlworld.com  Mon Oct 19 19:07:12 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 20 Oct 2015 01:07:12 +0100
Subject: Unpaired surrogates
In-Reply-To: <20151019133207.665a7a7059d7ee80bb4d670165c8327d.aec337c813.wbe@email03.secureserver.net>
References: <20151019133207.665a7a7059d7ee80bb4d670165c8327d.aec337c813.wbe@email03.secureserver.net>
Message-ID: <20151020010712.31a6d29c@JRWUBU2>

On Mon, 19 Oct 2015 13:32:07 -0700 "Doug Ewell" wrote:

> Richard Wordingham wrote:
>
> > It was once the case that basic Unicode support in regular expressions required a regular expression engine to be able to search for specified lone surrogates - a real show-stopper for an engine working in UTF-8. The Unicode collation algorithm conformance test once tested that implementations of collation collated lone surrogates correctly. Raising an exception was an automatic test failure! By contrast, no-one has proposed collation rules for broken bits of UTF-8 characters or non-minimal length forms.
>
> Are these tests still included, or did someone notice that they were in conflict with the standard and remove them?

Markus Scherer has answered this question as it applies to collation.

For regular expressions, Requirement RL1.7 'Supplementary Code Points' still reads:

"To meet this requirement, an implementation shall handle the full range of Unicode code points, including values from U+FFFF to U+10FFFF. In particular, where UTF-16 is used, a sequence consisting of a leading surrogate followed by a trailing surrogate shall be handled as a single code point in matching."

Now, as we know, UTF-32 does not handle the full range of Unicode code points; it only handles scalar values. In the discussion of UTS#18 RL1.7, my objections did result in the addition of:

"Note: It is permissible, but not required, to match an isolated surrogate code point (such as \u{D800}), which may occur in Unicode Strings. See Unicode String in the Unicode glossary."

I'm not sure that that text, loosely associated with RL1.7, gets round Requirement RL1.1, which still reads:

"To meet this requirement, an implementation shall supply a mechanism for specifying any Unicode code point (from U+0000 to U+10FFFF), using the hexadecimal code point representation."

Possibly a compliant implementation needs to parse hex codes for surrogate code points, even if only to reject input containing them, or to interpret them as a perverse alternative syntax for the perverse expression \p{^any}. Or is \p{^any} actually matched by isolated non-ASCII UTF-8 code units? As there is no requirement for a regular expression engine conforming to UTS#18 'Unicode Regular Expressions' to handle non-conformant Unicode strings, this need not be a problem.

Richard.
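For what it's worth, java.util.regex, which works over 16-bit strings but matches by code point, behaves the way the RL1.7 note permits. A small probe; the results are as observed on recent JDKs, not something the Pattern documentation promises:

    import java.util.regex.Pattern;

    public class LoneSurrogateRegex {
        public static void main(String[] args) {
            Pattern p = Pattern.compile("\uD800"); // a literal high surrogate

            // An isolated surrogate code point in the input can be matched...
            System.out.println(p.matcher("a\uD800b").find());       // true

            // ...but the same code unit inside a well-formed pair is not,
            // because the engine sees the pair as the single code point U+10000:
            System.out.println(p.matcher("a\uD800\uDC00b").find()); // false
        }
    }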
From verdy_p at wanadoo.fr  Tue Oct 20 05:06:35 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Tue, 20 Oct 2015 12:06:35 +0200
Subject: Unpaired surrogates
In-Reply-To: <20151020010712.31a6d29c@JRWUBU2>
References: <20151019133207.665a7a7059d7ee80bb4d670165c8327d.aec337c813.wbe@email03.secureserver.net> <20151020010712.31a6d29c@JRWUBU2>
Message-ID:

2015-10-20 2:07 GMT+02:00 Richard Wordingham <richard.wordingham at ntlworld.com>:

> Now, as we know, UTF-32 does not handle the full range of Unicode code points;

??? All valid UTFs handle the full range of valid Unicode code points. This includes UTF-32 as well as UTF-16 and UTF-8 (and their variants).

> it only handles scalar values.

??? UTFs allow encoding ANY valid scalar value (the scalar values are bijectively associated with a subset of the valid code points). However, they don't allow encoding surrogates (which are valid code points, but are not assigned any scalar value, and so are not valid in any valid UTF). Clearly you are still confusing code points, code units and scalar values.

> In the discussion of UTS#18 RL1.7, my objections did result in the addition of:
>
> "Note: It is permissible, but not required, to match an isolated surrogate code point (such as \u{D800}), which may occur in Unicode Strings. See Unicode String in the Unicode glossary."
>
> I'm not sure that that text, loosely associated with RL1.7, gets round Requirement RL1.1, which still reads:
>
> "To meet this requirement, an implementation shall supply a mechanism for specifying any Unicode code point (from U+0000 to U+10FFFF), using the hexadecimal code point representation."

I'm also puzzled about how such a regexp will really match some input text if that input text has to be in a valid UTF. The regexp "\u{D800}" will likely match only lone surrogates (in any UTF), not a surrogate with the same value which is correctly paired to encode a supplementary code point.

Note that in **valid** UTF-8 text, U+D800 cannot occur. But if you remove the "valid" restriction, U+D800 may be present, including before U+DC00, and this still won't form a valid pair: these are also lone surrogates in this case (they are paired, and encode a supplementary code point, only if the text is UTF-16). There are no valid surrogate pairs in valid UTF-8 or valid UTF-32, so if surrogates appear there, they are all "lone" surrogates. If you blindly convert from UTF-8 or UTF-32 to UTF-16, the invalid text could become valid, and new valid supplementary code points will appear unexpectedly. That's why lone surrogates cannot be part of any valid UTF: they break the bijection.

From frederic.grosshans at gmail.com  Tue Oct 20 05:14:22 2015
From: frederic.grosshans at gmail.com (Frédéric Grosshans)
Date: Tue, 20 Oct 2015 12:14:22 +0200
Subject: Non-standard 8-bit fonts still in use
In-Reply-To: <56204330.6010106@bisharat.net>
References: <56204330.6010106@bisharat.net>
Message-ID: <562613FE.8050001@gmail.com>

On 16/10/2015 02:22, Don Osborn wrote:

> I was surprised to learn of continued reference to and presumably use of 8-bit fonts modified two decades ago for the extended Latin alphabets of Malian languages, and wondered if anyone has similar observations in other countries. Or if there have been any recent studies of adoption of Unicode fonts in the place of local 8-bit fonts for extended Latin (or non-Latin) in local language computing.

A different usage where I suspect proprietary 8-bit fonts are used is electronic French (Grandjean) stenotypes, which use some non-Unicode characters (like an E without the middle bar). They have apparently been used with computer software since the 1980s (cf. https://hal.archives-ouvertes.fr/jpa-00245165/document [pdf in French]) to make live subtitles. But I guess the proprietary nature of these characters and their use by a single company (since ~1910) make their encoding in Unicode unlikely.
Frédéric

From asmus-inc at ix.netcom.com  Tue Oct 20 06:29:17 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Tue, 20 Oct 2015 04:29:17 -0700
Subject: Unpaired surrogates
In-Reply-To:
References: <20151019133207.665a7a7059d7ee80bb4d670165c8327d.aec337c813.wbe@email03.secureserver.net> <20151020010712.31a6d29c@JRWUBU2>
Message-ID: <5626258D.90408@ix.netcom.com>

An HTML attachment was scrubbed...

From mark at macchiato.com  Tue Oct 20 20:23:17 2015
From: mark at macchiato.com (Mark Davis ☕️)
Date: Tue, 20 Oct 2015 18:23:17 -0700
Subject: Why Work at Encoding Level?
In-Reply-To: <20151019100731.665a7a7059d7ee80bb4d670165c8327d.1842de28ba.wbe@email03.secureserver.net>
References: <20151019100731.665a7a7059d7ee80bb4d670165c8327d.1842de28ba.wbe@email03.secureserver.net>
Message-ID:

> there is never any excuse for software to create unpaired surrogates, or any other sort of invalid code unit sequences

First off, it depends on when one is encountered. They are invalid in UTF-16, but are permitted in a Unicode 16-bit string.

But more fundamentally, there may not be "excuses" for such software, but it happens anyway. Pretending it doesn't makes for unhappy customers. For example, you don't want to be throwing an exception when one is encountered, when that could cause an app to fail. So the point is to handle the situation as gracefully, consistently, and safely as possible. And 'safely' is key. Pretending that it doesn't exist is logically equivalent to deletion, and can cause security problems (see UTR #36).

Mark

On Mon, Oct 19, 2015 at 10:07 AM, Doug Ewell wrote:

> This discussion was originally about how to handle unpaired surrogates, as if that were a normal use case.
>
> Regardless of what encoding model is used to handle characters under the hood, and regardless of how the Delete key should work with actual characters or clusters, there is never any excuse for software to create unpaired surrogates, or any other sort of invalid code unit sequences. That is like having an image editor that deletes every 128th byte from a JPEG file, and then worrying about how to display the file.
>
> --
> Doug Ewell | http://ewellic.org | Thornton, CO ????

From daniel.buenzli at erratique.ch  Tue Oct 20 20:47:45 2015
From: daniel.buenzli at erratique.ch (Daniel Bünzli)
Date: Wed, 21 Oct 2015 02:47:45 +0100
Subject: Why Work at Encoding Level?
In-Reply-To:
References: <20151019100731.665a7a7059d7ee80bb4d670165c8327d.1842de28ba.wbe@email03.secureserver.net>
Message-ID: <2BA47FCE32444BEA84064E15DA74C5EC@erratique.ch>

On Wednesday, 21 October 2015 at 02:23, Mark Davis ☕️ wrote:

> But more fundamentally, there may not be "excuses" for such software, but it happens anyway. Pretending it doesn't makes for unhappy customers. For example, you don't want to be throwing an exception when one is encountered, when that could cause an app to fail.

It does happen at the input layer, but it doesn't make any sense to bother programmers with this once the IO boundary has been crossed and decoding errors have been handled. A good Unicode string in a programming language should at least operate at the scalar value level, and these notions of Unicode n-bit strings should definitely be killed (that might have inspired the hopeless designers of recent programming languages to actually make better choices on this topic).

Best,

Daniel
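Daniel's "handle it once at the IO boundary" and Mark's "replace rather than delete" advice (the UTR #36 point) combine into a sanitizer along the following lines. A hedged sketch in Java; the class and helper names are invented:

    public class Sanitize {
        // Replace every unpaired surrogate with U+FFFD, keeping well-formed pairs.
        // Replacement (rather than silent deletion) avoids the security problems
        // that deletion can cause.
        static String replaceLoneSurrogates(String s) {
            StringBuilder sb = new StringBuilder(s.length());
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                if (Character.isHighSurrogate(c) && i + 1 < s.length()
                        && Character.isLowSurrogate(s.charAt(i + 1))) {
                    sb.append(c).append(s.charAt(++i)); // well-formed pair: keep as is
                } else if (Character.isSurrogate(c)) {
                    sb.append('\uFFFD');                // lone surrogate: replace, don't delete
                } else {
                    sb.append(c);
                }
            }
            return sb.toString();
        }

        public static void main(String[] args) {
            // "a" + lone high surrogate + "b" + a well-formed pair (U+1F600):
            // prints "a" U+FFFD "b" U+1F600
            System.out.println(replaceLoneSurrogates("a\uD800b\uD83D\uDE00"));
        }
    }

Run once at input time, this leaves the rest of the program working with scalar values only, which is exactly the invariant Daniel argues for.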
From mark at macchiato.com  Tue Oct 20 22:37:54 2015
From: mark at macchiato.com (Mark Davis ☕️)
Date: Tue, 20 Oct 2015 20:37:54 -0700
Subject: Why Work at Encoding Level?
In-Reply-To: <2BA47FCE32444BEA84064E15DA74C5EC@erratique.ch>
References: <20151019100731.665a7a7059d7ee80bb4d670165c8327d.1842de28ba.wbe@email03.secureserver.net> <2BA47FCE32444BEA84064E15DA74C5EC@erratique.ch>
Message-ID:

> A good Unicode string in a programming language

Yes, that would be great, no question. It isn't, however, the case in most programming languages (measured by the amount of software written in them).

The original question that started these threads was how to handle isolated surrogates. If you are lucky enough to be only ever using programming languages that prevent that from ever happening, then the question is moot for you. If you're not, the question is relevant.

Mark

On Tue, Oct 20, 2015 at 6:47 PM, Daniel Bünzli wrote:

> It does happen at the input layer, but it doesn't make any sense to bother programmers with this once the IO boundary has been crossed and decoding errors have been handled. A good Unicode string in a programming language should at least operate at the scalar value level, and these notions of Unicode n-bit strings should definitely be killed (that might have inspired the hopeless designers of recent programming languages to actually make better choices on this topic).

From charupdate at orange.fr  Wed Oct 21 01:21:39 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Wed, 21 Oct 2015 08:21:39 +0200 (CEST)
Subject: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)
In-Reply-To: <26364438.57814.1445015417239.JavaMail.defaultUser@defaultHost>
References: <33365178.33329.1444395933996.JavaMail.defaultUser@defaultHost> <6041047.15822.1444472090118.JavaMail.defaultUser@defaultHost> <26364438.57814.1445015417239.JavaMail.defaultUser@defaultHost>
Message-ID: <1732752496.822.1445408499446.JavaMail.www@wwinf1h15>

On Fri, 16 Oct 2015 18:10:17 +0100 (BST), William_J_G Overington wrote:

> What is the scope of Unicode please?

An accurate idea of the scope of Unicode is best found on this page: http://www.unicode.org/consortium/consort.html

> Can it ever change?

Even though I'm not in a position to write on behalf of the UTC, I can express my opinion that it would take a major upheaval for the scope of Unicode to change.

> If it can change, who makes the decision? For example, does it need an ISO decision at a level higher than the WG2 committee or can the WG2 committee do it if it so pleases?

While delicately preserving respectful cooperation, Unicode has gained thorough leadership in character encoding and process standardization engineering. And it is an autonomous consortium of powerful companies and renowned institutions.

> How can a person apply for the scope of Unicode to become changed please?
Individuals cannot apply for changes, but they can submit feedback, suggestions, and proposals, and they are granted free access to registers and archives, all of which is already an enormous privilege. This is all the more striking as several recent threads on this List have pointed to the opposite practice at the other standards body invoked above.

Kind regards,

Marcel

From daniel.buenzli at erratique.ch  Wed Oct 21 08:16:07 2015
From: daniel.buenzli at erratique.ch (Daniel Bünzli)
Date: Wed, 21 Oct 2015 14:16:07 +0100
Subject: Why Work at Encoding Level?
In-Reply-To:
References: <20151019100731.665a7a7059d7ee80bb4d670165c8327d.1842de28ba.wbe@email03.secureserver.net> <2BA47FCE32444BEA84064E15DA74C5EC@erratique.ch>
Message-ID:

On Wednesday, 21 October 2015 at 04:37, Mark Davis ☕️ wrote:

> If you're not, the question is relevant.

I'm not disputing the question; I'm disputing trying to give it a defined answer. Even if your string is UTF-16 based, these problems can be solved by providing proper abstractions at the library level and asking clients to handle the problem *once*, when they inject UTF-16 strings into your abstraction, which can then operate in a "clean" world where these questions do not arise.

Besides, programming languages do evolve, and one should at least make sure that new languages provide adequate abstractions for handling Unicode text. Looking at the recent batch of new languages, I don't think this is happening. I'm sure language designers are keen on taking off-the-shelf designs for this rather than getting into the details, but I would say that TUS, by defining notions of Unicode strings at the encoding level, is not doing a very good job of providing one.

FWIW, when I got into the standard around 2008 by reading that thick hard copy of TUS 5.0, it took me quite some time to actually understand and uncover the real structure behind Unicode, which is the scalar values.

Best,

Daniel

From richard.wordingham at ntlworld.com  Wed Oct 21 13:13:19 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 21 Oct 2015 19:13:19 +0100
Subject: Why Work at Encoding Level?
In-Reply-To:
References: <20151019100731.665a7a7059d7ee80bb4d670165c8327d.1842de28ba.wbe@email03.secureserver.net> <2BA47FCE32444BEA84064E15DA74C5EC@erratique.ch>
Message-ID: <20151021191319.28f710e4@JRWUBU2>

On Wed, 21 Oct 2015 14:16:07 +0100 Daniel Bünzli wrote:

> I'm not disputing the question; I'm disputing trying to give it a defined answer. Even if your string is UTF-16 based, these problems can be solved by providing proper abstractions at the library level and asking clients to handle the problem *once*, when they inject UTF-16 strings into your abstraction, which can then operate in a "clean" world where these questions do not arise.

That sounds good, but would you please talk me through how you apply it in the TSF method InsertTextAtSelection. Remember that the user may have switched input method several times.

Richard.

From mark at macchiato.com  Wed Oct 21 13:43:02 2015
From: mark at macchiato.com (Mark Davis ☕️)
Date: Wed, 21 Oct 2015 11:43:02 -0700
Subject: Why Work at Encoding Level?
In-Reply-To:
References: <20151019100731.665a7a7059d7ee80bb4d670165c8327d.1842de28ba.wbe@email03.secureserver.net> <2BA47FCE32444BEA84064E15DA74C5EC@erratique.ch>
Message-ID:

Mark

On Wed, Oct 21, 2015 at 6:16 AM, Daniel Bünzli wrote:

> I'm not disputing the question; I'm disputing trying to give it a defined answer. Even if your string is UTF-16 based, these problems can be solved by providing proper abstractions at the library level and asking clients to handle the problem *once*, when they inject UTF-16 strings into your abstraction, which can then operate in a "clean" world where these questions do not arise.

Again, a nice thought, and I am sympathetic to what you want. But for most people it runs into the brick wall of reality.

Let's take Java for example. You could clearly write your own StringX class that was logically UTF-32 (like the Uniform model in http://macchiati.blogspot.com/2012/07/unicode-string-models-many-programming.html). But modern products use countless libraries to do their work, so you'll end up converting every time you call one of those libraries or get back a result. In the end, it might make your piece of code more reliable, but there will be a certain cost. And you are still dependent on those other libraries.

Moreover, a key problem is the indexes. When you are calling out to an API that takes a String and an index into that string, you could have a simple method to return a String (if that is your internal representation). But you will have to convert from your code point index to the API's code unit index. That either involves storing an interesting data structure in your StringX object, or doing a scan, which is relatively expensive.

> Besides, programming languages do evolve, and one should at least make sure that new languages provide adequate abstractions for handling Unicode text. Looking at the recent batch of new languages, I don't think this is happening. I'm sure language designers are keen on taking off-the-shelf designs for this rather than getting into the details, but I would say that TUS, by defining notions of Unicode strings at the encoding level, is not doing a very good job of providing one.

Unicode evolved over time, and had pretty severe constraints when it originated. I agree that for a new language it would be cleaner to have a Uniform model.

> FWIW, when I got into the standard around 2008 by reading that thick hard copy of TUS 5.0, it took me quite some time to actually understand and uncover the real structure behind Unicode, which is the scalar values.

Asmus put it nicely (why the thread split, I don't know):

"When it comes to methods operating on buffers there's always the tension between viewing the buffer as text elements vs. as data elements. For some purposes, from error detection to data cleanup you need to be able to treat the buffer as data elements. For many other operations, a focus on text elements is enough. If you desire to have a regex that you can use to validate a raw buffer, then that regex must do something sensible with partial code points. If you don't have multiple regex engines, then limiting your single one to valid input prevents you from using it everywhere."
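The index conversion Mark describes is visible in the standard String API itself: offsetByCodePoints performs exactly that relatively expensive linear scan. A small sketch:

    public class IndexConversion {
        public static void main(String[] args) {
            String s = "x\uD83D\uDE00y"; // x, U+1F600, y

            // Code point index -> code unit index: a linear scan.
            int unitIndex = s.offsetByCodePoints(0, 2);
            System.out.println(unitIndex);            // 3, because U+1F600 took two units
            System.out.println(s.charAt(unitIndex));  // 'y'

            // Code unit index -> code point index: also a scan.
            System.out.println(s.codePointCount(0, unitIndex)); // 2
        }
    }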
From daniel.buenzli at erratique.ch  Wed Oct 21 13:50:32 2015
From: daniel.buenzli at erratique.ch (Daniel Bünzli)
Date: Wed, 21 Oct 2015 19:50:32 +0100
Subject: Why Work at Encoding Level?
In-Reply-To: <20151021191319.28f710e4@JRWUBU2>
References: <20151019100731.665a7a7059d7ee80bb4d670165c8327d.1842de28ba.wbe@email03.secureserver.net> <2BA47FCE32444BEA84064E15DA74C5EC@erratique.ch> <20151021191319.28f710e4@JRWUBU2>
Message-ID: <68B9534C1C664F329002B97FF3394EA0@erratique.ch>

On Wednesday, 21 October 2015 at 19:13, Richard Wordingham wrote:

> That sounds good, but would you please talk me through how you apply it in the TSF method InsertTextAtSelection. Remember that the user may have switched input method several times.

Sorry, I don't know these acronyms or methods. Interaction with the input method should always eventually yield a stream of scalar values; if it's badly designed, you should try to abstract it so that it provides the right mechanism for you.

Daniel

From daniel.buenzli at erratique.ch  Wed Oct 21 14:42:30 2015
From: daniel.buenzli at erratique.ch (Daniel Bünzli)
Date: Wed, 21 Oct 2015 20:42:30 +0100
Subject: Why Work at Encoding Level?
In-Reply-To:
References: <20151019100731.665a7a7059d7ee80bb4d670165c8327d.1842de28ba.wbe@email03.secureserver.net> <2BA47FCE32444BEA84064E15DA74C5EC@erratique.ch>
Message-ID: <24F8D8CBEBBC400C942C73975A063608@erratique.ch>

On Wednesday, 21 October 2015 at 19:43, Mark Davis ☕️ wrote:

> Moreover, a key problem is the indexes. When you are calling out to an API that takes a String and an index into that string, you could have a simple method to return a String (if that is your internal representation). But you will have to convert from your code point index to the API's code unit index. That either involves storing an interesting data structure in your StringX object, or doing a scan, which is relatively expensive.

I'm not sure I fully understand what you wanted to say here, so I'm just trying to respond to the last sentence. You can have an abstract datatype that *represents* a scalar value index in a string: it knows the exact byte index (or underlying storage element) at which the scalar value starts in the string, but it hides the actual value from you. This allows you to access the scalar value directly, without having to scan or store an interesting data structure in StringX, while avoiding direct access to the underlying encoding.

The idea here is that direct random indexing is rarely needed; what happens most of the time is that you need to remember specific points in the string during a string traversal - for example, think about delineating the substrings matching a pattern. Whenever you hit these points, the traversal function knows the exact byte index and can be used to yield values of the abstract index datatype.

> Unicode evolved over time, and had pretty severe constraints when it originated.

Sure. What I'm trying to say here is that its presentation could maybe be modernized a bit, by putting a greater emphasis on scalar values and less on their encoding. This could improve the messy conceptual model of Unicode I tend to find in the brains of my programmer peers.

> Asmus put it nicely (why the thread split, I don't know):
>
> "When it comes to methods operating on buffers there's always the tension between viewing the buffer as text elements vs. as data elements. For some purposes, from error detection to data cleanup you need to be able to treat the buffer as data elements. [...] If you desire to have a regex that you can use to validate a raw buffer, then that regex must do something sensible with partial code points."

I personally don't think this is a good or desirable way of operating. Sanitize inputs and treat encoding errors first, at the IO boundary of your program; then process the cleaned-up data, on which you know strong invariants hold.

Best,

Daniel
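Daniel's abstract index datatype might look like this in Java terms. All names here are invented, and the underlying storage is a UTF-16 String rather than bytes, but the shape is the same: Pos values are only minted by the traversal, so clients never touch raw code unit offsets directly:

    public final class ScalarText {
        private final String s; // underlying UTF-16 storage, hidden from clients
        public ScalarText(String s) { this.s = s; }

        // Abstract index: records where a scalar value starts, hides the offset.
        public static final class Pos {
            final int unitOffset;
            Pos(int unitOffset) { this.unitOffset = unitOffset; }
        }

        public interface Visitor { void visit(Pos p, int scalarValue); }

        // Traversal is the only source of Pos values.
        public void forEachScalar(Visitor v) {
            for (int i = 0; i < s.length(); ) {
                int cp = s.codePointAt(i);
                v.visit(new Pos(i), cp);
                i += Character.charCount(cp);
            }
        }

        // Direct access via a remembered Pos: no rescan, no exposed encoding.
        public int scalarAt(Pos p) { return s.codePointAt(p.unitOffset); }

        public static void main(String[] args) {
            new ScalarText("x\uD83D\uDE00y").forEachScalar(
                (p, cp) -> System.out.printf("U+%04X starts at unit %d%n", cp, p.unitOffset));
        }
    }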
From richard.wordingham at ntlworld.com  Wed Oct 21 16:50:10 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 21 Oct 2015 22:50:10 +0100
Subject: Why Work at Encoding Level?
In-Reply-To: <68B9534C1C664F329002B97FF3394EA0@erratique.ch>
References: <20151019100731.665a7a7059d7ee80bb4d670165c8327d.1842de28ba.wbe@email03.secureserver.net> <2BA47FCE32444BEA84064E15DA74C5EC@erratique.ch> <20151021191319.28f710e4@JRWUBU2> <68B9534C1C664F329002B97FF3394EA0@erratique.ch>
Message-ID: <20151021225010.07598367@JRWUBU2>

On Wed, 21 Oct 2015 19:50:32 +0100 Daniel Bünzli wrote:

> Sorry, I don't know these acronyms or methods. Interaction with the input method should always eventually yield a stream of scalar values; if it's badly designed, you should try to abstract it so that it provides the right mechanism for you.

The simpler-looking input methods provide and then delete text. For example, I have a Keyman for Linux input editor based on XSAMPA in which I can successively input e_H\ to get the successive text displays e e_ ? e?. (The latter includes a spacing tone mark.) The input editor knows whether my application has <U+0065, U+0301> or U+00E9 LATIN SMALL LETTER E WITH ACUTE in my backing store when I strike the backslash - there is a callback for this very purpose, though the input editor does have fallback logic, which is needed when it uses the X protocols. It uses the GTK+ interface with GTK+ applications, and sends the commands "delete one character before the cursor" in each case and "insert <U+00E9>" or "insert <U+0065, U+0301>" accordingly.

Now, the GTK+ commands function in terms of scalar values, which should be nice. However, notice that text and positions go in both directions across the interface. The Text Services Framework on Windows works similarly, but its commands seem to be expressed in terms of absolute UTF-16 positions. Abstraction may move the problem, but it doesn't eliminate it. The best one can hope for is a reusable abstraction.

Richard.

From mark at kli.org  Wed Oct 21 18:50:48 2015
From: mark at kli.org (Mark E. Shoulson)
Date: Wed, 21 Oct 2015 19:48:21 -0400
Subject: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)
In-Reply-To: <26364438.57814.1445015417239.JavaMail.defaultUser@defaultHost>
References: <33365178.33329.1444395933996.JavaMail.defaultUser@defaultHost> <6041047.15822.1444472090118.JavaMail.defaultUser@defaultHost> <26364438.57814.1445015417239.JavaMail.defaultUser@defaultHost>
Message-ID: <562824D8.7050006@kli.org>

On 10/16/2015 01:10 PM, William_J_G Overington wrote:

> I have been considering how to make progress with trying for my research to become implemented in a standardized manner.
>
> I have been informed that a group of people have examined the document that I submitted and determined that it is out of scope for UTC.
There are millions of people on this great globe doing all kinds of research into all kinds of things. Most of them somehow manage to do so without requiring an international standards body to change its workings and basic outlook to accommodate them. It staggers the imagination that your research simply cannot be done without the cooperation of Unicode, and moreover, that you have the nerve to ask for it to change its *entire scope* just so that your personal project, stalled by your own hand, can move forward. Learn how all those millions of people out there manage to do their work and further their research without calling on multinational bodies to bend to their whims. It must be possible; everyone else seems to be able to do it.

The only thing stopping your research from progressing and standardizing is you. Unicode isn't doing what you want? Make your own standard. Make it standard for *your* stuff. Get people to like it and use it. You cannot expect Unicode to change to be what you want any time in the foreseeable future; make do without it.

Please. Grow up and take responsibility for your own research, and stop trying to bend Unicode into what YOU think it should be, when the clear consensus is that it isn't. The rest of us are tired of having to answer this question (or see it answered) over and over.

~mark

From wjgo_10009 at btinternet.com  Thu Oct 22 04:21:43 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Thu, 22 Oct 2015 10:21:43 +0100 (BST)
Subject: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)
In-Reply-To: <33232474.13418.1445504878174.JavaMail.root@webmail02.bt.ext.cpcloud.co.uk>
References: <33232474.13418.1445504878174.JavaMail.root@webmail02.bt.ext.cpcloud.co.uk>
Message-ID: <22651988.15114.1445505703382.JavaMail.defaultUser@defaultHost>

Mark E. Shoulson wrote:

> Unicode isn't doing what you want? Make your own standard. Make it standard for *your* stuff. Get people to like it and use it.

Unicode and the International Standard with which it is synchronized are the standards.

I submitted a rewritten document on Monday 19 October 2015.

The document is available on the web.

http://www.users.globalnet.co.uk/~ngo/a_preliminary_proposal_to_encode_two_base_characters.pdf

It is linked from the following web page.

http://www.users.globalnet.co.uk/~ngo/library.htm

The document has been deposited, as an email attachment, with the British Library for Legal Deposit, and a receipt has been received.

Here is a link about Legal Deposit in the United Kingdom.

http://www.bl.uk/aboutus/legaldeposit/index.html

William Overington

22 October 2015

From rick at unicode.org  Thu Oct 22 11:54:01 2015
From: rick at unicode.org (Rick McGowan)
Date: Thu, 22 Oct 2015 09:54:01 -0700
Subject: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)
In-Reply-To: <26364438.57814.1445015417239.JavaMail.defaultUser@defaultHost>
References: <33365178.33329.1444395933996.JavaMail.defaultUser@defaultHost> <6041047.15822.1444472090118.JavaMail.defaultUser@defaultHost> <26364438.57814.1445015417239.JavaMail.defaultUser@defaultHost>
Message-ID: <562914A9.6030802@unicode.org>

Hello William,

Answers to most of your questions can be found among the pages of the Unicode Consortium website.
I'll try to answer your questions about scope which may also be of interest to other subscribers, but please note that *everything I say in this e-mail is solely my own opinion and does not reflect the opinions or policies of Unicode, Inc, or any of its committees.*

> What is the scope of Unicode please?

The scope of The Unicode *Standard* (TUS) is set forth in Chapter 1, which you can find here: http://www.unicode.org/versions/Unicode8.0.0/ch01.pdf

The scope of the Unicode *Consortium* is essentially distilled in the mission statement, which is on the home page: http://www.unicode.org/ and on the "What is Unicode" page here: http://www.unicode.org/standard/WhatIsUnicode.html under the heading "About the Unicode Consortium"... and formally here, in the corporate bylaws: http://www.unicode.org/consortium/Unicode-Bylaws.pdf under "Article I - Purpose and Membership", which says:

...This Corporation's specific purpose shall be to enable people around the world to use computers in any language, by providing freely-available specifications and data to form the foundation for software internationalization in all major operating systems, search engines, applications, and the World Wide Web. An essential part of this purpose is to standardize, maintain, educate and engage academic and scientific communities, and the general public about, make publicly available, promote, and disseminate to the public a standard character encoding that provides for an allocation for more than a million characters.

> Can it ever change?

The answer to that question depends on what you mean by "it", and "change", really. The scope of the *standard* has changed several times over the course of its history, as has the scope of the *consortium*, for good reasons. For example, the corporate scope was expanded to include a variety of standards beyond just the character encoding standard, which were of interest to members (and continue to be of interest). The scope of the *standard* was expanded to include code space for more than 65,536 characters, to include characters needed for historical scripts, and so forth.

> If it can change, who makes the decision? For example, does it need an ISO decision at a level higher than the WG2 committee or can the WG2 committee do it if it so pleases?

Like any *corporation*, the Unicode Consortium bylaws are subject to changes from time to time. The full members, as set forth in the bylaws, are the ones who may make changes to the bylaws. There are some restrictions, of course, such as operating within various legal parameters and within the scope of a public-benefit charitable organization, as defined under US law.

The *standard* is mainly controlled by the Unicode Technical Committee, operating under the TC Procedures laid out here: http://www.unicode.org/consortium/tc-procedures.html and subject to interpretation or restriction by the officers and board of directors. The UTC works very closely with members of ISO/IEC JTC1/SC2 and the working group WG2 under it. (You can find out about ISO procedures and so forth on their site.)

> How can a person apply for the scope of Unicode to become changed please?

The most direct way to influence the scope of the Unicode Standard is through becoming a full member of the consortium: http://www.unicode.org/consortium/join.html so that you can vote in corporate meetings and for members of the board, as well as in technical committees.
Then, presumably, you would go to an annual members' meeting (or call for a special meeting) and present your case for the scope of the consortium to be changed. Then, if you want to change the scope of The Unicode Standard, you call for a vote in the UTC and achieve a majority of votes on whatever resolution you put to the committee. This is *intentionally* a weighty process.

> I have been considering how to make progress with trying for my research to become implemented in a standardized manner.

Personally, I think you're getting ahead of yourself. First, you should demonstrate that you have done research and produced results that at least some people find so useful and important that they are eager to implement the findings. Then, once you have done that, think about standardizing something, but only after you have a *working model* of the thing sufficient to demonstrate its general utility.

While I do not speak for the UTC in any way, observations of the committee over a period of some years have led me to conclude that they never encode something, call it "X", on pure speculation that some future research might result in "X" being useful for some purpose that has not even been demonstrated as a need, or clearly enough articulated to engender the committee's confidence in its potential utility.

Regards,
Rick

From asmus-inc at ix.netcom.com  Thu Oct 22 15:59:01 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Thu, 22 Oct 2015 13:59:01 -0700
Subject: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)
In-Reply-To: <562914A9.6030802@unicode.org>
References: <33365178.33329.1444395933996.JavaMail.defaultUser@defaultHost> <6041047.15822.1444472090118.JavaMail.defaultUser@defaultHost> <26364438.57814.1445015417239.JavaMail.defaultUser@defaultHost> <562914A9.6030802@unicode.org>
Message-ID: <56294E15.8050800@ix.netcom.com>

On 10/22/2015 9:54 AM, Rick McGowan wrote:

> Personally, I think you're getting ahead of yourself. First, you should demonstrate that you have done research and produced results that at least some people find so useful and important that they are eager to implement the findings. Then, once you have done that, think about standardizing something, but only after you have a *working model* of the thing sufficient to demonstrate its general utility.
>
> While I do not speak for the UTC in any way, observations of the committee over a period of some years have led me to conclude that they never encode something, call it "X", on pure speculation that some future research might result in "X" being useful for some purpose that has not even been demonstrated as a need, or clearly enough articulated to engender the committee's confidence in its potential utility.

To the degree that one can make a "general" statement, this pretty much sums it up - and, as my experience with standards, both as a developer and a consumer of them, has convinced me, this is absolutely the right approach. Well-intentioned innovation should not be what drives standards. "Standard" practice is what should drive them. Seeming exceptions, such as programming language standards, have a strong community that works on testing and trying out new language features ahead of their being added to the language, so that people have a good idea how they will pan out - but even there, the occasional feature ends up stillborn.
As concerns the original question, Rick is absolutely correct. The scope of the consortium is set by its members. Formally by the full members, but with an ear to what will make the Consortium strong (that is, attract all classes of members and technical experts).

Originally, of course, there wasn't a Consortium. Just a number of people working on a common goal. They called themselves the Unicode Working Group, and developed a complete 700+ page draft before deciding on the Consortium structure as the most appropriate for their work.

This path, of creating a new structure at the end of an informal collaboration of like-minded people, is one of the more successful routes for new projects and ideas to spread. If an idea or concept is powerful enough to generate committed interest, then that is a good predictor of future staying power. Conversely, if you cannot get people to work with you informally, trying to "make" some existing formal group accept your ideas isn't going to lead to any better results.

A./

From mark at kli.org  Thu Oct 22 19:48:21 2015
From: mark at kli.org (Mark E. Shoulson)
Date: Thu, 22 Oct 2015 20:48:21 -0400
Subject: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)
In-Reply-To: <22651988.15114.1445505703382.JavaMail.defaultUser@defaultHost>
References: <33232474.13418.1445504878174.JavaMail.root@webmail02.bt.ext.cpcloud.co.uk> <22651988.15114.1445505703382.JavaMail.defaultUser@defaultHost>
Message-ID: <562983D5.2090808@kli.org>

It's nice that you've written proposals. I suppose the various groups will pick them up and get back to you as they usually do. But if they say "no, you're out of scope" again, it probably means that you're out of scope, and submitting another proposal of the same thing will not make it any more in scope. I have no idea why deposit with the British Library is in any way significant or even relevant. It's nice to mail documents to people who will save them, yes.

You ask these same questions often. Often enough that some have been banned as topics of conversation here. You've been doing it for years. "Can the scope of Unicode change?" you ask. At this point, I suggest that you act as if the answer is "No!" and move on, without trying to force Unicode to become a partner in your research. Even if the answer is "Maybe", it's not the kind of thing you can be *sure* will happen. You need to proceed in a way that doesn't depend on other things beyond your control. You want to join Unicode as an official member and try to change its scope from the inside, where you can even vote? Be my guest. You can't proceed with your research without a multinational standards committee changing *its entire scope and outlook* just to accommodate you? Then you're going about research wrong.

"Unicode and the International Standard with which it is synchronized are the standards," you say? Obviously not, since Unicode has said that it doesn't encode what you want. So it is NOT the standard for the things you want to use it for. It's the standard for other things. Do doctors insist that the WHO completely change its focus so their research can be included? Other researchers the world over are doing their thing without asking ISO, Unicode, ANSI, DIN, or, for that matter, the IAEA to change to suit them.
I have not heard of other cases like this, which doesn't mean there aren't any, but it probably means there aren't many, and I haven't heard any standards organizations announcing changes based on requests like this, either. This is not the standard you were looking for. Find another or make your own (or both), like a responsible researcher and scientist.

~mark

On 10/22/2015 05:21 AM, William_J_G Overington wrote:

> Mark E. Shoulson wrote:
>
> > Unicode isn't doing what you want? Make your own standard. Make it standard for *your* stuff. Get people to like it and use it.
>
> Unicode and the International Standard with which it is synchronized are the standards.
>
> I submitted a rewritten document on Monday 19 October 2015.
>
> The document is available on the web.
>
> http://www.users.globalnet.co.uk/~ngo/a_preliminary_proposal_to_encode_two_base_characters.pdf
>
> It is linked from the following web page.
>
> http://www.users.globalnet.co.uk/~ngo/library.htm
>
> The document has been deposited, as an email attachment, with the British Library for Legal Deposit, and a receipt has been received.
>
> Here is a link about Legal Deposit in the United Kingdom.
>
> http://www.bl.uk/aboutus/legaldeposit/index.html
>
> William Overington
>
> 22 October 2015

From petercon at microsoft.com  Thu Oct 22 21:47:18 2015
From: petercon at microsoft.com (Peter Constable)
Date: Fri, 23 Oct 2015 02:47:18 +0000
Subject: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)
In-Reply-To: <562983D5.2090808@kli.org>
References: <33232474.13418.1445504878174.JavaMail.root@webmail02.bt.ext.cpcloud.co.uk> <22651988.15114.1445505703382.JavaMail.defaultUser@defaultHost> <562983D5.2090808@kli.org>
Message-ID:

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Mark E. Shoulson
Sent: Friday, October 23, 2015 9:48 AM

> I have no idea why deposit with the British Library is in any way significant or even relevant. It's nice to mail documents to people who will save them, yes.

Hmmm... If I (or anyone else) were to forward to the British Library every item I post to this or other public lists or fora, or anything else I'd like to have publicly recorded, would they provide a permanent, public record? I would have expected them to be pretty selective about what things they decide to hang onto.

Peter

From charupdate at orange.fr  Fri Oct 23 01:59:21 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Fri, 23 Oct 2015 08:59:21 +0200 (CEST)
Subject: Latin glottal stop in ID in NWT, Canada
Message-ID: <1022406759.1183.1445583561682.JavaMail.www@wwinf1h14>

On Wed, 14 Oct 2015 16:04:20 +0000, Denis Jacquerye wrote:

> Here is what N.W.T.'s language commissioner, Shannon Gullberg, is quoted saying:
>
> "By not allowing for names that contain Dene fonts, diacritical marks and symbols, she says, the Vital Statistics Act is violating the spirit and intent of the Official Languages Act."
>
> [...]
>
> Where Dene languages expert Brent Kaulback is quoted saying:
>
> "Dene fonts are now unicode fonts. They can be loaded onto any computer, and if they're typed into any computer, any other computer can read those fonts as well."

I'm very glad to read the above statements. Only when looking closer, especially at other parts of the cited articles, but also elsewhere on the web in other countries, I'm amazed and disconcerted to see that there is still a serious lack of knowledge, even among learned people, about what Unicode is and about the nature of our daily work tools.
Some people keep talking about fonts where Unicode talks about characters. Some people view glottal stops as symbols that are not part of the Latin script, although these characters have shipped with every computer on Windows (and, indeed, on any other OS) through font support for almost a decade or so.

Reading further, I stumbled upon yet other oddities. Some people are calling "Roman alphabet" what seemingly should be Latin script, while roman is today a font style only. The following are but examples, which are here because they're inside the thread's topic:

>>> The department said it has to adhere to the Vital Statistics Act, which recognizes only names that use letters from the Roman alphabet. Having symbols like the glottal stop on birth certificates would also interfere with obtaining passports and other documents issued by the federal government, according to an email from a department spokesperson.

http://www.macleans.ca/society/life/all-in-the-family-name/

>>> The Northwest Territories government has refused to register the girl under that name, saying that all names must be spelled using the standard Roman alphabet.

>>> The territory's Vital Statistics Department told her she couldn't register her baby under that name. It said Roman characters are legally required for names because they have to appear on official federal documents.

>>> Healy said the department is working with Ottawa to see if it's possible to allow such characters on official documents. The issue raises both technical and economic issues, he said.

>>> "In the event that the fonts cannot be accepted by the federal government, the department will have to continue to produce a birth certificate that only includes the Roman alphabet," he said.

https://www.thestar.com/news/canada/2015/03/06/nwt-wont-recognize-infants-aboriginal-name.html

>>> the Northwest Territories government was unable to register a name that is not written entirely in the Roman alphabet.

>>> In an email [...], a government representative explained that's because the glottal stop isn't part of the Roman alphabet.

http://www.cbc.ca/news/canada/north/chipewyan-baby-name-not-allowed-on-n-w-t-birth-certificate-1.2984173

I stop quoting here so as not to lengthen the refrain, but rather point out that beyond the grounds mentioned for refusing the glottal stop (missing resources and unwillingness to buy new engraving and fixed-type printing machines, perhaps ignorance too), there seem to be two main reasons:

A | Missing awareness of ethical guidelines. The following quotations corroborate what I already outlined off list (I'm highlighting with uppercase):

>>> For Arok Wolvengrey, head of the indigenous languages department at the First Nations University of Canada in Regina, these stories aren't surprising, and point to the ways Aboriginal languages are under threat. "The decision not to allow the proper representation of their children's name IS A SERIOUS INSULT," he says. "This is ANOTHER EXAMPLE OF THE DUAL MESSAGES GOVERNMENTS OFTEN SEND. They say they respect our official languages, but that's definitely not how it plays out in practice. For many people who no longer speak these languages, this is the only way they can preserve their ancestry."

>>> In Nunavut, which recognizes Inuktitut, English and French, Inuit can register traditional names, INCLUDING THE GLOTTAL STOP, FOR GOVERNMENT DOCUMENTS. But it looks as though the Northwest Territories won't be MAKING CONCESSIONS ANY TIME SOON.
"Practically, the current vital statistics database and printer do not accommodate glottal stops ... and significant resources would be needed to upgrade them," a spokesperson for the department said in an email this week.

http://www.macleans.ca/society/life/all-in-the-family-name/

B | Missing keyboard layouts. Most people (including myself two years back) simply don't know how to update their keyboard layout in a manner that allows them to input these characters in a reasonable way. This, however, isn't mentioned by anybody, obviously because it questions the ability to control a more or less trivial everyday work tool.

Germany standardized, five years ago, a couple of new backwards-compatible keyboard layouts designed for the proper input of all occurring names in Latin script, and ALL THREE GLOTTAL STOPS HAVE PLACES ASSIGNED ON KEYS (at least on the most complete of these layouts). Finland has a standard keyboard with support for Sámi and many other languages. Great Britain has an extended keyboard layout including local language characters. In France there are at least two associations proposing keyboard layouts, and the government is standardizing an official layout or two for multilingual support.

In CANADA, QUÉBEC has had the merit of creating a MULTILINGUAL keyboard that was successfully standardized A QUARTER OF A CENTURY ago, and this is nowadays THREATENED BY THE IT INDUSTRY, so that it cannot even be completed to the initially planned end stage allowing full support of all Latin characters for official languages. See this other thread on the Unicode Public List:

Effectiveness of locale support (was: Re: Custom source samples)
http://www.unicode.org/mail-arch/unicode-ml/y2015-m09/0014.html
http://www.unicode.org/mail-arch/unicode-ml/y2015-m09/0015.html

I'm perhaps the last person on the Unicode List to be in a position to point to other people's ignorance, as I am very ignorant myself and was even more so when I started e-mailing Unicode and the Unicode List. Knowing thus by experience what ignorance is, how it works, and what it does, I'm in turn perhaps the only person capable of sending this e-mail. However, I came close to not sending it to the List, as I'd written it in a way that wasn't really fit for a public audience. Now I believe that my wording is measured enough not to hurt anybody it shouldn't.

I suggest that *all* Canadian local and territorial authorities cooperate with Québec and the Federal government to fully support the completion and implementation of the CANADIAN MULTILINGUAL STANDARD keyboard layout.

All the best,

Marcel

From richard.wordingham at ntlworld.com  Fri Oct 23 02:53:15 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 23 Oct 2015 08:53:15 +0100
Subject: Terminology (was: Latin glottal stop in ID in NWT, Canada)
In-Reply-To: <1022406759.1183.1445583561682.JavaMail.www@wwinf1h14>
References: <1022406759.1183.1445583561682.JavaMail.www@wwinf1h14>
Message-ID: <20151023085315.443d56b6@JRWUBU2>

On Fri, 23 Oct 2015 08:59:21 +0200 (CEST) Marcel Schneider wrote:

> Reading further, I stumbled upon yet other oddities. Some people are calling "Roman alphabet" what seemingly should be Latin script, while roman is today a font style only.
> The following are but examples, which are here because they're inside the thread's topic:

I think you're making the mistake of assuming that the Unicode Standard is written in English, rather than some jargon that is confusingly like it. I would like an English translation of Chapter 3 'Conformance', but I suspect a French translation would have higher priority, and I don't think that's going to happen any time soon.

'Latin script', in so far as it is translatable, translates into English as 'Roman alphabet'. In the language of the TUS, the word 'alphabet' has a more restricted meaning, whereby, for example, the Thai alphabet is not used for the Thai language! The Thai alphabet is, however, used for the Pali language and is promoted for Pattani Malay. When the characters of the Thai alphabet are used for the Thai language, they are used as an 'abugida', not as an 'alphabet'.

Richard.

From charupdate at orange.fr Fri Oct 23 03:52:31 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Fri, 23 Oct 2015 10:52:31 +0200 (CEST)
Subject: Terminology (was: Latin glottal stop in ID in NWT, Canada)
In-Reply-To: <20151023085315.443d56b6@JRWUBU2>
References: <1022406759.1183.1445583561682.JavaMail.www@wwinf1h14> <20151023085315.443d56b6@JRWUBU2>
Message-ID: <655778553.3264.1445590351640.JavaMail.www@wwinf1h14>

On Fri, 23 Oct 2015 08:53:15 +0100, Richard Wordingham wrote:

> On Fri, 23 Oct 2015 08:59:21 +0200 (CEST)
> Marcel Schneider wrote:
>
> > Reading on, I stumbled upon yet other oddities. Some people are calling “Roman alphabet” what seemingly should be the Latin script, while roman is today a font style only.
[...]
> 'Latin script', in so far as it is translatable, translates into English as 'Roman alphabet'.

Thank you for the correction. Indeed I wasn't aware that in the languages around me, “Roman alphabet”, though less recommended, is synonymous with “Latin alphabet”, and as such even has an entry in Simple Wikipedia, possibly to win minds over to “Latin alphabet” (which I've just helped along by replacing some instances). So I apologize to all the people I offended on this point, but I maintain the other statements unless otherwise corrected.

> I would like an English translation of Chapter 3 'Conformance', but I suspect a French translation would have higher priority, and I don't think that's going to happen any time soon.

Indeed, the last French translation being that of version 5.0, updating it would be consistent; but thinking about the Germans using the original documentation (presumably by reading it in English), and all the other countries and languages, and some Swiss people communicating across their internal language barrier by switching to English, I wonder whether this pain must be taken. For somebody who has read TUS in French, it could be much harder mailing about it on the Unicode Public List :)

Kind regards,

Marcel
From charupdate at orange.fr Fri Oct 23 06:34:26 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Fri, 23 Oct 2015 13:34:26 +0200 (CEST)
Subject: Terminology (was: Latin glottal stop in ID in NWT, Canada)
In-Reply-To: <20151023085315.443d56b6@JRWUBU2>
References: <1022406759.1183.1445583561682.JavaMail.www@wwinf1h14> <20151023085315.443d56b6@JRWUBU2>
Message-ID: <786526406.6308.1445600066983.JavaMail.www@wwinf1h14>

On Fri, 23 Oct 2015 08:53:15 +0100, Richard Wordingham wrote:

> I think you're making the mistake of assuming that the Unicode Standard is written in English, rather than some jargon that is confusingly like it.

The idea that some technical specification is not written in good English is generally an illusion produced by the very nature of the content. More specifically about TUS, I have a strong confidence in its accurate expression, which happens to be illustrated by the following quotation from the incriminated chapter (uppercase highlighting added):

>>> Additional information can be found throughout the other chapters of this core specification for the Unicode Standard. However, because of the need to keep extended discussions of scripts, sets of symbols, and other characters READABLE, material in other chapters is not always labeled as to its normative or informative status.
http://www.unicode.org/versions/Unicode8.0.0/ch03.pdf#G22672

> I would like an English translation of Chapter 3 'Conformance',

I guess that there may be some need of a *manual*, in the spirit that led the French translator to add annotations. Could you please quote some examples of what you wish to see expressed in a different way?

> 'Latin script', in so far as it is translatable, translates into English as 'Roman alphabet'.

I know your expertise from previous threads, but I have no means of adhering to the equivalence you put between a script and an alphabet. The delusion I point out in the quotations about the "Roman alphabet," or alternately, but far worse, the hypocrisy, is that while a handful of diacritics are certainly supported in order to spell French names in a reasonable and legible way, and while the æ and œ letters can scarcely be registered as "ae" or "oe" in Canada, other letters of the Latin (well, say, Roman) script are excluded, refused, and banned. And that is justified by telling people that a glottal stop isn't part of the Roman alphabet. "é" isn't either, as this character is not a part of the alphabet, just to take the one that is on *all* Canadian traditional keyboards. Nor is ?, which is on none of them [but is on the Canadian Multilingual Standard]. Agreed, I haven't been there to look into their database and at the cited printer.

> In the language of the TUS, the word 'alphabet' has a more restricted meaning, whereby, for example, the Thai alphabet is not used for the Thai language! The Thai alphabet is, however, used for the Pali language and is promoted for Pattani Malay. When the characters of the Thai alphabet are used for the Thai language, they are used as an 'abugida', not as an 'alphabet'.

Again, I do know nothing about Thai, but if in TUS an abugida can be referred to as an alphabet when it is used as such, it seems to me that the word 'alphabet' has a pretty extended meaning in TUS.

In any case, isolating an arbitrary subset inside our Latin script and promoting it as the so-called Roman alphabet to get some pretext for refusing to let compatriots or strangers bear their real and chosen names [quote] IS A SERIOUS INSULT [/quote].
Additionally, in the age of Unicode, this amounts to an insult to the whole work of the Consortium as well.

Marcel

From wjgo_10009 at btinternet.com Fri Oct 23 05:50:06 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Fri, 23 Oct 2015 11:50:06 +0100 (BST)
Subject: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)
In-Reply-To: <562983D5.2090808@kli.org>
References: <33232474.13418.1445504878174.JavaMail.root@webmail02.bt.ext.cpcloud.co.uk> <22651988.15114.1445505703382.JavaMail.defaultUser@defaultHost> <562983D5.2090808@kli.org>
Message-ID: <33319830.22484.1445597406500.JavaMail.defaultUser@defaultHost>

Mark E. Shoulson wrote:

> But if they say "no, you're out of scope" again, it probably means that you're out of scope, and submitting another proposal of the same thing will not make it any more in-scope.

Well, as at the time of writing this post, 11:06 am on Friday morning here in England, the document has neither appeared in the Unicode Document Register nor have I received any reply to my submission.

> I have no idea why deposition with the British Library is in any way significant or even relevant.

Four reasons.

1. Archiving of my writing for as long as civilization lasts.

2. Conservation, so that even if the idea is rejected now by the Unicode Consortium, the document is there for the future, when different people may look upon the idea differently.

3. Academic precedence: proof that I wrote about that idea at that time.

4. Proof of prior publication, in case someone else at a later date tries to patent the invention with a view to gaining a monopoly.

> It's nice to mail documents to people who will save them, yes.

Yes.

> You want to join Unicode as an official member and try to change its scope from the inside, where you can even vote? Be my guest.

Well, as an individual I cannot join as a Full Member, even if I could afford the money. I find it interesting that the Unicode Consortium publishes the Universal Declaration of Human Rights in many languages, yet a human being cannot join as a Full Member and have a vote. Interesting.

> You can't proceed with your research without a multinational standards committee changing *its entire scope and outlook* just to accommodate you?

Well, not its entire scope and outlook. Just a very small change from what has already been changed for flags, so as to allow this localized target display rather than just a direct glyph target display as for flags. The scope was changed for emoji and variation selectors requesting a colourful glyph; the scope was changed this year by undeprecating most of the tag characters and introducing the idea of a base character followed by a sequence of tag characters, with the sequence of tag characters derived from another standards document, external to Unicode. The scope is changed by considering using a base character and a sequence of tag characters for customized in-line graphics. The document is in the Unicode Document Register.

So can the scope change for my invention? I suggest that it can if the Unicode Technical Committee wants it to change. The issue is whether the Unicode Technical Committee will be allowed to consider that possibility in its meeting. It seems to me that discussion of a new invention should not be rejected solely on a scoping issue when the scoping rules were made before the invention was made.
I feel that it would serve no useful purpose for the encoding proposal to be rejected on the grounds of existing scope rules with no opportunity for the UTC as a whole to consider whether it wishes to change scope so that this invention, with its wonderful possibilities, can proceed.

It is not good to try to run in treacle, and I do not want to have to be satisfied with trying to develop some vastly underpowered markup-based system.

William Overington

23 October 2015

From wjgo_10009 at btinternet.com Fri Oct 23 09:08:22 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Fri, 23 Oct 2015 15:08:22 +0100 (BST)
Subject: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)
In-Reply-To:
References: <33232474.13418.1445504878174.JavaMail.root@webmail02.bt.ext.cpcloud.co.uk> <22651988.15114.1445505703382.JavaMail.defaultUser@defaultHost> <562983D5.2090808@kli.org>
Message-ID: <17310297.40446.1445609302606.JavaMail.defaultUser@defaultHost>

Peter Constable wrote:

> Hmmm... If I (or anyone else) were to forward to the British Library every item I post to this or other public lists or fora, or anything else I'd like to have publicly recorded, they'll provide a permanent, public record?

No. For Legal Deposit, there needs to be an association with the United Kingdom, as either where the item was produced, or published, or both. Also, the item must be published. However, given an association with the United Kingdom, as either where the item was produced, or published, or both, then the answer is, with some exceptions, broadly yes.

However, if someone from outside the United Kingdom sent the British Library something as a gift, then that is a separate matter from Legal Deposit, and I have been advised that the matter would be dealt with by a different department: the item would be sent to a Curator and a decision would be made as to whether to keep the item.

There is a web page leading to lots of information about Legal Deposit.

http://www.bl.uk/aboutus/legaldeposit/index.html
http://www.legislation.gov.uk/uksi/2013/777/regulation/13/made

However, sound is accepted when it is part of a larger item. So, for example, my sound recording, a .wav file, embedded in a pdf with some notes, was accepted.

http://www.users.globalnet.co.uk/~ngo/the_mobile_art_shop.pdf

Actually, the pdf is smaller than the original .wav file due to lossless compression when exporting the pdf from the Serif PagePlus desktop publishing program. If playing the sound, please note that there can be problems with some browser and pdf reader combinations. The best thing is to download the file to local storage, then open Adobe Reader, then open the file from within Adobe Reader.

I have deposited various types of item, including .pdf files (including three pdfs each with a sound recording) and .TTF files. I think that I was the first person to deposit a .TTF file.

> I would have expected them to be pretty selective of what things they decide to hang onto.

The idea is to gather a collection of all of the cultural output of the United Kingdom. So the collection policy is comprehensive, with a few exceptions as to type of publication, yet not based on any assessment of the literary merit of an item of a collected type.
Items are gathered by the British Library automated harvester program from my family webspace from time to time, yet for some items I send a copy as an email attachment at or soon after publication and receive an email receipt, so that I know that the item is stored at the British Library.

William Overington

23 October 2015

From wjgo_10009 at btinternet.com Fri Oct 23 10:01:01 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Fri, 23 Oct 2015 16:01:01 +0100 (BST)
Subject: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)
In-Reply-To: <562914A9.6030802@unicode.org>
References: <33365178.33329.1444395933996.JavaMail.defaultUser@defaultHost> <6041047.15822.1444472090118.JavaMail.defaultUser@defaultHost> <26364438.57814.1445015417239.JavaMail.defaultUser@defaultHost> <562914A9.6030802@unicode.org>
Message-ID: <7184978.47471.1445612461273.JavaMail.defaultUser@defaultHost>

Thank you for your comprehensive answer.

Rick McGowan wrote:

> Personally, I think you're getting ahead of yourself. First, you should demonstrate that you have done research and produced results that at least some people find so useful and important that they are eager to implement the findings. Then, once you have done that, think about standardizing something, but only after you have a working model of the thing sufficient to demonstrate its general utility.

I am an independent researcher, researching at home, using the internet and various software items on a laptop computer. I am not able to produce a working model. I can mostly only produce thought experiments, sometimes expressed as a simulation, like a story narrative. Maybe I could produce a short animation movie.

> While I do not speak for the UTC in any way, observations of the committee over a period of some years have led me to conclude that they never encode something, call it "X", on pure speculation that some future research might result in "X" being useful for some purpose that has not even been demonstrated as a need, or clearly enough articulated to engender the committee's confidence in its potential utility.

Well, as I say, I am an independent researcher, researching at home. May I just mention one thing, though, which might be regarded as significant.

A short time ago I was talking with someone who is a clinician and I asked whether there were issues trying to communicate with people through the language barrier. I was told that sometimes people bring a relative or friend to translate. An example was given to me of sometimes needing to use mime to try to express the meaning of "Have you vomited?".

I asked if the following would be helpful. Use your computer to look down a menu for a preset sentence "Have you vomited?". Select the sentence. Behind the scenes a code is generated. Throw the code to the mobile telephone of the patient. On the screen of the patient's mobile telephone the sentence, localized into his or her language, is displayed.

I said that there would be a standardized list of preset sentences, set out in English, as International Standards are produced in English, and that the National Standardization Body for each country would translate the list into the language of its country and produce a list to convert the codes to the local language.

There was amazement and enthusiasm for this possibility.

So there we are.
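Mechanically, the scheme is nothing more than a shared table mapping sentence codes to vetted translations, with only the code travelling between devices. A minimal Python sketch (the code "MED0042", the translations, and the function name are invented for illustration; no such standardized list exists):

    # Toy model of the preset-sentence idea. "MED0042" and the translations
    # are invented placeholders, not entries from any real standard.
    SENTENCES = {
        "MED0042": {
            "en": "Have you vomited?",
            "fr": "Avez-vous vomi ?",
            "de": "Haben Sie erbrochen?",
        },
    }

    def localize(code: str, locale: str) -> str:
        """Resolve a sentence code to the vetted translation for a locale,
        falling back to the English reference text."""
        translations = SENTENCES[code]
        return translations.get(locale, translations["en"])

    print(localize("MED0042", "fr"))  # Avez-vous vomi ?

Because each device resolves the code against its own locally stored table, no at-the-time translation service is involved; the provenance of each string is whatever vetting went into building the table.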
The supreme irony of all of this is that there has been much objection to my invention in this mailing list over the years, with no good reason ever stated, yet it would be the very existence of The Unicode Standard itself that would allow the localized text to appear on the screen of the mobile telephone of the patient!

If this invention had been made in the research laboratory of a large information technology company maybe things would be very different.

William Overington

23 October 2015

From rick at unicode.org Fri Oct 23 12:11:56 2015
From: rick at unicode.org (Rick McGowan)
Date: Fri, 23 Oct 2015 10:11:56 -0700
Subject: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)
In-Reply-To: <7184978.47471.1445612461273.JavaMail.defaultUser@defaultHost>
References: <33365178.33329.1444395933996.JavaMail.defaultUser@defaultHost> <6041047.15822.1444472090118.JavaMail.defaultUser@defaultHost> <26364438.57814.1445015417239.JavaMail.defaultUser@defaultHost> <562914A9.6030802@unicode.org> <7184978.47471.1445612461273.JavaMail.defaultUser@defaultHost>
Message-ID: <562A6A5C.7000506@unicode.org>

William,

All right... This is likely to be my last posting on the subject...

> ... there has been much objection to my invention in this mailing list over the years, with no good reason ever stated, ...
>
> If this invention had been made in the research laboratory of a large information technology company maybe things would be very different.

Please see attached image, for example. While it's not yet as fun as Star Trek, this kind of thing can be done for simple interactions in a variety of languages using a $20 cell phone...

See also: https://en.wikipedia.org/wiki/Google_Translate

/As of October 2015, Google Translate supports 90 languages at various levels and serves over 200 million people daily./

[Attachment: have-you-vomited.jpg, image/jpeg, 45351 bytes]

From eik at iki.fi Fri Oct 23 12:31:57 2015
From: eik at iki.fi (Erkki I Kolehmainen)
Date: Fri, 23 Oct 2015 20:31:57 +0300
Subject: VS: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)
In-Reply-To: <7184978.47471.1445612461273.JavaMail.defaultUser@defaultHost>
References: <33365178.33329.1444395933996.JavaMail.defaultUser@defaultHost> <6041047.15822.1444472090118.JavaMail.defaultUser@defaultHost> <26364438.57814.1445015417239.JavaMail.defaultUser@defaultHost> <562914A9.6030802@unicode.org> <7184978.47471.1445612461273.JavaMail.defaultUser@defaultHost>
Message-ID: <000301d10db8$b7e2c250$27a846f0$@fi>

Dear Mr. Overington,

First of all, you have never paid any attention to the formidable problems of getting vetted translations of whatever proposed (or to-be-proposed) standard sentences of yours. You have admitted that you are not at all familiar with CLDR, but the people who have worked on CLDR are fully aware of the problems of getting agreement on localized expressions for all kinds of items.

The value of deposit at the British Library seems questionable at best. Furthermore, if published means published on this list, it has no value whatsoever, since it does not mean any peer review and acceptance, which, as you well know, isn't forthcoming.
Incidentally, the standards body that has had considerable dealings with some of the kinds of problems that you claim to be researching is ETSI Human Factors. You might want to approach them in order to get any support.

Sincerely,

Erkki I. Kolehmainen

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of William_J_G Overington
Sent: 23 October 2015 18:01
To: rick at unicode.org; unicode at unicode.org
Subject: Re: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)

Thank you for your comprehensive answer.

Rick McGowan wrote:

> Personally, I think you're getting ahead of yourself. First, you should demonstrate that you have done research and produced results that at least some people find so useful and important that they are eager to implement the findings. Then, once you have done that, think about standardizing something, but only after you have a working model of the thing sufficient to demonstrate its general utility.

I am an independent researcher, researching at home, using the internet and various software items on a laptop computer. I am not able to produce a working model. I can mostly only produce thought experiments, sometimes expressed as a simulation, like a story narrative. Maybe I could produce a short animation movie.

> While I do not speak for the UTC in any way, observations of the committee over a period of some years have led me to conclude that they never encode something, call it "X", on pure speculation that some future research might result in "X" being useful for some purpose that has not even been demonstrated as a need, or clearly enough articulated to engender the committee's confidence in its potential utility.

Well, as I say, I am an independent researcher, researching at home. May I just mention one thing, though, which might be regarded as significant.

A short time ago I was talking with someone who is a clinician and I asked whether there were issues trying to communicate with people through the language barrier. I was told that sometimes people bring a relative or friend to translate. An example was given to me of sometimes needing to use mime to try to express the meaning of "Have you vomited?".

I asked if the following would be helpful. Use your computer to look down a menu for a preset sentence "Have you vomited?". Select the sentence. Behind the scenes a code is generated. Throw the code to the mobile telephone of the patient. On the screen of the patient's mobile telephone the sentence, localized into his or her language, is displayed.

I said that there would be a standardized list of preset sentences, set out in English, as International Standards are produced in English, and that the National Standardization Body for each country would translate the list into the language of its country and produce a list to convert the codes to the local language.

There was amazement and enthusiasm for this possibility.

So there we are.

The supreme irony of all of this is that there has been much objection to my invention in this mailing list over the years, with no good reason ever stated, yet it would be the very existence of The Unicode Standard itself that would allow the localized text to appear on the screen of the mobile telephone of the patient!

If this invention had been made in the research laboratory of a large information technology company maybe things would be very different.
William Overington

23 October 2015

From srl at icu-project.org Fri Oct 23 13:01:23 2015
From: srl at icu-project.org (Steven R. Loomis)
Date: Fri, 23 Oct 2015 11:01:23 -0700
Subject: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)
In-Reply-To: <7184978.47471.1445612461273.JavaMail.defaultUser@defaultHost>
References: <33365178.33329.1444395933996.JavaMail.defaultUser@defaultHost> <6041047.15822.1444472090118.JavaMail.defaultUser@defaultHost> <26364438.57814.1445015417239.JavaMail.defaultUser@defaultHost> <562914A9.6030802@unicode.org> <7184978.47471.1445612461273.JavaMail.defaultUser@defaultHost>
Message-ID: <95E382E8-13E6-4AF0-AE77-995E07375876@icu-project.org>

William,

I work in the research laboratory of (but do not speak for) a large information technology company. I'm also their primary representative to Unicode and other standards bodies. I would not (and have not) leapt from an idea to a document to a standard.

I won't repeat the good and helpful advice you have already received. Make your first target a working model, not standardization. That is the way the research laboratory of a large information technology company works.

S

Sent from our iPhone.

> On 23 Oct 2015, at 8:01 AM, William_J_G Overington wrote:
>
> If this invention had been made in the research laboratory of a large information technology company maybe things would be very different.

From duerst at it.aoyama.ac.jp Fri Oct 23 16:36:06 2015
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Sat, 24 Oct 2015 06:36:06 +0900
Subject: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)
In-Reply-To: <562A6A5C.7000506@unicode.org>
References: <33365178.33329.1444395933996.JavaMail.defaultUser@defaultHost> <6041047.15822.1444472090118.JavaMail.defaultUser@defaultHost> <26364438.57814.1445015417239.JavaMail.defaultUser@defaultHost> <562914A9.6030802@unicode.org> <7184978.47471.1445612461273.JavaMail.defaultUser@defaultHost> <562A6A5C.7000506@unicode.org>
Message-ID: <562AA846.10306@it.aoyama.ac.jp>

On 2015/10/24 02:11, Rick McGowan wrote:

> William,
>
> All right... This is likely to be my last posting on the subject...
>
>> ... there has been much objection to my invention in this mailing list over the years, with no good reason ever stated, ...
>>
>> If this invention had been made in the research laboratory of a large information technology company maybe things would be very different.

It's easy to guess that many people have made very similar inventions before. For example, there are many books that contain simple phrases in a few languages for tourists.

Also, if you had your set of sentences and their translations, it wouldn't be difficult to create e.g. a smart phone application for it.

The doctor you mentioned was excited about your idea because she isn't a language specialist. If she had thought about, or experimented with, the idea, she would quickly have come to a point where she wants more and more sentences, for all kinds of slightly different situations. That's the point where she will start to see that your idea isn't actually that great.

> Please see attached image, for example.
> While it's not yet as fun as Star Trek, this kind of thing can be done for simple interactions in a variety of languages using a $20 cell phone...
>
> See also: https://en.wikipedia.org/wiki/Google_Translate
>
> /As of October 2015, Google Translate supports 90 languages at various levels and serves over 200 million people daily./

Well, the translation isn't perfect :-(. It translates "Have you vomited?" to "あなたは嘔吐していますか？". Apart from the unnecessary (in Japanese) subject, and the usually not used question mark, it's present tense, corresponding to "Are you vomiting?". I'm sure no doctor would have to ask this.

Regards, Martin.

From richard.wordingham at ntlworld.com Fri Oct 23 17:16:32 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 23 Oct 2015 23:16:32 +0100
Subject: Terminology (was: Latin glottal stop in ID in NWT, Canada)
In-Reply-To: <786526406.6308.1445600066983.JavaMail.www@wwinf1h14>
References: <1022406759.1183.1445583561682.JavaMail.www@wwinf1h14> <20151023085315.443d56b6@JRWUBU2> <786526406.6308.1445600066983.JavaMail.www@wwinf1h14>
Message-ID: <20151023231632.649d706e@JRWUBU2>

On Fri, 23 Oct 2015 13:34:26 +0200 (CEST)
Marcel Schneider wrote:

> On Fri, 23 Oct 2015 08:53:15 +0100, Richard Wordingham wrote:
>
> > I would like an English translation of Chapter 3 'Conformance',
>
> I guess that there may be some need of a *manual*, in the spirit that led the French translator to add annotations. Could you please quote some examples of what you wish to see expressed in a different way?

"C5: A process shall not assume that it is required to interpret any particular coded character sequence."

I think this is meant to mean that processes do not have to interpret every coded character sequence presented to them, but this appears to be a concession and not a requirement, and I cannot derive it from the text. An example of a non-compliant process would be helpful. I could interpret this requirement as prohibiting the generation of a missing-glyph glyph, for that is an error report that the process has failed to interpret a coded character sequence. I hope this is not an intended interpretation.

"C6: A process shall not assume that the interpretations of two canonical-equivalent character sequences are distinct."

Firstly, I have grave difficulties assigning mental activities to processes.

Secondly, it may be possible to interpret "A process shall not assume X" as "A process shall function correctly regardless of whether X holds."

However, let image(Y) be the bitmap depicting the string Y. Then the following logic would be non-compliant:

    if A and B are canonically equivalent and image(A) and image(B) are different, then
        write(A, " and ", B, " are canonically equivalent but have different images ", image(A), " and ", image(B));
    end if

The logic is non-compliant, for if it is invoked then the write statement will only work correctly if image(A) and image(B) are different, i.e. if A and B are interpreted differently. Apparently it is permissible to render canonically equivalent sequences differently, so image(A) and image(B) might be different even though A and B are canonically equivalent.

I therefore conclude that C6 is in some language that I do not adequately understand.
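For what it's worth, the canonical-equivalence test in the hypothesis is mechanical: two strings are canonically equivalent exactly when their NFD forms are identical. A minimal Python sketch of just that test (the rendering function image(), by contrast, has no portable counterpart):

    import unicodedata

    def canonically_equivalent(a: str, b: str) -> bool:
        # Canonical equivalence holds exactly when the NFD forms are identical.
        return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

    print(canonically_equivalent("\u00E9", "e\u0301"))  # True: precomposed vs decomposed e-acute
    print("\u00E9" == "e\u0301")                        # False: the code point sequences are distinct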
> Again, I do know nothing about Thai, but if in TUS an abugida can be referred to as an alphabet when it is used as such, it seems to me that the word 'alphabet' has a pretty extended meaning in TUS.

TUS tries to make accurate use of the distinction between 'alphabet', 'abugida' and 'abjad', 20th-century jargon promoted if not invented by Peter Daniels. The distinction lies in the way vowels are indicated - always / with a default / not at all. The distinction may be useful for a writing system, i.e. a way of using the 'script', but it rapidly encounters the problem that a script may have several different writing systems. For example, the presence or absence of vowel marks switches the Arabic and Hebrew scripts, as used for those languages, between being an abjad and being an alphabet.

> In any case, isolating an arbitrary subset inside our Latin script and promoting it as the so-called Roman alphabet to get some pretext for refusing to let compatriots or strangers bear their real and chosen names [quote] IS A SERIOUS INSULT [/quote].

It is not a matter of an 'arbitrary' subset. I expect the relevant subset is the *French* alphabet, assuming Quebec has followed (or preceded?) France and added 'w' to the alphabet. That this subset should be confused with the concept of 'Roman' is not surprising, even though the Romans lacked 'J', 'j', 'U', 'v', 'W' and 'w'.

> Additionally, in the age of Unicode, this amounts to an insult to the whole work of the Consortium as well.

Unicode does not dictate what is accepted as 'the alphabet'; it is only recently that 'j' has been accepted as part of the Welsh alphabet. When I was a child, I learnt that there was no 'j' in the Welsh alphabet - and wondered how the Joneses were supposed to write their names in Welsh! (One partial answer, of course, is that the very common Welsh surname 'Jones' is English for 'Evans'.)

Richard.

From doug at ewellic.org Fri Oct 23 17:25:19 2015
From: doug at ewellic.org (Doug Ewell)
Date: Fri, 23 Oct 2015 15:25:19 -0700
Subject: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)
Message-ID: <20151023152519.665a7a7059d7ee80bb4d670165c8327d.03141f2aa3.wbe@email03.secureserver.net>

Martin J. Dürst wrote:

> Well, the translation isn't perfect :-(. It translates "Have you vomited?" to "あなたは嘔吐していますか？". Apart from the unnecessary (in Japanese) subject, and the usually not used question mark, it's present tense, corresponding to "Are you vomiting?". I'm sure no doctor would have to ask this.

Doctors ask this kind of question all the time: "Are you coughing? Are you wheezing?" The clear intent of such a question is not so much "Are you now?" but rather "Have you been lately?"

In any case, the language-challenged doctor-patient pair should reach common understanding fairly quickly.

--
Doug Ewell | http://ewellic.org | Thornton, CO

From mollycatblack at gmail.com Fri Oct 23 21:33:43 2015
From: mollycatblack at gmail.com (Molly Black)
Date: Fri, 23 Oct 2015 20:33:43 -0600
Subject: crafting emoji
Message-ID:

Why are there no knitting needles, yarn, sewing needle with thread, or sewing machine in the emoji library? Has this been discussed already?

From mark at macchiato.com Fri Oct 23 23:03:05 2015
From: mark at macchiato.com (Mark Davis ☕️)
Date: Fri, 23 Oct 2015 21:03:05 -0700
Subject: crafting emoji
In-Reply-To:
References:
Message-ID:

We haven't seen a proposal for that. See http://www.unicode.org/emoji/selection.html for how to submit one.
Mark

On Fri, Oct 23, 2015 at 7:33 PM, Molly Black wrote:

> Why are there no knitting needles, yarn, sewing needle with thread, or sewing machine in the emoji library? Has this been discussed already?

From eliz at gnu.org Sat Oct 24 00:40:32 2015
From: eliz at gnu.org (Eli Zaretskii)
Date: Sat, 24 Oct 2015 08:40:32 +0300
Subject: Terminology (was: Latin glottal stop in ID in NWT, Canada)
In-Reply-To: <20151023231632.649d706e@JRWUBU2>
References: <1022406759.1183.1445583561682.JavaMail.www@wwinf1h14> <20151023085315.443d56b6@JRWUBU2> <786526406.6308.1445600066983.JavaMail.www@wwinf1h14> <20151023231632.649d706e@JRWUBU2>
Message-ID: <83io5wygbz.fsf@gnu.org>

> Date: Fri, 23 Oct 2015 23:16:32 +0100
> From: Richard Wordingham
>
> "C6: A process shall not assume that the interpretations of two canonical-equivalent character sequences are distinct."
>
> Firstly, I have grave difficulties assigning mental activities to processes.
>
> Secondly, it may be possible to interpret "A process shall not assume X" as "A process shall function correctly regardless of whether X holds."
>
> However, let image(Y) be the bitmap depicting the string Y. Then the following logic would be non-compliant:
>
>     if A and B are canonically equivalent and image(A) and image(B) are different, then
>         write(A, " and ", B, " are canonically equivalent but have different images ", image(A), " and ", image(B));
>     end if
>
> The logic is non-compliant, for if it is invoked then the write statement will only work correctly if image(A) and image(B) are different, i.e. if A and B are interpreted differently. Apparently it is permissible to render canonically equivalent sequences differently, so image(A) and image(B) might be different even though A and B are canonically equivalent.
>
> I therefore conclude that C6 is in some language that I do not adequately understand.

AFAIU, Unicode is about processing text, and only mentions display rarely, where it's directly related to the processing part. So the above is about _processing_ canonically-equivalent sequences, not about their display. When looked at in this way, I see no difficulties in understanding the text.

> > Again, I do know nothing about Thai, but if in TUS an abugida can be referred to as an alphabet when it is used as such, it seems to me that the word 'alphabet' has a pretty extended meaning in TUS.
>
> TUS tries to make accurate use of the distinction between 'alphabet', 'abugida' and 'abjad', 20th-century jargon promoted if not invented by Peter Daniels. The distinction lies in the way vowels are indicated - always / with a default / not at all. The distinction may be useful for a writing system, i.e. a way of using the 'script', but it rapidly encounters the problem that a script may have several different writing systems. For example, the presence or absence of vowel marks switches the Arabic and Hebrew scripts, as used for those languages, between being an abjad and being an alphabet.

The Hebrew script is never an alphabet, AFAIU; it's likely an abugida when the vowel marks are used. The so-called "full spelling", where some vowels are indicated by consonants, does not replace all the vowels with consonants, so it isn't, strictly speaking, an alphabet in the above sense.
From c933103 at gmail.com Sat Oct 24 01:33:05 2015
From: c933103 at gmail.com (gfb hjjhjh)
Date: Sat, 24 Oct 2015 14:33:05 +0800
Subject: Fwd: Terminology (was: Latin glottal stop in ID in NWT, Canada)
In-Reply-To:
References: <1022406759.1183.1445583561682.JavaMail.www@wwinf1h14> <20151023085315.443d56b6@JRWUBU2> <786526406.6308.1445600066983.JavaMail.www@wwinf1h14>
Message-ID:

---------- Forwarded message ----------
From: gfb hjjhjh
Date: 2015-10-23 20:17 GMT+08:00
Subject: Re: Terminology (was: Latin glottal stop in ID in NWT, Canada)
To: Marcel Schneider

Writing other languages in the Latin alphabet is still called romanization, not latinization.

2015-10-23 19:34 GMT+08:00 Marcel Schneider:

> On Fri, 23 Oct 2015 08:53:15 +0100, Richard Wordingham <richard.wordingham at ntlworld.com> wrote:
>
> > I think you're making the mistake of assuming that the Unicode Standard is written in English, rather than some jargon that is confusingly like it.
>
> The idea that some technical specification is not written in good English is generally an illusion produced by the very nature of the content. More specifically about TUS, I have a strong confidence in its accurate expression, which happens to be illustrated by the following quotation from the incriminated chapter (uppercase highlighting added):
>
> >>> Additional information can be found throughout the other chapters of this core specification for the Unicode Standard. However, because of the need to keep extended discussions of scripts, sets of symbols, and other characters READABLE, material in other chapters is not always labeled as to its normative or informative status.
> http://www.unicode.org/versions/Unicode8.0.0/ch03.pdf#G22672
>
> > I would like an English translation of Chapter 3 'Conformance',
>
> I guess that there may be some need of a *manual*, in the spirit that led the French translator to add annotations. Could you please quote some examples of what you wish to see expressed in a different way?
>
> > 'Latin script', in so far as it is translatable, translates into English as 'Roman alphabet'.
>
> I know your expertise from previous threads, but I have no means of adhering to the equivalence you put between a script and an alphabet. The delusion I point out in the quotations about the "Roman alphabet," or alternately, but far worse, the hypocrisy, is that while a handful of diacritics are certainly supported in order to spell French names in a reasonable and legible way, and while the æ and œ letters can scarcely be registered as "ae" or "oe" in Canada, other letters of the Latin (well, say, Roman) script are excluded, refused, and banned. And that is justified by telling people that a glottal stop isn't part of the Roman alphabet. "é" isn't either, as this character is not a part of the alphabet, just to take the one that is on *all* Canadian traditional keyboards. Nor is ?, which is on none of them [but is on the Canadian Multilingual Standard]. Agreed, I haven't been there to look into their database and at the cited printer.
>
> > In the language of the TUS, the word 'alphabet' has a more restricted meaning, whereby, for example, the Thai alphabet is not used for the Thai language! The Thai alphabet is, however, used for the Pali language and is promoted for Pattani Malay. When the characters of the Thai alphabet are used for the Thai language, they are used as an 'abugida', not as an 'alphabet'.
> Again, I do know nothing about Thai, but if in TUS an abugida can be referred to as an alphabet when it is used as such, it seems to me that the word 'alphabet' has a pretty extended meaning in TUS.
>
> In any case, isolating an arbitrary subset inside our Latin script and promoting it as the so-called Roman alphabet to get some pretext for refusing to let compatriots or strangers bear their real and chosen names [quote] IS A SERIOUS INSULT [/quote].
>
> Additionally, in the age of Unicode, this amounts to an insult to the whole work of the Consortium as well.
>
> Marcel

From Shawn.Steele at microsoft.com Sat Oct 24 02:59:56 2015
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Sat, 24 Oct 2015 07:59:56 +0000
Subject: crafting emoji
In-Reply-To:
References:
Message-ID:

Seeing the title first, I read “crafting” as a verb and thought you wanted to knit some ?

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Molly Black
Sent: Friday, October 23, 2015 7:34 PM
To: unicode at unicode.org
Subject: crafting emoji

Why are there no knitting needles, yarn, sewing needle with thread, or sewing machine in the emoji library? Has this been discussed already?

From richard.wordingham at ntlworld.com Sat Oct 24 06:33:32 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sat, 24 Oct 2015 12:33:32 +0100
Subject: Terminology (was: Latin glottal stop in ID in NWT, Canada)
In-Reply-To: <83io5wygbz.fsf@gnu.org>
References: <1022406759.1183.1445583561682.JavaMail.www@wwinf1h14> <20151023085315.443d56b6@JRWUBU2> <786526406.6308.1445600066983.JavaMail.www@wwinf1h14> <20151023231632.649d706e@JRWUBU2> <83io5wygbz.fsf@gnu.org>
Message-ID: <20151024123332.4a3b263d@JRWUBU2>

On Sat, 24 Oct 2015 08:40:32 +0300
Eli Zaretskii wrote:

> > Date: Fri, 23 Oct 2015 23:16:32 +0100
> > From: Richard Wordingham
> >
> > "C6: A process shall not assume that the interpretations of two canonical-equivalent character sequences are distinct."
> >
> > Firstly, I have grave difficulties assigning mental activities to processes.
> >
> > Secondly, it may be possible to interpret "A process shall not assume X" as "A process shall function correctly regardless of whether X holds."
> >
> > However, let image(Y) be the bitmap depicting the string Y. Then the following logic would be non-compliant:
> >
> >     if A and B are canonically equivalent and image(A) and image(B) are different, then
> >         write(A, " and ", B, " are canonically equivalent but have different images ", image(A), " and ", image(B));
> >     end if
> >
> > The logic is non-compliant, for if it is invoked then the write statement will only work correctly if image(A) and image(B) are different, i.e. if A and B are interpreted differently. Apparently it is permissible to render canonically equivalent sequences differently, so image(A) and image(B) might be different even though A and B are canonically equivalent.
> >
> > I therefore conclude that C6 is in some language that I do not adequately understand.
>
> AFAIU, Unicode is about processing text, and only mentions display rarely, where it's directly related to the processing part. So the above is about _processing_ canonically-equivalent sequences, not about their display. When looked at in this way, I see no difficulties in understanding the text.
Display is part of interpretation - indeed, it is currently the most important part. At least, I would interpret displaying U+0041 with a glyph like 'X' (an example in 'D2 Character identity') as violating:

"C4: A process shall interpret a coded character sequence according to the character semantics established by this standard, if that process does interpret that coded character sequence."

I chose the complicated function image() as being less controversial. However, as you do not think it interprets a string, consider the full, default toUppercase() instead.

The problem lies with the troublesome U+0345 COMBINING GREEK YPOGEGRAMMENI (subscript iota) with ccc=240, which uppercases to U+0399 GREEK CAPITAL LETTER IOTA with ccc=0. While U+0345 commutes with Greek accents, U+0399 does not. Thus U+1F80 GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI uppercases, in full mode, to <U+1F08, U+0399>, but the canonically equivalent lower case form <U+03B1, U+0345, U+0313> uppercases, in full mode, to the inequivalent upper case <U+0391, U+0399, U+0313>.

The brute force solution to this (in practice minor) issue is to convert strings to NFD before upper-casing, but this would fall foul of one guess at the meaning of C6, namely "An author shall not assume that the interpretations of two canonical-equivalent character sequences are distinct". Of course, if that is the meaning, determining whether X = toNFC(toUppercase(toNFD(X))) is compliant depends on answering the question, "Did the author think he could get a different result if he omitted the conversion to NFD?". I'm not sure whether the code would be compliant under my interpretation if the author was unsure as to whether omitting the conversion would get a different result.
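The asymmetry is easy to reproduce; a minimal Python sketch, assuming (as holds in CPython) that str.upper() applies the full case mappings:

    import unicodedata

    def nfc(s: str) -> str:
        return unicodedata.normalize("NFC", s)

    def nfd(s: str) -> str:
        return unicodedata.normalize("NFD", s)

    precomposed = "\u1F80"            # GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI
    reordered = "\u03B1\u0345\u0313"  # equivalent: ypogegrammeni (ccc=240) written before psili (ccc=230)

    assert nfd(precomposed) == nfd(reordered)  # the two inputs are canonically equivalent

    print(nfc(precomposed.upper()))     # 'ἈΙ' - the psili stays on the alpha
    print(nfc(reordered.upper()))       # 'ΑἸ' - U+0345 became U+0399 (ccc=0), stranding the psili on the iota
    print(nfc(nfd(reordered).upper()))  # 'ἈΙ' - the brute-force fix: normalize to NFD before casing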
> The Hebrew script is never an alphabet, AFAIU; it's likely an abugida when the vowel marks are used.

No, the definition of an abugida is that there is a default vowel which is indicated by the absence of any vowel mark. In fully pointed Hebrew, it's only final, silent and quiescent consonants that lack vowel marks. I don't like the definitions, because they are extremely vulnerable to small changes in use. Indeed, having taken the name from the consonant system underlying the Ethiopic syllabary, the inventors of the term subsequently concluded that the eponymous abugida was not actually an abugida!

> The so-called "full spelling", where some vowels are indicated by consonants, does not replace all the vowels with consonants, so it isn't, strictly speaking, an alphabet in the above sense.

Nor would I claim it as such.

Richard.

From eliz at gnu.org Sat Oct 24 06:43:27 2015
From: eliz at gnu.org (Eli Zaretskii)
Date: Sat, 24 Oct 2015 14:43:27 +0300
Subject: Terminology (was: Latin glottal stop in ID in NWT, Canada)
In-Reply-To: <20151024123332.4a3b263d@JRWUBU2>
References: <1022406759.1183.1445583561682.JavaMail.www@wwinf1h14> <20151023085315.443d56b6@JRWUBU2> <786526406.6308.1445600066983.JavaMail.www@wwinf1h14> <20151023231632.649d706e@JRWUBU2> <83io5wygbz.fsf@gnu.org> <20151024123332.4a3b263d@JRWUBU2>
Message-ID: <83twpgwkyo.fsf@gnu.org>

> Date: Sat, 24 Oct 2015 12:33:32 +0100
> From: Richard Wordingham
>
> > AFAIU, Unicode is about processing text, and only mentions display rarely, where it's directly related to the processing part. So the above is about _processing_ canonically-equivalent sequences, not about their display. When looked at in this way, I see no difficulties in understanding the text.
>
> Display is part of interpretation - indeed, it is currently the most important part. At least, I would interpret displaying U+0041 with a glyph like 'X' (an example in 'D2 Character identity') as violating:
>
> "C4: A process shall interpret a coded character sequence according to the character semantics established by this standard, if that process does interpret that coded character sequence."

Sorry, I don't see this as a violation. "Interpret" doesn't have to include display.

> > The Hebrew script is never an alphabet, AFAIU; it's likely an abugida when the vowel marks are used.
>
> No, the definition of an abugida is that there is a default vowel which is indicated by the absence of any vowel mark. In fully pointed Hebrew, it's only final, silent and quiescent consonants that lack vowel marks.

Are you saying that "a default vowel which is indicated by the absence of any vowel mark" excludes the absence of any vowel? If so, then I'd consider that strange, but all it means is that Hebrew does not fit this classification at all.

> > The so-called "full spelling", where some vowels are indicated by consonants, does not replace all the vowels with consonants, so it isn't, strictly speaking, an alphabet in the above sense.
>
> Nor would I claim it as such.

But you did say:

> the presence or absence of vowel marks switches the Arabic and Hebrew scripts, as used for those languages, between being an abjad and being an alphabet.

Then when is the Hebrew script an alphabet, in your view?

From richard.wordingham at ntlworld.com Sat Oct 24 07:45:31 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sat, 24 Oct 2015 13:45:31 +0100
Subject: Terminology (was: Latin glottal stop in ID in NWT, Canada)
In-Reply-To: <83twpgwkyo.fsf@gnu.org>
References: <1022406759.1183.1445583561682.JavaMail.www@wwinf1h14> <20151023085315.443d56b6@JRWUBU2> <786526406.6308.1445600066983.JavaMail.www@wwinf1h14> <20151023231632.649d706e@JRWUBU2> <83io5wygbz.fsf@gnu.org> <20151024123332.4a3b263d@JRWUBU2> <83twpgwkyo.fsf@gnu.org>
Message-ID: <20151024134531.191e37a9@JRWUBU2>

On Sat, 24 Oct 2015 14:43:27 +0300
Eli Zaretskii wrote:

> Then when is the Hebrew script an alphabet, in your view?

The Hebrew script for Hebrew is an alphabet when the niqqud are used, as in ordinary copies of the Old Testament, e.g. https://www.academic-bible.com/en/online-bibles/biblia-hebraica-stuttgartensia-bhs/read-the-bible-text/ . I don't feel that calling it an alphabet as opposed to an abjad is helpful, but by the definitions it's an alphabet. Yiddish has been described as using an alphabet, but that may not apply to all Yiddish orthographies.

Richard.

From eliz at gnu.org Sat Oct 24 08:04:24 2015
From: eliz at gnu.org (Eli Zaretskii)
Date: Sat, 24 Oct 2015 16:04:24 +0300
Subject: Terminology (was: Latin glottal stop in ID in NWT, Canada)
In-Reply-To: <20151024134531.191e37a9@JRWUBU2>
References: <1022406759.1183.1445583561682.JavaMail.www@wwinf1h14> <20151023085315.443d56b6@JRWUBU2> <786526406.6308.1445600066983.JavaMail.www@wwinf1h14> <20151023231632.649d706e@JRWUBU2> <83io5wygbz.fsf@gnu.org> <20151024123332.4a3b263d@JRWUBU2> <83twpgwkyo.fsf@gnu.org> <20151024134531.191e37a9@JRWUBU2>
Message-ID: <83r3kkwh7r.fsf@gnu.org>

> Date: Sat, 24 Oct 2015 13:45:31 +0100
> From: Richard Wordingham
>
> The Hebrew script for Hebrew is an alphabet when the niqqud are used, as in ordinary copies of the Old Testament, e.g. https://www.academic-bible.com/en/online-bibles/biblia-hebraica-stuttgartensia-bhs/read-the-bible-text/ . I don't feel that calling it an alphabet as opposed to an abjad is helpful, but by the definitions it's an alphabet.

An alphabet, AFAIU, has to have vowels that are represented as letters, equally to consonants. Hebrew with niqqud doesn't fit that description, because niqqud are not letters.

> Yiddish has been described as using an alphabet

I agree, but I wasn't talking about Yiddish.

From richard.wordingham at ntlworld.com Sat Oct 24 09:17:02 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sat, 24 Oct 2015 15:17:02 +0100
Subject: Terminology (was: Latin glottal stop in ID in NWT, Canada)
In-Reply-To: <83r3kkwh7r.fsf@gnu.org>
References: <1022406759.1183.1445583561682.JavaMail.www@wwinf1h14> <20151023085315.443d56b6@JRWUBU2> <786526406.6308.1445600066983.JavaMail.www@wwinf1h14> <20151023231632.649d706e@JRWUBU2> <83io5wygbz.fsf@gnu.org> <20151024123332.4a3b263d@JRWUBU2> <83twpgwkyo.fsf@gnu.org> <20151024134531.191e37a9@JRWUBU2> <83r3kkwh7r.fsf@gnu.org>
Message-ID: <20151024151702.07bfd524@JRWUBU2>

On Sat, 24 Oct 2015 16:04:24 +0300
Eli Zaretskii wrote:

> An alphabet, AFAIU, has to have vowels that are represented as letters, equally to consonants. Hebrew with niqqud doesn't fit that description, because niqqud are not letters.

My sentiments exactly, but our sentiments don't match the definitions.

Unicode: "A writing system in which both consonants and vowels are indicated."

Daniels & Bright in 'The World's Writing Systems', p. 4: "In an _alphabet_, the characters denote consonants and vowels."

Richard.

From wjgo_10009 at btinternet.com Sat Oct 24 07:44:07 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Sat, 24 Oct 2015 13:44:07 +0100 (BST)
Subject: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)
In-Reply-To: <562A6A5C.7000506@unicode.org>
References: <33365178.33329.1444395933996.JavaMail.defaultUser@defaultHost> <6041047.15822.1444472090118.JavaMail.defaultUser@defaultHost> <26364438.57814.1445015417239.JavaMail.defaultUser@defaultHost> <562914A9.6030802@unicode.org> <7184978.47471.1445612461273.JavaMail.defaultUser@defaultHost> <562A6A5C.7000506@unicode.org>
Message-ID: <22827807.23217.1445690647143.JavaMail.defaultUser@defaultHost>

Rick McGowan referred to Google Translate.

I have been referred to Google Translate previously and I replied.

http://www.unicode.org/mail-arch/unicode-ml/y2011-m01/0112.html

I thought about what Rick wrote, yet the problem is the matter of the provenance of the translation. The clinician could not be sure of the provenance of the translation, even if, in fact, the translation were perfect, which it might be. Whereas with the preset sentences, previously translated manually by a native speaker at the request of the National Standards Body of the country where the language is spoken, the provenance of the translation would be part of the system.

I have used Google Translate on many occasions, mostly for translation into English, just to try to gain an understanding of some text written in a language other than English.

Another issue is that Google Translate requires at-the-time access to a remote server, whereas what I am suggesting would not require at-the-time access to a remote server: perhaps sometimes needing access to a remote server to update sentence lists.
William Overington

24 October 2015

From wjgo_10009 at btinternet.com Sat Oct 24 08:10:12 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Sat, 24 Oct 2015 14:10:12 +0100 (BST)
Subject: VS: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)
In-Reply-To: <000301d10db8$b7e2c250$27a846f0$@fi>
References: <33365178.33329.1444395933996.JavaMail.defaultUser@defaultHost> <6041047.15822.1444472090118.JavaMail.defaultUser@defaultHost> <26364438.57814.1445015417239.JavaMail.defaultUser@defaultHost> <562914A9.6030802@unicode.org> <7184978.47471.1445612461273.JavaMail.defaultUser@defaultHost> <000301d10db8$b7e2c250$27a846f0$@fi>
Message-ID: <5225466.24805.1445692212699.JavaMail.defaultUser@defaultHost>

Erkki I. Kolehmainen wrote:

> First of all, you have never paid any attention to the formidable problems of getting vetted translations of whatever proposed (or to-be-proposed) standard sentences of yours. You have admitted that you are not at all familiar with CLDR, but the people who have worked on CLDR are fully aware of the problems of getting agreement on localized expressions for all kinds of items.

I wrote within http://www.unicode.org/mail-arch/unicode-ml/y2015-m10/0181.html , which is the post to which you replied, the following text.

quote

I said that there would be a standardized list of preset sentences, set out in English, as International Standards are produced in English, and that the National Standardization Body for each country would translate the list into the language of its country and produce a list to convert the codes to the local language.

end quote

Now maybe I am missing some issue here, so if the above suggested process is regarded as problematic I would like to address any problems that are felt to exist.

> The value of deposit at the British Library seems questionable at best. Furthermore, if published means published on this list, it has no value whatsoever, since it does not mean any peer review and acceptance, which, as you well know, isn't forthcoming.

> Furthermore, if published means published on this list, ...

It does not. In the context of this thread, of the pdf document being published, published means published as in United Kingdom law about Legal Deposit. In the particular situation here, published refers to the fact that the pdf document was published in my family webspace by me, the publisher of the document. I am the publisher of the document and also the author of the document.

> Incidentally, the standards body that has had considerable dealings with some of the kinds of problems that you claim to be researching is ETSI Human Factors. You might want to approach them in order to get any support.

Thank you for that information.

William Overington

24 October 2015

From wjgo_10009 at btinternet.com Sat Oct 24 08:29:55 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Sat, 24 Oct 2015 14:29:55 +0100 (BST)
Subject: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)
Message-ID: <27236888.25861.1445693395359.JavaMail.defaultUser@defaultHost>

>> If this invention had been made in the research laboratory of a large information technology company maybe things would be very different.

Steven R. Loomis wrote:

> I would not (and have not) leapt from an idea to a document to a standard.
I won't repeat the good and helpful advice you have already received. Make your first target a working model, not standardization. That is the way the research laboratory of a large information technology company works.

Yes, it would be good for me to start by making a working model. However, I do not have the facilities to do so, nor indeed much of the knowledge and skills to do so either.

So I have tried to do what I can, in the hope that what I can produce might act as a catalyst for people wanting to implement the invention in a standardized manner: the standardization right from the start being very important, so that there is no proprietariness about what is put into use, and in the hope that lots of people will join in and together produce an elegant and useful implemented system.

Well, as at 2:26 pm United Kingdom time today, Saturday, my pdf document has not been added into the Unicode Document Register, nor have I received an email stating that it has been rejected. There is now just over a week to go before the next Unicode Technical Committee meeting, so it remains to be seen whether the document will be discussed by the Unicode Technical Committee at that meeting.

William Overington

24 October 2015

From wjgo_10009 at btinternet.com  Sat Oct 24 08:49:32 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Sat, 24 Oct 2015 14:49:32 +0100 (BST)
Subject: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)

Martin J. Dürst wrote:

> Also, if you had your set of sentences and their translations, it wouldn't be difficult to create e.g. a smart phone application for it.

Thank you. Unfortunately I do not have the facilities and knowledge and skills to produce a smart phone application myself.

> The doctor you mentioned was excited about your idea because she isn't a language specialist. If she had thought about, or experimented with, the idea, she would quickly have come to a point where she wants more and more sentences, for all kinds of slightly different situations.

Let us please call the above paragraph Paragraph A.

> That's the point where she will start to see that your idea isn't actually that great.

Let us please call the above paragraph Paragraph B.

Now from Paragraph A one can say, well, yes, there would need to be quite a collection of sentences encoded, maybe several hundred or even a thousand or more. Maybe lots of clinicians could suggest preset sentences, each from his or her own experience, and a list could be produced. A lot of work, and even then there could still be gaps, so that not every possible situation would be covered. Though there the PanLex dictionary could be used.

I do not follow how Paragraph B follows from Paragraph A.
William Overington

24 October 2015

From root at unicode.org  Sat Oct 24 12:44:35 2015
From: root at unicode.org (Sarasvati)
Date: Sat, 24 Oct 2015 12:44:35 -0500
Subject: The scope of Unicode - Closed

This discussion of "the scope of Unicode" is now closed. Please refrain from sending any further replies to this list.

Thank you for your cooperation.

Regards from your,
-- Sarasvati

From eik at iki.fi  Sat Oct 24 12:56:37 2015
From: eik at iki.fi (Erkki I Kolehmainen)
Date: Sat, 24 Oct 2015 20:56:37 +0300
Subject: VS: VS: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)

Mr. Overington,

You have certainly missed the point. I mentioned CLDR and the practical translation problems that we encounter with it because Unicode has been exceptionally successful in activating people to work with it.

You seem to know of the workings of the National Bodies even less than you do of CLDR. To my knowledge, there is no NB that has the resources (or even the will) to do what you expect them to do. You cannot address this problem unless you have extremely deep pockets and are prepared to fund the operation of the various National Bodies (which probably could not accept this funding anyway), who have had to abandon active participation in several areas that they have deemed important in the past. (That is partially due to the hyper drive by ISO of OSI that turned out to be a catastrophic fiasco.)

On my part, I refrain from addressing this subject area any further on the public list.

Sincerely, Erkki I. Kolehmainen

From: William_J_G Overington [mailto:wjgo_10009 at btinternet.com]
Sent: 24 October 2015 16:10
To: eik at iki.fi; unicode at unicode.org; rick at unicode.org
Subject: Re: VS: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)

Erkki I. Kolehmainen wrote:

> First of all, you have never paid any attention to the formidable problems of getting vetted translations of whatever proposed (or to be ---) standard sentences of yours. You have admitted that you are not at all familiar with CLDR, but the people who have worked on CLDR are fully aware of the problems of getting agreed localized expressions for all kinds of items.

I wrote within http://www.unicode.org/mail-arch/unicode-ml/y2015-m10/0181.html , which is the post to which you replied, the following text.

quote

I said that there would be a standardized list of preset sentences, set out in English, as International Standards are produced in English, and that the National Standardization Body for each country would translate the list into the language of its country and produce a list to convert the codes to the local language.

end quote

Now maybe I am missing some issue here, so if the above suggested process is regarded as problematic I would like to address any problems that are felt to exist.

> The value of deposit at the British Library seems questionable at best.
> Furthermore, if published means published on this list, it has no value whatsoever, since it does not mean any peer review and acceptance, which, as you well know, isn't forthcoming.

> Furthermore, if published means published on this list, ...

It does not. In the context of this thread, where the pdf document has been published, published means published as in United Kingdom Law about Legal Deposit. In the particular situation here, published refers to the fact that the pdf document was published in my family webspace by me, the publisher of the document. I am the publisher of the document and also the author of the document.

> Incidentally, the standards body that has had considerable dealings with some of the kinds of problems that you claim to be researching is ETSI Human Factors. You might want to approach them in order to get any support.

Thank you for that information.

William Overington

24 October 2015

From rick at unicode.org  Sat Oct 24 13:02:28 2015
From: rick at unicode.org (Rick McGowan)
Date: Sat, 24 Oct 2015 11:02:28 -0700
Subject: Unicode Emoji Charts updated

The Unicode Emoji charts have been updated to show the new images from Apple, and the Selection Factors for emoji proposals have also been updated. Among the many other topics at the Unicode technical conference on October 26-28, there will be a new panel session on emoji, for people to find out more about how new emoji are developed.

From doug at ewellic.org  Sat Oct 24 17:57:41 2015
From: doug at ewellic.org (Doug Ewell)
Date: Sat, 24 Oct 2015 16:57:41 -0600
Subject: A Bulldog moves on

I wish this day had never come.

http://obits.dignitymemorial.com/dignity-memorial/obituary.aspx?n=Michael-Kaplan&lc=4246&pid=176192738&uuid=ec6b8cda-c4b1-4b5f-9422-925b1e09a03a

--
Doug Ewell | http://ewellic.org | Thornton, CO

From duerst at it.aoyama.ac.jp  Sat Oct 24 18:41:12 2015
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Sun, 25 Oct 2015 08:41:12 +0900
Subject: A Bulldog moves on

Hello Doug,

Thanks for making us aware of this very sad event. Michael did a lot for Unicode, and fought bravely with his illness. I hope we can all remember him this week at the Unicode Conference, where he gave so many amazing talks.

I also hope that somebody somehow will be able to preserve all his tremendously instructive and funny blogs.

Regards, Martin.

On 2015/10/25 07:57, Doug Ewell wrote:
> I wish this day had never come.
>
> http://obits.dignitymemorial.com/dignity-memorial/obituary.aspx?n=Michael-Kaplan&lc=4246&pid=176192738&uuid=ec6b8cda-c4b1-4b5f-9422-925b1e09a03a

From umesh.p.nair at gmail.com  Sat Oct 24 19:02:56 2015
From: umesh.p.nair at gmail.com (Umesh P N)
Date: Sat, 24 Oct 2015 17:02:56 -0700
Subject: A Bulldog moves on

Very sad news. Anybody who has met him or listened to his talks will remember him for ever. Condolences...

On Sat, Oct 24, 2015 at 4:41 PM, Martin J.
Dürst wrote:

> Hello Doug,
>
> Thanks for making us aware of this very sad event. Michael did a lot for Unicode, and fought bravely with his illness. I hope we can all remember him this week at the Unicode Conference, where he gave so many amazing talks.
>
> I also hope that somebody somehow will be able to preserve all his tremendously instructive and funny blogs.
>
> Regards, Martin.

--
umesh.p.nair at gmail.com

From textexin at xencraft.com  Sat Oct 24 20:20:32 2015
From: textexin at xencraft.com (Tex Texin)
Date: Sat, 24 Oct 2015 18:20:32 -0700
Subject: A Bulldog moves on

I am quite saddened to read this.

Michael had amazing strength and courage in dealing with his physical challenges. He was also very honest and frank in speaking on nearly any subject, even when he was discussing topics about his sponsors or employers. This may have worked against him at times, but it was very much appreciated by his audience, even more so because the technical details were quite accurate and insightful.

He provided a tremendous library of information regarding internationalization, often in subject areas that were not addressed elsewhere: Visual Basic, keyboard support, Microsoft internationalization libraries, and some southeast Asian languages. For many other internationalization topics he offered education that was comprehensible, in a humorous and often personalized manner.

Michael will be missed not only for his wealth of knowledge and experience, but for his frankness and camaraderie.

I am not sure if the list will let the image through, but I attached a picture of Michael at IUC31.

tex

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Doug Ewell
Sent: Saturday, October 24, 2015 3:58 PM
To: unicode at unicode.org
Subject: A Bulldog moves on

I wish this day had never come.

http://obits.dignitymemorial.com/dignity-memorial/obituary.aspx?n=Michael-Kaplan&lc=4246&pid=176192738&uuid=ec6b8cda-c4b1-4b5f-9422-925b1e09a03a

--
Doug Ewell | http://ewellic.org | Thornton, CO

From lists+unicode at seantek.com  Sat Oct 24 23:57:44 2015
From: lists+unicode at seantek.com (Sean Leonard)
Date: Sat, 24 Oct 2015 21:57:44 -0700
Subject: A Bulldog moves on

A very sad day in the history of this community. I learned a lot about Unicode, and about internationalization and localization on Windows, directly through his posts.

And now, having done a bit of research, it looks like he left the Internet a gift, with some recent blog posts about quite a second moonlighting life!

http://www.siao2.com/

RIP Michael. You had a great run.

Sean

On 10/24/2015 3:57 PM, Doug Ewell wrote:
> I wish this day had never come.
> > http://obits.dignitymemorial.com/dignity-memorial/obituary.aspx?n=Michael-Kaplan&lc=4246&pid=176192738&uuid=ec6b8cda-c4b1-4b5f-9422-925b1e09a03a

From charupdate at orange.fr  Mon Oct 26 02:53:40 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Mon, 26 Oct 2015 08:53:40 +0100 (CET)
Subject: A Bulldog moves on

I'm very saddened. A big thank you to Michael S. Kaplan for his great work and help. Sincere condolences to his family.

Marcel

On Sat, 24 Oct 2015 16:57:41 -0600, Doug Ewell wrote:

> I wish this day had never come.
>
> http://obits.dignitymemorial.com/dignity-memorial/obituary.aspx?n=Michael-Kaplan&lc=4246&pid=176192738&uuid=ec6b8cda-c4b1-4b5f-9422-925b1e09a03a

From daniel.buenzli at erratique.ch  Mon Oct 26 12:39:09 2015
From: daniel.buenzli at erratique.ch (Daniel Bünzli)
Date: Mon, 26 Oct 2015 17:39:09 +0000
Subject: Emoji data in UCD xml ?

Hello,

If I read correctly UTR #51, the way of determining whether a scalar value is an emoji character is to consult this data file [1]. Are there any plans to integrate this data in the UCD xml?

Best,

Daniel

[1] http://www.unicode.org/Public/emoji/1.0/emoji-data.txt

From doug at ewellic.org  Mon Oct 26 13:50:59 2015
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 26 Oct 2015 11:50:59 -0700
Subject: Emoji data in UCD xml ?

Daniel Bünzli wrote:

> If I read correctly UTR #51, the way of determining whether a scalar value
> is an emoji character is to consult this data file [1]. Are there any
> plans to integrate this data in the UCD xml?

Apologies in advance for asking an annoying "what are your motives" question, but: What is your goal in classifying a character as "emoji" or "not emoji"?

According to emoji-data.txt, U+00A9 COPYRIGHT SIGN and U+00AE REGISTERED SIGN are emoji, while U+263B BLACK SMILING FACE is not. This may be at variance with what many people would expect.

--
Doug Ewell | http://ewellic.org | Thornton, CO

From daniel.buenzli at erratique.ch  Mon Oct 26 14:33:35 2015
From: daniel.buenzli at erratique.ch (Daniel Bünzli)
Date: Mon, 26 Oct 2015 19:33:35 +0000
Subject: Emoji data in UCD xml ?

On Monday, 26 October 2015 at 18:50, Doug Ewell wrote:

> What is your goal in classifying a character as "emoji" or "not emoji"?

Part of a heuristic for best-effort character width estimation in terminals.

Daniel

From charupdate at orange.fr  Tue Oct 27 07:54:36 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Tue, 27 Oct 2015 13:54:36 +0100 (CET)
Subject: Non-standard 8-bit fonts still in use

I was preparing the following feedback long before the obituary of Michael S. Kaplan. I stay in mourning. Since discussion has restarted, am I allowed to send this today, instead of tomorrow?
Initially it was planned for yesterday, the day when I found Doug Ewell's and the following messages, which brought me the bad news.

I'm grateful for Erkki I. Kolehmainen's advice to complete best effort prior to sending. My apologies for previous shortcomings.

On Thu, 15 Oct 2015 20:22:08 -0400, Don Osborn wrote:

> I was surprised to learn of continued reference to and presumably use of
> 8-bit fonts modified two decades ago for the extended Latin alphabets of
> Malian languages [...]
>
> See my recent blog post for a quick and by no means complete discussion
> about this topic, which of course has to do with more than just the
> fonts themselves:
> http://niamey.blogspot.com/2015/10/the-secret-life-of-bambara-arial.html

Here is another example of legacy font usage less than two years back:

http://csprousers.org/forum/viewtopic.php?f=1&t=753

Legacy fonts offer at least one substantial advantage, which is already underscored in the comments on the cited blog post: they allow the use of any habitual ASCII-oriented keyboard layout, like French in Mali. Personally I feel with all the people who keep using the fonts issued by the "1990s [...] joint project of the Malian Ministry of Education and the French Agence de Coopération Culturelle et Technique (ACCT [...])", and I wouldn't throw away a proven work tool either without being sure of getting a better one. I suppose the cited people in any case didn't use the "clavier unifié français-bambara" that you linked to on another blog post:

http://www.mali-pense.net/IMG/pdf/le-clavier_francais-bambara.pdf
cited on:
http://www.mali-pense.net/Ressources-pour-la-pratique-du.html
cited on:
http://niamey.blogspot.fr/2014/11/writing-bambara-right.html

Indeed there is a big gap in keyboard layouts on Windows: we cannot associate applications with default keyboard layouts the way we can associate file extensions with applications. So one working method, to avoid being bothered with switching keyboard layouts, is to have appropriate templates in the word processor with extra fonts instead of extra layouts.

The glyph issue: To get "Bambara Arial" to work on the internet, a simple macro replacing the legacy code points (on the q, Q, x, X, v, V and other slots) with the proper Unicode letters isn't enough, because, though Arial, it won't be "true Bambara" any more, given the inconsistency of all the fonts I could view, which use the "n" form for uppercase eng, like Arial and Times do, while sticking with the "N" form for uppercase palatal n. I believe this is not just "more to it" but even the main reason, despite opposite opinions in the comments. (A sketch of such a remapping follows after this message.) I couldn't gather any suitable font, but book printers must have them, and possibly both shapes in the same font. Such fonts seem to be really confidential. In a bilingual Bambara-French book from France (1996), the typeface clearly shows that the "n"-shaped uppercase letter has been emulated by using oversize lowercase. Ported to HTML, this workaround results in replacing the uppercase letter by lowercase in a [font-size: 135%; line-height: 75%; font-weight: lighter;] span, though it keeps looking semi-bold when weights below 400 are unavailable. That can make one aware that it doesn't render in plain text. And it works best with sans-serif fonts. I really don't know whether this has at least some resemblance to Bambara Arial, and I do wish to be able to check. I note, too, that such a construct is not Unicode conformant.

It would be desirable to overcome that system of special fonts, workarounds, and limited support. I don't know whether some communities really prefer the "N"
form for uppercase palatal n, or even for eng, or both. Was there a problem at the time when the actual fonts were created? Does anybody know a solution? I am inclined to believe that this would eventually be language tagging and the use of modern rendering engines, along with up-to-date fonts providing both glyphs. However, in my opinion, correct display of so widely spoken and written a language as Bamanankan should not have to rely on sophisticated byways.

About Unicode-aware education: I'm not likely to share presumptions about lack of training in Mali more than in other countries, including European ones. Keyboard layout documentation from Europe, last updated after fifteen years of Unicode, still targets non-conformant rendering engines (where precomposed and decomposed characters display differently) and doesn't mind using canonical decomposition afterwards to streamline the input (actually, for Bambara, by using the French accented-letter keys). Well, the existence of decomposition in text processing was about the first thing Unicode taught me, as I was too ignorant to directly point the browser to TUS and the UAXes to learn about, while already bothering about creating keyboard layouts... That turns out not to be an isolated case. And with the one-laptop-per-child program, Malian and other African technicians will be making far better keyboards than Europeans did.

I'd quoted some parts from the following forum page of 2006, but removed them to shorten. I believe it says a lot about the topic. Interested subscribers are welcome to look it up on line:

Ibibio, Efik, Anaang and ICT (fonts, keyboards, applications)
http://www.quicktopic.com/37/H/q8r5VVqGF5Q

Yet another solution is a universal Latin backward-compatible layout on a French keyboard, optimized for Bambara and including N'Ko:

1 - Universal Latin is rather intuitive and backwards compatible, with a Compose key plus five classic dead keys (on three physical keys); optimization is performed by (a) filling up the keys on the third and fourth levels, and (b) filling up the circumflex group (and other groups) with the additional characters, given that circumflex is already the base shift state dead key on the French layout, and Q, V, X don't exist with circumflex. That definitely overcomes the traditional but unnecessary limitation of dead keys to one diacritic, instead of considering that every dead key gives access to a natural group (as opposed to the artificial groups defined in the global keyboard standard), as the US International keyboard had started doing for some accented letters.

2 - N'Ko being a caseless script, its 59 characters can be placed into the Kana shift states. The standard Kana toggle in Windows drivers requires a whole key (e.g. above Tab, suitable for the French layout), while Keyman allows adding a supplemental toggle that doesn't need its own key, invoked for example by Ctrl+CapsLock. That toggle will allow N'Ko to be integrated.

3 - Universal Latin makes sense also because it covers *all* the characters listed on this page:

http://www.bisharat.net/A12N/charsum.html

as well as all the circled (plain text) digits and letters, and in its final stage a lot of symbols and pictographs.

Good luck,

Marcel
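A minimal sketch of the remapping macro discussed in the message above, assuming a purely hypothetical legacy font that used the q, Q, x, X, v, V slots for Bambara letters; any real conversion table would have to be read off the actual legacy font being replaced, and, as noted above, remapping code points cannot fix glyph-level issues such as the preferred shape of uppercase eng.

    # Python sketch: remap text typed with a hypothetical legacy 8-bit
    # Bambara font onto real Unicode code points. The table is an invented
    # example; each legacy font had its own ASCII slot assignments.
    import unicodedata

    LEGACY_TO_UNICODE = str.maketrans({
        "q": "\u0272",  # ɲ LATIN SMALL LETTER N WITH LEFT HOOK
        "Q": "\u019D",  # Ɲ LATIN CAPITAL LETTER N WITH LEFT HOOK
        "x": "\u014B",  # ŋ LATIN SMALL LETTER ENG
        "X": "\u014A",  # Ŋ LATIN CAPITAL LETTER ENG
        "v": "\u025B",  # ɛ LATIN SMALL LETTER OPEN E
        "V": "\u0190",  # Ɛ LATIN CAPITAL LETTER OPEN E
    })

    def remap(legacy_text):
        """Translate legacy font slots to Unicode, then normalize to NFC."""
        return unicodedata.normalize("NFC",
                                     legacy_text.translate(LEGACY_TO_UNICODE))

    print(remap("mvnv"))  # a made-up legacy spelling, yielding "mɛnɛ"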
From unicode at mxmerz.de  Tue Oct 27 16:03:43 2015
From: unicode at mxmerz.de (Max Merz)
Date: Tue, 27 Oct 2015 22:03:43 +0100
Subject: Emoji Proposal: Face With One Eyebrow Raised

Hello,

I would like to submit a proposal to encode an emoji depicting a "face with one eyebrow raised", as to indicate scepticism, surprise, concern, or disagreement.

The "Submitting Character Proposals" page on unicode.org recommends discussing preliminary proposals on this mailing list. I am currently working on my proposal, but I would appreciate general feedback about whether this idea is doomed from the start, has already been discussed, comes at a bad time, etc.

Best regards,

Max Merz

From sarabiarafael at gmail.com  Wed Oct 28 04:59:49 2015
From: sarabiarafael at gmail.com (Rafael Sarabia)
Date: Wed, 28 Oct 2015 10:59:49 +0100
Subject: Unicode equivalence between Word for Windows/MAC

Dear all,

I need to use a document both in Word 2007 for Windows and Word 2011 for Mac, and I'm finding some incompatibility issues.

The file has been created in Word for Windows and saved as "Unicode" (which, I believe, although I am not certain, means "UTF-16").

In Word 2011 for Mac I have several options to save it as Unicode: Unicode 6.1, Unicode 6.1 Little-Endian, and UTF-8. None of them seem to be equivalent to the "Unicode" encoding in Word 2007 for Windows.

My question is very simple: which encoding in Word 2011 for Mac is equivalent to "Unicode" in Word for Windows (*and allows me to work in both operating systems/Word programs interchangeably*)? One of the three abovementioned possibilities, or another one? (I don't have the complete list in front of me.)

Thank you very much in advance.

Kind regards.

Rafael Sarabia

From verdy_p at wanadoo.fr  Wed Oct 28 11:06:15 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 28 Oct 2015 17:06:15 +0100
Subject: Unicode equivalence between Word for Windows/MAC

"Unicode 6.1" just indicates the version of the repertoire; Mac is indicating there that the encoding is based on Unicode (just like Windows) and that it is UTF-16. With the "Little-Endian" option, the file is saved with UTF-16 code units in little-endian order (as on PCs and on today's x86-based Macs); on old versions of MacOS the byte order was big-endian (as those Macs were not based on the x86 architecture, but on the Motorola 68K in the first generations and later on PowerPC).

Word documents actually have two internal structures: the old .doc binary format, and later the .docx format, which is based on XML (the actual files are in fact ZIP archives containing multiple XML files). Each of these XML files declares its own encoding, which can be UTF-8; UTF-16 starting with a "byte order mark" (BOM) indicating whether it is little-endian or big-endian; or "UTF-16LE", where there is no leading BOM but the order is assumed to be little-endian. The UTF-8 encoding does not depend on byte order; however, it may take a bit more space than UTF-16 when working with non-Latin scripts. This difference in size (significant for languages like Japanese, Korean or Chinese) is nonetheless insignificant for .docx files, as they are internally compressed in a ZIP archive.
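A minimal sketch of what that structure looks like in practice, assuming a hypothetical file named example.docx; the member name word/document.xml is the conventional main part of a .docx package.

    # Python sketch: a .docx file is a ZIP archive of XML parts, and each
    # part's leading bytes reveal its encoding (BOM and/or XML declaration).
    import zipfile

    with zipfile.ZipFile("example.docx") as docx:   # hypothetical file name
        head = docx.read("word/document.xml")[:64]

    # A UTF-16 part starts with FF FE (little-endian) or FE FF (big-endian);
    # a UTF-8 part usually starts directly with b'<?xml version="1.0" ...'.
    print(head)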
Additionally, files on the Mac not only contain the document but also have a legacy "resource fork" containing some metadata about the application that created the document, tagging its internal format and some other format-specific options. But that metadata is generally not transmitted if you forward the file, for example as an email attachment.

UTF-8 and UTF-16 encodings should all be interoperable between Windows and MacOS anyway, notably for files in .docx format (internally based on XML), except for a few characters that are specific to MacOS (such as the Apple logo) and only supported by private encodings, or by the PUA with MacOS-specific fonts, when mapped to Unicode.

The main differences will not really be in the character encoding but in the internal support of custom extensions for Word, such as active scripts, or embedding options for documents created by other apps, possibly not from Microsoft itself. Those Word documents will load on all platforms, but some embedded items may not be found on the target machine, including fonts used on the source machine but not embedded in the transmitted document. For this reason, Word/Office comes on the Mac with a basic set of fonts made by Microsoft and supported by Windows, installed with Word/Office as complements, including Arial, Verdana, Courier New, Times New Roman, and so on, whereas MacOS natively had Helvetica, Courier, Times, and others, which are very similar in terms of design but have slightly different metrics.

The main cause of incompatibility will then be if you create a document on your Mac with fonts specific to your system but not preinstalled with Office on both platforms: if a font is missing there are some fallbacks to other fonts, but the document layout may be slightly altered due to the different font metrics.

The second kind of incompatibility will occur if you have embedded in your Word document some "objects" created by Windows-specific or MacOS-specific apps that the recipient does not have on his system: those embedded components may just show as a blank rectangle on the page.

The third kind of incompatibility comes from scripting: embedded scripts in your document may be disabled automatically on the recipient machine unless those components are installed, their use is authenticated by a digital signature of the document by its creator, and you have given trust to this creator to execute his scripts on your system. This is not specific to Windows or MacOS, and those scripts will be disabled on the recipient machine even if the recipient has the same OS and the same version of Word as the initial creator of the document. Office will display in this case a "yellow warning banner" about those (unsigned or untrusted) scripts, but the document should still be readable and editable without those scripts, even if the scripts (most often written in VBScript) are not runnable. Those scripts are normally helper tools used on the creator's PC, intended to help create/edit the document, but they should not be needed to read the document "as is". When you create a file with those tools, you should save the file in a statically rendered version where those tools are purged from the document, or a version where those embedded scripts are signed by an (Office) tool installed and trusted on both systems.
But character encodings are not an issue if those encodings are Unicode-based. UTF-8 and UTF-16 are interoperable between Windows, MacOS, Linux, Unix and almost all modern OSes; this was not the case with old 8-bit charsets such as the Windows codepages, or MacRoman and similar, due to their many national variants, or variants specific to some versions of these OSes. The exception is if the recipient uses an old version of the Office apps on an old OS without native Unicode support.

UTF-8 and UTF-16 have been present and natively supported in Windows since Windows 98. On Windows 95 you may have to install an additional Unicode compatibility layer, and old versions of Windows 3.x or before may not support those encodings correctly; they will typically also not support the newer XML-based .docx or .odt formats, but only the legacy .doc format or older .rtf formats, with a limited set of private encodings and a limited character repertoire. Those old Office apps may then display some "tofu" if the encoding is not supported. (Note also that Office comes with a set of "format converters" to help convert incoming documents to one of the supported formats, but the conversion may be lossy or approximate in some cases.)

For all versions of Windows with a native "Win32" API (instead of the old "Win16" and "DOS"/"BIOS"-like APIs), you should never encounter issues with Unicode encodings; the same applies to all versions of Mac OS since MacOS X (which has native support of Unicode, instead of the legacy Mac codepages).

Native support of the Unicode UTFs is no longer an option on most OSes (and at least on all OSes where you'll use an Office application, a web browser, or any graphical desktop environment); it is preinstalled by default on all modern OSes, including Unix/Linux. Old 8-bit encodings and fonts for X11 are still supported on those systems, and a few default fonts for those legacy encodings are still installed, but most applications no longer use or need them, except for the system console/shell used in non-graphic mode for legacy terminal emulations such as the VT-220 and similar, or on embedded versions of Linux which don't actually need to render or decode any text and transmit text data "as is", or which only support basic ASCII or a single 8-bit codepage for those consoles, without support for internationalization.

2015-10-28 10:59 GMT+01:00 Rafael Sarabia :

> Dear all,
>
> I need to use a document both in Word 2007 for Windows and Word 2011 for
> Mac, and I'm finding some incompatibility issues.
> [...]
>
> Rafael Sarabia
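A minimal sketch of how such a plain-text "Unicode" file can be told apart in practice, by sniffing its leading byte order mark; the file name is a hypothetical placeholder, and the fallback guess is only a heuristic.

    # Python sketch: guess among the "Unicode" flavours that word processors
    # offer for plain-text saves, from the file's leading byte order mark.
    import codecs

    def sniff(data):
        """Return a best-effort encoding guess from a leading BOM, if any."""
        if data.startswith(codecs.BOM_UTF8):
            return "utf-8-sig"      # UTF-8 with BOM, common on Windows
        if data.startswith(codecs.BOM_UTF16_LE):
            return "utf-16-le"      # what Word for Windows saves as "Unicode"
        if data.startswith(codecs.BOM_UTF16_BE):
            return "utf-16-be"      # "Unicode big endian"
        return "utf-8"              # no BOM: plain UTF-8 is the safest guess

    with open("example.txt", "rb") as f:  # hypothetical file name
        print(sniff(f.read(4)))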
From doug at ewellic.org  Wed Oct 28 11:52:51 2015
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 28 Oct 2015 09:52:51 -0700
Subject: Unicode equivalence between Word for Windows/MAC

Rafael Sarabia wrote:

> My question is very simple: which encoding in Word 2011 for Mac is
> equivalent to "Unicode" in Word for Windows (*and allows me to work in
> both operating systems/Word programs interchangeably*)? One of the three
> abovementioned possibilities, or another one? (I don't have the complete
> list in front of me.)

There seems to be an issue with Word for Mac in reading and writing Unicode text, which should be as simple and straightforward as you expect:

http://answers.microsoft.com/en-us/mac/forum/macoffice2011-macword/why-does-word-for-mac-always-mangle-unicode-text/ad95c7ab-ab56-45af-8a74-51d26f079d25

--
Doug Ewell | http://ewellic.org | Thornton, CO

From jkorpela at cs.tut.fi  Wed Oct 28 13:05:53 2015
From: jkorpela at cs.tut.fi (Jukka K. Korpela)
Date: Wed, 28 Oct 2015 20:05:53 +0200
Subject: Unicode equivalence between Word for Windows/MAC

28.10.2015, 11:59, Rafael Sarabia wrote:

> I need to use a document both in Word 2007 for Windows and Word 2011 for
> Mac, and I'm finding some incompatibility issues.

Before going into the details of plain-text file encodings, I think it is important to decide whether you need to use plain text for information interchange. If you simply save the document in Word format (perhaps safest in the older format, a .doc file), I would expect the Windows and Mac versions to play well together.

If you specifically do not want to use Word format, then why are you using Word in the first place, as opposed to using a Unicode-aware plain-text editor? (This is a genuine question, not rhetorical. There are possible reasons for using Word even when you want plain text, e.g. the use of the spell-checking features in Word.)

Yucca

From charupdate at orange.fr  Thu Oct 29 03:29:15 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Thu, 29 Oct 2015 09:29:15 +0100 (CET)
Subject: Latin glottal stop in ID in NWT, Canada

On Thu, 15 Oct 2015 15:46:46 -0700, Leo Broukhis wrote:

> Along the same lines, should I be able to change my last name
> officially to ?pyx?c? (NB all letters are codepoints with names
> starting with "LATIN").

This request results in using Latin to imitate Cyrillic in a country where this kind of approach has never been official. In case somebody has missed that: the current thread is about enforcing *legal* orthography in a country where it is part of *official* languages.

Arguing that this aping of Cyrillic is "along the same lines" is stacking insult upon insult. It's a true example of the way jokification can be perverted. As that has been done in public, it brings the need for a public apology, particularly with respect to future archive readers.

This has already been raised off-list, in conformance with the List Policies. However, I feel the need to send it "on the record", so everybody is reassured that there was more than one single person defending the seriousness of the thread's subject. On this occasion, I apologize in turn for having felt obliged to give a reply on the spot in public, being the addressee of the above request.
A more considered answer two weeks later ends up being better than a bad one delivered in time.

Expectantly,

Marcel

From charupdate at orange.fr  Thu Oct 29 04:17:23 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Thu, 29 Oct 2015 10:17:23 +0100 (CET)
Subject: Terminology (was: Latin glottal stop in ID in NWT, Canada)

On 2015-10-23 20:17 GMT+08:00, gfb hjjhjh wrote:

> writing other languages in Latin alphabet is still called romanization not latinization.

The legacy wording is due to the very comprehensive cultural phenomenon that was originally referred to. Today, referring to the "*Roman* alphabet" mainly adds an antiquated touch to the functional confusion between a script and an alphabet. IMO, when we still think in terms of "Roman alphabet" instead of "Latin script", we seem somehow not to have been given the opportunity to grow up into the age we are living in. Here the guilty ones are the word processors that have "italic" as a toggle to streamline the UI, so that users can't learn how to call "put it back to roman".

Then, accurate expression is challenged in English by compound semantics: "scripture" implies holiness, and "script" implies handwriting, while the hypothetical "writ" would be a confusing tongue-twister. But "script" for "writing system" looks good to me, and certainly to all people familiar with Unicode thanks to some training. Thanks to the Unicode Consortium! Looking at how many North Americans are still missing the point, we get another reason not to believe that Africans could be less trained about Unicode than people on other continents are.

Best,

Marcel

From verdy_p at wanadoo.fr  Thu Oct 29 08:25:52 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 29 Oct 2015 14:25:52 +0100
Subject: Latin glottal stop in ID in NWT, Canada

2015-10-29 9:29 GMT+01:00 Marcel Schneider :

> On Thu, 15 Oct 2015 15:46:46 -0700, Leo Broukhis wrote:
>
> > Along the same lines, should I be able to change my last name
> > officially to ?pyx?c? (NB all letters are codepoints with names
> > starting with "LATIN").
>
> This request results in using Latin to imitate Cyrillic in a country
> where this kind of approach has never been official.

The glottal stop is used in African countries that have never used any Cyrillic alphabet. That letter is a full part of the Latin alphabet, and is needed for those languages. The Latin glottal stop also competes with a representation in the Arabic script (the letter form, however, is different).

On the opposite, Native Americans HAVE used the Cyrillic script in Alaska, and probably as well in North-Western territories in Canada, at a time when native languages started to be alphabetized by missionaries.
Today, it is natural that native Americans who have strong cultural and linguistic links with other native peoples all around the Arctic Circle (in Alaska, Northern Europe, and Northern and Eastern Russia) want to restore their communications, even if they now live in different countries that have other dominant languages.

I don't think that the glottal stop is strange in the Latin script. It is also part of the IPA symbols (where it is unicased), but outside IPA that letter should be dual-cased, like other Latin letters. Until recently the lowercase form was not encoded, only because the glottal stop was initially encoded for IPA.

> In case somebody has missed that: the current thread is about enforcing
> *legal* orthography in a country where it is part of *official* languages.

Canadian Syllabics is also used, and maybe this was the reason for opposing the letter. But as soon as Canada wants romanizations, syllables present in Canadian Syllabics should be represented correctly in Latin too. That letter has been encoded for a long time. With enough time, the lowercase form will get used too, because that is natural for the Latin script (which was initially also unicameral; lowercase letters only appeared during the Middle Ages).

> Arguing that this aping of Cyrillic is "along the same lines" is stacking
> insult upon insult.

Arguing about Cyrillic is a bad view. That letter is Latin, just like the other stops (or additional letters such as the schwa or the open O) needed for many minority languages around the world that have been romanized. In all times, the Latin script has been adapted to represent significant differences, with variants of base letters such as overstrikes or combining accents (which were slowly added over time, and only later formalized in stable orthographies). English also used more letters than those used today (e.g. wynn, and thorn, the latter still used in Nordic European countries).

> It's a true example of the way jokification can be perverted.

This can only come from someone in an administration that is not really aware of the history of languages, cultures and scripts. That person simply ignores the long evolution of the Latin script, including for the English and French used in today's Canadian administration and government. This is an educational problem for that person, or a lack of training. There are certainly many smart persons in the Canadian administration who could have argued against that limited vision based on modern English and French.

> As that has been done in public, it brings the need for a public apology,
> particularly with respect to future archive readers.
>
> This has already been raised off-list, in conformance with the List Policies.
> However, I feel the need to send it "on the record", so everybody is
> reassured that there was more than one single person defending the
> seriousness of the thread's subject.

From kenwhistler at att.net  Thu Oct 29 11:14:18 2015
From: kenwhistler at att.net (Ken Whistler)
Date: Thu, 29 Oct 2015 09:14:18 -0700
Subject: Emoji data in UCD xml ?

There has been some preliminary discussion of this. The problem is that
The UTC would first need to determine exactly what property (or list of properties) is involved, before incorporating it (or them) formally into the Unicode Character Database (UCD) and into the XML version of the UCD, and the documentation of it (or them) formally into UAX #44. --Ken On 10/26/2015 10:39 AM, Daniel B?nzli wrote: > If I read correctly UTR #51, the way of determining if a scalar value is an emoji character is to consult this data file [1]. Are there any plans to integrate this data in the UCD xml ? > > From charupdate at orange.fr Thu Oct 29 11:23:17 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Thu, 29 Oct 2015 17:23:17 +0100 (CET) Subject: Latin glottal stop in ID in NWT, Canada In-Reply-To: References: <1786035177.2903.1446107355879.JavaMail.www@wwinf1n04> Message-ID: <282388625.14143.1446135798058.JavaMail.www@wwinf1f02> On Thu, 29 Oct 2015 14:25:52 +0100, Philippe Verdy" wrote: > 2015-10-29 9:29 GMT+01:00 Marcel Schneider : > >> On Thu, 15 Oct 2015 15:46:46 -0700, Leo Broukhis wrote: >> >>> Along the same lines, should I be able to change my last name >>> officially to ?pyx?c? (NB all letters are codepoints with names >>> starting with "LATIN"). >> >> This request results in using Latin to imitate Cyrillic in a country where this kind of approach has never been official. > The glottal stop is used in African countries that have never used any Cyrillic alphabet. That letter is full part of the Latin alphabet but needed for those languages. That Latin glottal stop competes also with a representation in the Arabic script (the letter form however is different). The request I was referring to is Leo's. Otherwise, Philippe's contribution is informative and useful. It just shouldn't have been sent in *reply* to the above. I'm not likely to respond further on this topic, as I stopped to be concerned. I just repeat myself quoting: > Arguing that this aping Cyrillic be ?along the same lines?, is stacking insult over insult. > It?s a true example of the way how jokification can be perverted. > As that has been done in public, it brings the need of a public apology, particularly with respect to future archive readers. > This has already been exposed off list, in conformance to List Policies. However I feel the need to send it ?on the record?, so everybody is reassured that there was more than one single person defending the serious of the thread?s subject. Expecting, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From leob at mailcom.com Thu Oct 29 12:20:35 2015 From: leob at mailcom.com (Leo Broukhis) Date: Thu, 29 Oct 2015 10:20:35 -0700 Subject: Latin glottal stop in ID in NWT, Canada In-Reply-To: <282388625.14143.1446135798058.JavaMail.www@wwinf1f02> References: <1786035177.2903.1446107355879.JavaMail.www@wwinf1n04> <282388625.14143.1446135798058.JavaMail.www@wwinf1f02> Message-ID: Dear Marcel, In proposing my "along the same lines" post, I was intending not to mock the alleged feelings of the involved but to underline the impracticality of the idea by providing an extreme example of proliferating arbitrary characters defined as "latin letter" in Unicode in all documents, and the concomitant issues with maintaining public records, communicating with public officials, etc. 
However, I've just realized that the request to allow the glottal stop character in the name is likely to be limited to the birth certificate, which is absolutely fair; and I presume that it is well understood that in all everyday documents and materials requiring ease of data interchange, like a government ID, a credit card, or a passport, an official transliteration, be it a hyphen or an apostrophe, will be used.

Therefore, I apologize.

Regards,
Leo

On Thu, Oct 29, 2015 at 9:23 AM, Marcel Schneider wrote:
> [...]

From mark at macchiato.com  Thu Oct 29 13:20:50 2015
From: mark at macchiato.com (Mark Davis)
Date: Thu, 29 Oct 2015 11:20:50 -0700
Subject: Emoji data in UCD xml ?

As Ken said, there's been some preliminary discussion, but we wanted to get initial information out in connection with UTR #51 first, and take more time to consider what UCD properties would look like, and which are necessary.

The basic information that people want to access for implementations is:

- Is a character emoji or not?
- Which emoji have default text presentation? (the others having emoji presentation)
- Which emoji are modifiers, and which are modifier bases? (the others being neither)
- Which sequences of emoji (ZWJ and/or combining marks) are recommended, for those who support them?
- Flags and modifier sequences are specified algorithmically, and don't need to be listed.

The levels, the distinction between primary and secondary, and the carrier sources were useful in the development of the emoji data and TR51, but aren't really necessary for implementations.

Mark

On Thu, Oct 29, 2015 at 9:14 AM, Ken Whistler wrote:

> There has been some preliminary discussion of this.
> The problem is that the data in emoji-data.txt has not yet been formally
> rationalized into a coherent set of Unicode character properties. The UTC
> would first need to determine exactly what property (or list of properties)
> is involved before incorporating it (or them) formally into the Unicode
> Character Database (UCD) and into the XML version of the UCD, and the
> documentation of it (or them) formally into UAX #44.
>
> --Ken

From petercon at microsoft.com  Fri Oct 30 01:07:36 2015
From: petercon at microsoft.com (Peter Constable)
Date: Fri, 30 Oct 2015 06:07:36 +0000
Subject: Latin glottal stop in ID in NWT, Canada

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Philippe Verdy
Sent: Thursday, October 29, 2015 6:26 AM

> On the opposite, Native Americans HAVE used the Cyrillic script in Alaska
> and probably as well in North-Western territories in Canada...

In Alaska, yes, because the languages in question are, in fact, Siberian languages.

But where have you gotten the idea that Cyrillic script has been used in orthographies for languages spoken in the Northwest Territories? I've never seen any indication of that, and I am very doubtful.

(Btw, it's "the Northwest Territories", not "North-Western territories".)

Peter

From verdy_p at wanadoo.fr  Fri Oct 30 07:59:42 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 30 Oct 2015 13:59:42 +0100
Subject: Latin glottal stop in ID in NWT, Canada

Borders around Alaska were very fuzzy, and native Americans were mobile in the region. It seems unavoidable that at some time some of their languages were written down by missionaries, and books/religious texts were exchanged around. As well, even before Alaska was sold by the Russian Empire to the USA, there were many Russian migrants going to Canada and the USA via Alaska, and also meeting native Americans. The US and British Canadian authorities were not as active in those areas as they are today, and the aboriginal populations (as well as many migrants) were certainly more autonomous and more mobile than they are today, and had more cultural exchanges. At that time they were still not the small minorities they are today, and their usage of English and French was much less common.

PS: Note that I used the term "probably". "North-West Territories" is only today's name of an organized Canadian province. For a long time, this area was not incorporated, so I used a *generic* term (with "territories" in lowercase), and the term I used was probably referring to the whole Arctic region, where native Americans travelled long distances across the seasons for their traditional fishing and hunting.
If you look at a "common" map centered on the equatorial line, the Arctic region seems enormous; but look at a map centered on the pole, and consider what the limits of the ice shelves were in past centuries and how those populations were living in the area, independently of the European/American and Asian countries established to the south. The Arctic Ocean was an essential resource, and people lived all around it, on quite a thin strip of land and on ice shelves with very scarce resources. They had to be mobile, and received little help from the south. But the area was also regularly visited by European and Asian fishers and explorers, notably from Russia, looking for routes to the Atlantic or Pacific, selling products to the local native populations, or trying to bring them under some imperial rule.

There were also many more active native languages than those that remain today. Many of them are now extinct, or persist only in some old transcriptions written in the Latin or Cyrillic alphabets (possibly in sinograms or the Mongolian script too, via Chinese or Japanese explorers, fishers and merchants from their former imperial regimes: there could remain old books transcribing some of those old Arctic native languages). But these old transcriptions may have been preciously kept by today's native peoples in their local communities, or they could remain in some museum or public library anywhere around the Northern hemisphere.

2015-10-30 7:07 GMT+01:00 Peter Constable :

> From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Philippe Verdy
> Sent: Thursday, October 29, 2015 6:26 AM
>
> > On the opposite, Native Americans HAVE used the Cyrillic script in Alaska
> > and probably as well in North-Western territories in Canada...
>
> In Alaska, yes, because the languages in question are, in fact, Siberian languages.
> [...]
>
> Peter

From petercon at microsoft.com  Fri Oct 30 10:56:30 2015
From: petercon at microsoft.com (Peter Constable)
Date: Fri, 30 Oct 2015 15:56:30 +0000
Subject: Latin glottal stop in ID in NWT, Canada

> If you look at a "common" map centered on the equatorial line,

Philippe, I have personal ties to northern Canada. I'm aware of the distances. Alaska is comparable to the combination of France, Germany, Poland, Belarus and Ukraine. The distances involved are comparable to migrating from the Ural Mountains to France.

> "North-West Territories" is only today's name of an organized Canadian province.

You said your earlier reference was with a more generic meaning. But now you clearly misspell when referring to the administrative entity, even after I gave you the correct spelling. Also, in Canada, territories are not considered provinces: these different types of administrative unit have distinct statuses in relation to the constitution and the federal government.

Russian migrants going to wherever doesn't seem relevant to me.
Yes, potentially they can influence other peoples, but the only kinds of migrants that tend to influence literacy among other people groups are missionaries, and I'm not aware of Russian missionaries having worked in the Northwest Territories. The languages in question are spoken in coastal regions of Alaska. You either have to cross the width of Alaska or else cross the tall coastal mountains before you reach the northwestern territories of Canada. It seems very unlikely to me, given that you're dealing with very, very different ecological and climatic zones.

> there could remain old books

I could just as readily speculate that early Gauls in Normandy wrote with early ideographic writing. After all, it is far easier to migrate across Eurasia, with much less variation in climatic zones, than to go from the Alaskan coast to the Canadian interior. Rather than speculate, can we just stick to documented attestations we can point to? Hypothetical possibilities about Cyrillic don't seem too relevant to the topic of actual glottal stop usage in Canada, which is fairly well documented.

Peter
From richard.wordingham at ntlworld.com Fri Oct 30 14:09:03 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 30 Oct 2015 19:09:03 +0000
Subject: Latin glottal stop in ID in NWT, Canada
In-Reply-To: References: <1786035177.2903.1446107355879.JavaMail.www@wwinf1n04>
Message-ID: <20151030190903.4a97dc45@JRWUBU2>

On Fri, 30 Oct 2015 06:07:36 +0000 Peter Constable wrote:

> > On the opposite, Native Americans HAVE used the Cyrillic script in
> > Alaska and probably as well in North-Western territories in Canada...
>
> In Alaska, yes, because the languages in question are, in fact,
> Siberian languages.

I wouldn't describe Tlingit as a Siberian language. There are some old Cyrillic-script Christian materials in Tlingit. The Canadian connections are in British Columbia and Yukon.

Richard.

From js_choi at icloud.com Fri Oct 30 13:51:45 2015
From: js_choi at icloud.com (J.S. Choi)
Date: Fri, 30 Oct 2015 13:51:45 -0500
Subject: On emoji and the two rightwards black arrows
Message-ID:

# On emoji and the two rightwards black arrows

This is a long post, and I apologize for that; it's a somewhat complicated topic. The post is about two encoded characters: U+27A1 Black Rightwards Arrow and U+2B95 Rightwards Black Arrow <http://www.unicode.org/charts/PDF/U2B00.pdf>.

- The post first reviews their encodings' respective histories, as I currently understand them; hopefully I'm not mistaken about anything.
- It then informally suggests that U+2B95 be added to emoji-data.txt (and possibly be given standardized text/emoji variants), as U+27A1 already has been, on the basis that U+2B95 is equally suited, if not better suited, than U+27A1 to serve as a general rightwards arrow symbol.
- It also proposes that clarification about the differences between their intended functions, and about when to use one versus the other, be added to their entries in the code charts, as per their contrasting histories.

I don't intend to be making anything like a formal proposal yet, but I might in the future.
For now, I'd like to clarify the characters' respective intended purposes and see how feasible or likely the proposed changes would be before investing time, etc. in a formal proposal.

## History

The history below is taken from the following posts:

- Ken Whistler: 2015-05, 2015-10.
- Mark Davis: 2015-10.
- Michel Suignard: 2015-05. (Note that this post contains paragraphs quoted from another person that are not marked differently, with Suignard's replies below each one.)

- 1993: The glyphs from the ITC Zapf Dingbats typeface were encoded in the Unicode Standard 1.1 for compatibility with PostScript printers that use them. This included U+27A1 Black Rightwards Arrow. The Zapf Dingbats arrows all face rightwards, as generically rotatable arrow glyphs. No leftwards, upwards, or downwards versions of the arrows were encoded, because PostScript printers were assumed to rotate the generic rightwards arrows of the original Zapf Dingbats fonts. U+27A1's representative glyph is taken from Zapf Dingbats.

- 2003: Representatives of North Korea (the DPRK) submitted a proposal to add compatibility characters for a DPRK encoding standard. These included black-filled arrows in the four cardinal directions. The proposal included only leftwards, upwards, and downwards black arrows, apparently because the representatives believed that U+27A1 fit their purposes for compatibility with their rightwards black arrow. The former three were encoded as U+2B05–U+2B07 in the Unicode Standard 4.0. Their representative glyphs and names were taken from the DPRK proposal; the glyphs and names thus did not align with U+27A1 (e.g., U+2B05 Leftwards Black Arrow vs. U+27A1 Black Rightwards Arrow). Whistler states that "nobody commented on" them and "nobody much cared, because these were compatibility additions for a DPRK standard, and weren't mapped to any commercial sets at the time, anyway" (2015-05). The unification of the new DPRK compatibility arrows U+2B05–U+2B07 with rotations of the Zapf Dingbats arrow U+27A1 was implied by the Standard but never made explicit. For the next decade, most fonts implementing all four characters used glyphs matching the code charts' (i.e., the mismatching Zapf Dingbats glyph for the right arrow, and the DPRK glyphs for the other black arrows).

- 2011–2013: Google, Apple, and Microsoft began to support emoji characters from Japanese cellular carriers, using characters from the Unicode Standard 6.0. Four of those Japanese-carrier characters are black arrows in the four cardinal directions (UTR #51). The three companies used the DPRK-compatibility black arrows U+2B05–U+2B07 for three of them and, presumably because it was assumed to be part of their set and there was no better alternative, the Zapf Dingbats arrow U+27A1 for the final, rightwards black arrow from the Japanese-carrier emoji. Based on then-current usage, these four characters' mappings, among others, were added to a new, separate Unicode data file for emoji data. The data in this file have "not yet been formally rationalized into a coherent set of Unicode character properties" (Whistler 2015-10).

- 2014: A "complete re-rationalization of all the arrows symbols" occurred (Whistler 2015-05) in the Unicode Standard 7.0, due to the addition of arrows from Wingdings, Wingdings 2, and Webdings. The DPRK-compatibility black arrows U+2B05–U+2B07 were unified with similar Wingdings black arrows, and their representative glyphs were modified to harmonize.
However, the glyph of the Zapf Dingbats arrow U+27A1 was deemed unmodifiable, because its identity is strongly coupled to the original arrow glyph in the ITC Zapf Dingbats typeface. The now-generic black arrows U+2B05–U+2B07 were thus disunified from rotations of U+27A1. A new character, U+2B95 Rightwards Black Arrow, was added with the intention of completing the U+2B05–U+2B07 set; it received a correspondingly matching representative glyph.

## Present issues

The new U+2B95 Rightwards Black Arrow, together with the now-generic U+2B05–U+2B07, is supposed to form a single set of arrows with correspondingly matching representative glyphs, as Mr. Suignard has said. It will take time for U+2B95 to be implemented by new fonts, but "the explicit glyph updates for U+2B00..U+2B0D ... were clearly intentional" (Whistler 2015-05). In other words, according to the Standard since version 7.0, the matching character that is the rightwards version of U+2B05–U+2B07 is now clearly U+2B95, not U+27A1, which has been disunified from the set and is now merely a Zapf Dingbat.

However, this is not yet completely true: UTR #51 and emoji-data.txt currently define the rightwards version of U+2B05–U+2B07 to be the Zapf Dingbats arrow U+27A1. UTR #51 currently does not define U+2B95 to be an emoji character. Furthermore, there are no text/emoji standardized variants of U+2B95 yet, unlike for U+27A1. Upon reviewing the history above, it becomes apparent that this is due to missed timing between the advent of Unicode emoji (in 2011–2013) and the advent of U+2B95 (in 2014). Apple, Google, and Microsoft had no character other than U+27A1 that they could use for the Japanese carriers' rightwards black arrow; at that time U+27A1 was still implicitly unified with the other black arrows.

It seems possible to change the emoji data to match the intended usage of the new U+2B95 more logically. My questions are thus:

1. Should U+2B95 be added to the set of emoji characters as given by UTR #51 and emoji-data.txt, with the intent of completing the 2014 harmonization with U+2B05–U+2B07?

2. If #1's answer is yes, then should U+2B95 be given text/emoji standardized variation sequences, just as U+2B05–U+2B07 already have? (The mechanics are sketched just after this list.)

3. Regardless of the answers to the above, should clarification on the conceptual differences between U+2B95 (the right black arrow completing U+2B05–U+2B07) and U+27A1 (the Zapf Dingbat) be added to their entries in the Standard's code charts? This might clear up a lot of confusion among users and font creators, and would only make clearer what has already been made explicit by 7.0's glyph changes.
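The variation sequences in question 2 are mechanically simple: a sequence is just the base character followed by U+FE0E (VARIATION SELECTOR-15, text presentation) or U+FE0F (VARIATION SELECTOR-16, emoji presentation). A small sketch in Python; treating U+2B95 as participating is hypothetical, since that is exactly what the question proposes.

    # U+FE0E requests text presentation; U+FE0F requests emoji presentation.
    # U+27A1 already has both standardized sequences; giving them to U+2B95
    # is hypothetical here (it is what question 2 proposes).
    ARROWS = {"zapf": "\u27A1", "generic": "\u2B95"}
    for name, base in ARROWS.items():
        text_seq = base + "\uFE0E"    # text-style variation sequence
        emoji_seq = base + "\uFE0F"   # emoji-style variation sequence
        print(name, [f"U+{ord(c):04X}" for c in text_seq + emoji_seq])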
## Possible objections

There are two objections to #1 and #2 that I can foresee.

First, when using emoji, a user might perceive redundancy between an emoji form of U+2B95 and the already existing emoji form of U+27A1, and this might cause confusion over which one to use. However, this redundancy has already existed since Unicode 7.0, when U+2B95 was added in the first place. The Consortium apparently decided at the time that the risk of user confusion between U+2B95 and U+27A1 was worth it in regular-text contexts; I don't see why it would be significantly different in emoji contexts. Vendors' emoji input palettes could simply present only U+2B95, rather than U+27A1, to the user, with little disadvantage.

Second, compatibility mappings with the Japanese carrier sets already use U+27A1, and mappings should generally be stable across versions of Unicode. However, the Unicode emoji data are not yet formally set in stone; there has only been preliminary discussion and the initial publication of UTR #51 (Whistler 2015-10; Davis 2015-10). The mappings with the carrier sets are thus probably not under the same stability guarantees as other formal mappings (and, even if they are, I could find no policy that prohibits modifying formal mappings in general).

In any case, I might make a formal proposal in the future, but I first want to determine how probable it is that such a proposal would be discussed. What would you say the answers to those three questions are?

Sincerely,
J. S. Choi

From petercon at microsoft.com Fri Oct 30 17:03:31 2015
From: petercon at microsoft.com (Peter Constable)
Date: Fri, 30 Oct 2015 22:03:31 +0000
Subject: Latin glottal stop in ID in NWT, Canada
In-Reply-To: <20151030190903.4a97dc45@JRWUBU2>
References: <1786035177.2903.1446107355879.JavaMail.www@wwinf1n04> <20151030190903.4a97dc45@JRWUBU2>
Message-ID:

This is more plausible. The Tlingit peoples live in coastal regions, SW parts of Yukon Territory and Alaska. That's not what I would have referred to as "Northwest Territories". And it's totally unrelated to the thread, which was clearly about the Northwest Territories, not Yukon Territory.

Can you point to information on Tlingit materials in Cyrillic script?

Peter

From richard.wordingham at ntlworld.com Fri Oct 30 18:34:10 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 30 Oct 2015 23:34:10 +0000
Subject: Latin glottal stop in ID in NWT, Canada
In-Reply-To: References: <1786035177.2903.1446107355879.JavaMail.www@wwinf1n04> <20151030190903.4a97dc45@JRWUBU2>
Message-ID: <20151030233410.01374bfb@JRWUBU2>

On Fri, 30 Oct 2015 22:03:31 +0000 Peter Constable wrote:

> This is more plausible. The Tlingit peoples live in coastal regions,
> SW parts of Yukon Territory and Alaska. That's not what I would have
> referred to as "Northwest Territories".

I think Cyrillic got into the thread by mistake.

> Can you point to information on Tlingit materials in Cyrillic script?

Google ('Tlingit Cyrillic') does a better job than me! There's an example linked to from the Wikipedia article https://en.wikipedia.org/wiki/Tlingit_alphabet, 'Indication of the Pathway into the Kingdom of Heaven'. I presume the original spelling has been preserved.

There's an interesting account in 'Russian Orthodox Church Of Alaska And The Aleutian Islands And Its Relation To Native American Traditions: An Attempt At A Multicultural Society, 1794-1912' by Viacheslav Vsevolodovich Ivanov.
It's interesting that much of the action happened under American rule - allegedly, Orthodox Christianity did well because it wasn't American!

Richard.

From richard.wordingham at ntlworld.com Fri Oct 30 18:40:21 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 30 Oct 2015 23:40:21 +0000
Subject: Terminology
In-Reply-To: <165409839.4124.1446110244182.JavaMail.www@wwinf1n04>
References: <1022406759.1183.1445583561682.JavaMail.www@wwinf1h14> <20151023085315.443d56b6@JRWUBU2> <786526406.6308.1445600066983.JavaMail.www@wwinf1h14> <165409839.4124.1446110244182.JavaMail.www@wwinf1n04>
Message-ID: <20151030234021.0ad92851@JRWUBU2>

On Thu, 29 Oct 2015 10:17:23 +0100 (CET) Marcel Schneider wrote:

> But "script" for "writing system" looks good to me, and certainly to
> all people familiar with Unicode thanks to some training.

The concept of an English *script* as opposed to a French *script* is a bad idea. I have recently had occasion to contrast writing systems in the Thai script (for Thai and for Pali respectively), and also to contrast writing systems in the Hebrew script (for Yiddish, and for two systems for Hebrew).

> Thanks to the Unicode Consortium!

TUS 8.0 Section 6.1 Paragraph 2 makes it clear that the two concepts are quite distinct.

Richard.

From verdy_p at wanadoo.fr Fri Oct 30 19:19:14 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sat, 31 Oct 2015 01:19:14 +0100
Subject: On emoji and the two rightwards black arrows
In-Reply-To: References: Message-ID:

IMHO, all mappings from other encodings are just best efforts, not normative. In many cases those mappings are ambiguous, including for some legacy encodings that have been widely used for decades and are still used today (such as CP437).

The reason is that the old registrations for legacy 8-bit charsets only showed charts with approximate glyphs (often of poor quality: low-resolution rendering on printed paper, stray dots of ink, later scanned at low resolution), but no actual properties (and often without even listing names for the characters). For a long time those charts were interpreted differently by different vendors (such as printer or screen manufacturers, at a time when dot-matrix printers and displays had poor resolution), and sometimes the glyphs changed slightly between device models or versions from the same vendor.

So characters in those mapping tables were widely used to mean different variants of characters that are now distinguished in the UCS (e.g., in CP437, the symbol that looks either like a big epsilon or like the "is member of" math symbol; the UCS mappings of other symbols that look like Greek letters in CP437 and similar charsets are not really set in stone, and it is not even clear whether they should map to UCS symbols or to UCS Greek letters; the same applies to various geometric symbols, including arrows and bullets). Those mappings are just there to help convert old documents to the UCS, but the choice is sometimes questionable, and corrections may need to be made to select another character, depending on the context of use. Unfortunately, the existing mapping tables only document mappings from a legacy code position to a single suggested code point, not to its other possible alternatives.

Then we fall into the category of characters that are easily confusable: maybe these mapping tables do not need to be changed, but they should be used together with the data files on confusable characters (a list initiated during the development of IDNA). There are other data available (visible in the Unicode charts) that also indicate a few related or similar characters, but these are mostly notes, not engraved in stone, and this data is difficult to use.

So, in summary, those mapping tables are just suggestions, and implementers may still map legacy encodings to different subsets of the UCS. But we should be concerned about conversion in the other direction, from the UCS to legacy encodings: all candidate UCS code points should be reverse-mapped to the same legacy code position (as much as possible).
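A sketch in Python of that reverse direction, under illustrative assumptions: the CP437 position 0xEE (read variously as a big epsilon or as the "is member of" sign) is taken as the target, and the particular set of UCS alternatives folded onto it is a guess, not a normative mapping.

    # Several UCS code points collapse onto one legacy code position when
    # converting *to* CP437. The alternatives below are illustrative
    # assumptions, not a normative mapping.
    TO_CP437 = {
        0x03B5: 0xEE,  # GREEK SMALL LETTER EPSILON
        0x2208: 0xEE,  # ELEMENT OF
        0x220A: 0xEE,  # SMALL ELEMENT OF
    }

    def encode_cp437_symbols(text, table=TO_CP437, fallback=0x3F):
        # 0x3F is "?", a common fallback for unmappable characters
        return bytes(table.get(ord(ch), fallback) for ch in text)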
Those mapping tables are thus not part of the stable standard, and there is no stability policy about them (IMHO, such a policy should not be adopted). They are just contributions to help the transition to the UCS; they are subject to updates when better mappings are developed later, and some applications or vendors will still develop their own preferences.

If you consider the two UCS characters in question, my opinion is that they are basically the same, and that the mappings from Zapf Dingbats, the DPRK standard, and Wingdings/Webdings are kept for historical reasons but are not necessarily the best ones. I would see no violation of the standard in a font that mapped both UCS characters to exactly the same glyph, using metrics that create a coherent set of black arrows: either the DPRK metrics for all four arrows, or the Zapf Dingbats metrics for all four arrows. Their disunification is not really justified, except to work with applications or documents that used fonts not mapping all of them, made to work only with DPRK-encoded documents or with Dingbats-encoded documents. The disunification is based only on those specific old (defective) fonts; modern fonts should not be defective and should map all of these characters as if they were aliased, without any need to distinguish them visually. But because they are not canonically equivalent, these characters should be explicitly listed in the list of confusables (which version will be preferred, and which versions will be aliased to the preferred form, for applications like IDNA, is a question to work out, as this is a possible security concern if some of these characters are allowed in identifiers intended to be secure).
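A sketch in Python of the folding Philippe describes, for the pair in question; choosing U+27A1 as the preferred form and folding U+2B95 onto it is an illustrative assumption, not something taken from the published confusables data.

    # Map each member of a confusable pair onto one preferred form before
    # comparing identifiers. The pairing below is an illustrative
    # assumption, not taken from the published confusables data.
    FOLD = {"\u2B95": "\u27A1"}  # RIGHTWARDS BLACK ARROW -> BLACK RIGHTWARDS ARROW

    def fold_confusables(s):
        return "".join(FOLD.get(ch, ch) for ch in s)

    assert fold_confusables("go\u2B95") == fold_confusables("go\u27A1")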
From petercon at microsoft.com Sat Oct 31 01:40:45 2015
From: petercon at microsoft.com (Peter Constable)
Date: Sat, 31 Oct 2015 06:40:45 +0000
Subject: Latin glottal stop in ID in NWT, Canada
In-Reply-To: <20151030233410.01374bfb@JRWUBU2>
References: <1786035177.2903.1446107355879.JavaMail.www@wwinf1n04> <20151030190903.4a97dc45@JRWUBU2>, <20151030233410.01374bfb@JRWUBU2>
Message-ID:

The Aleutian Islands are a long way from NWT. I don't associate Tlingit with the Aleutians, and wasn't aware of an early Cyrillic orthography. But it's also not a language of NWT. It's spoken in areas near the coast. My sister lives in Carcross, which is a Tlingit village. This is hundreds of miles from NWT.

Peter

Sent from my IBM 3277/APL

From charupdate at orange.fr Sat Oct 31 11:04:26 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Sat, 31 Oct 2015 17:04:26 +0100 (CET)
Subject: Terminology (is still: Latin glottal stop in ID in NWT, Canada; and referring to: Emoji Proposal: Face With One Eyebrow Raised)
Message-ID: <790085166.11530.1446307466304.JavaMail.www@wwinf1p13>

On Fri, 30 Oct 2015 23:40:21 +0000, Richard Wordingham wrote:

> On Thu, 29 Oct 2015 10:17:23 +0100 (CET) Marcel Schneider wrote:
>
> > But "script" for "writing system" looks good to me, and certainly to
> > all people familiar with Unicode thanks to some training.
>
> The concept of an English *script* as opposed to a French *script* is
> a bad idea.

Today, when talking about the Latin writing system as referred to in TUS (good idea to point to the relevant section!), it makes no more sense to think in terms of national subsets. Globalization brings the necessity of universal Latin keyboard layouts: one (or several, e.g. ergonomic variants) per country, each optimized for *all* the national languages but mutually identical in extension. Covering *all* Latin characters, regardless of their frequency (and of their eventual deprecation), is a working principle. IMHO, implementers should refrain from leading users by the arm. That glottal stops are present on the German T3 keyboard was already mentioned in some way.
We can never insist enough upon the fact that Germany, where no glottal stop is used, has a standardized keyboard enabling users to write *all* official languages that are written in the Latin script. I see no reason (well, I'm back to writing about this) for officials to refuse to update their practice. They don't refuse, in fact. It's just a database and a printer somewhere in the Northwest Territories. Without these two, everybody would be pleased to install an up-to-date keyboard layout and produce a fine birth certificate for anybody, regardless of which letters of the official Latin script it contained.

> I have recently had occasion to contrast writing systems in the Thai
> script (for Thai and for Pali respectively), and also to contrast
> writing systems in the Hebrew script (for Yiddish, and for two systems
> for Hebrew).
>
> > Thanks to the Unicode Consortium!
>
> TUS 8.0 Section 6.1 Paragraph 2 makes it clear that the concepts are
> quite distinct.

The concept of "the writing system of the Latin script", for example, as opposed to "the way a particular language is written" - for example, Chipewyan? When I talked about 'script', I meant 'script' as in 'Latin script'. Chipewyan is written in the Latin script, Bamanankan is written in the Latin script, English is written in the Latin script: either all three of these statements are obvious, or none of them is. In any case, it is no longer acceptable to raise one eyebrow (emoji needed) at the sight of a glottal stop to be put into a birth certificate, a passport, or a credit card. Go buy new database hosts and new printers! In case no other argument is convincing, here is another one: that will boost national IT companies, vendors, and technicians. All good!

Hopeful,
Marcel