From me at ophir.li Fri Apr 1 09:39:31 2022 From: me at ophir.li (Ophir Lifshitz) Date: Fri, 1 Apr 2022 10:39:31 -0400 Subject: Line-breaking algorithm: Unexpected break in multiple consecutive numeric prefixes In-Reply-To: References: Message-ID: Hello again, I hope it's not an issue to re-ask this question I had from a while back. Thanks! On Sun, Sep 19, 2021 at 5:13 AM Ophir Lifshitz wrote: > I have a question about the line-breaking algorithm. Apologies if it > is uninformed or if this is the wrong venue. > > I recently experienced an unexpected line break[1] after the first > character in the following sequence[2]: > > − 2212 MINUS SIGN (line-breaking class PR) > $ 0024 DOLLAR SIGN (line-breaking class PR) > 4 0034 DIGIT FOUR (line-breaking class NU) > 5 0035 DIGIT FIVE (line-breaking class NU) > > (However, if the first character is replaced by 002B PLUS SIGN (also > class PR), a line break does not occur.) > > I also noticed that there is no "PR × PR" rule in (e.g.) LB25. > > Is this intended, perhaps an oversight, or is it up to implementation > discretion i.e. "tailored"? > > If it is an oversight, what is the process for correcting it or filing > a bug? It is hard to find that information on the Unicode website. > > Thank you. > > > [1] The line break appeared in Chrome 93 and Safari 13.1 on Mac 10.13, > but not in Firefox 85. > I tested by navigating in my browser to the following data URIs: > > data:text/html;charset=utf-8,%E2%88%92$45

> data:text/html;charset=utf-8,%2B$45
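(A rough way to see where the published rules put the break opportunity, without involving a browser: the sketch below hard-codes only the line-break classes listed above and the LB25/LB31 pair behavior relevant to them, so it is an illustration of the rules as written, not a real UAX #14 implementation such as ICU. Real engines also apply tailorings of LB25, which may be one reason results differ between browsers.)

classes = {"\u2212": "PR", "$": "PR", "4": "NU", "5": "NU"}

# LB25 gives "no break" for a numeric prefix before a number (PR x NU) and
# inside a number (NU x NU); nothing in LB24/LB25 covers a PR followed by
# another PR, so the fall-through rule LB31 ("break everywhere else")
# supplies a break opportunity between the minus sign and the dollar sign.
no_break_pairs = {("PR", "NU"), ("NU", "NU")}

text = "\u2212$45"
for left, right in zip(text, text[1:]):
    pair = (classes[left], classes[right])
    op = "no break" if pair in no_break_pairs else "break opportunity"
    print(left, right, pair, op)

# prints:
# − $ ('PR', 'PR') break opportunity
# $ 4 ('PR', 'NU') no break
# 4 5 ('NU', 'NU') no break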

> > [2] This sequence is intended to behave as a single unit (word), and > refers to a price discount in the original text. > -------------- next part -------------- An HTML attachment was scrubbed... URL:
From mark at kli.org Fri Apr 1 11:50:47 2022 From: mark at kli.org (Mark E. Shoulson) Date: Fri, 1 Apr 2022 12:50:47 -0400 Subject: Setting Numeric_Value for well-known constants Message-ID: <8ebad448-5c00-64e4-e454-226cdbef11de@shoulson.com> So I was thinking, we have ℎ U+210E PLANCK CONSTANT, and SI has defined an exact value for it (6.62607015?-34 J/s (I think ? is very appropriate and much cooler than "e" for scientific notation)). So it has a numeric value, and we have a Unicode property especially for that, so it seems obvious that we should set the Numeric_Value property of U+210E to 6.62607015?-34. If we're willing to accept limited decimal precision, we can also set the value for ℏ U+210F PLANCK CONSTANT OVER TWO PI. Which of course means we could also do π U+03C0 GREEK SMALL LETTER PI. Actually, setting the value of µ U+00B5 MICRO SIGN to 0.000001 is even more obvious, and of course is precise. The Numeric_Value of φ U+03C6 GREEK SMALL LETTER PHI could be set exactly, as "(sqrt(5)+1)/2", if we allow the use of mathematical expressions in the field, which we clearly do, with expressions such as "1/3" in the Numeric_Value field of ⅓ U+2153 VULGAR FRACTION ONE THIRD, etc. And of course there's ℇ U+2107 EULER CONSTANT, which is also specified as a constant. So, wouldn't it be a good idea to set Numeric_Value fields for all these well-known numbers? (Answer: No. No, it would not be a good idea. See the date.) ~mark
From markus.icu at gmail.com Fri Apr 1 12:21:18 2022 From: markus.icu at gmail.com (Markus Scherer) Date: Fri, 1 Apr 2022 10:21:18 -0700 Subject: Setting Numeric_Value for well-known constants In-Reply-To: <8ebad448-5c00-64e4-e454-226cdbef11de@shoulson.com> References: <8ebad448-5c00-64e4-e454-226cdbef11de@shoulson.com> Message-ID: On Fri, Apr 1, 2022 at 9:54 AM Mark E. Shoulson via Unicode < unicode at corp.unicode.org> wrote: > So, wouldn't it be a good idea to set Numeric_Value fields for all these > well-known numbers? > > (Answer: No. No, it would not be a good idea. See the date.) > Thank you! You had me revved up there for a moment :-) markus -------------- next part -------------- An HTML attachment was scrubbed... URL:
From andy.heninger at gmail.com Fri Apr 1 14:37:57 2022 From: andy.heninger at gmail.com (Andy Heninger) Date: Fri, 1 Apr 2022 12:37:57 -0700 Subject: Re: Clarification on Annex 29, GB12–13 In-Reply-To: References: Message-ID: This is a misunderstanding of the way the break rules are meant to be applied. The rules test for the presence (÷) or absence (×) of a boundary at a single location in the subject text. When there is an extended context, as in GB12 or GB13, the rules do not imply anything about boundaries, or the lack thereof, within that context. Although it is the case for other rules with context, like WB6 and 7, or the various sentence break rules, that there aren't boundaries within the context. This can all get pretty confusing. -- Andy On Thu, Mar 31, 2022 at 7:28 AM Don Hosek via Unicode < unicode at corp.unicode.org> wrote: > Annex 29 says: > > Do not break within emoji flag sequences. That is, do not break between > regional indicator (RI) symbols if there is an odd number of RI characters > before the break point. > > GB12 sot (RI RI)* RI × RI > > GB13 [^RI] (RI RI)* RI ×
RI > > This would seem to indicate that any even number of RI tags should be > treated as a single grapheme so given, e.g., ?????? this should be a > single grapheme rather than the expected three. There is no test in > https://www.unicode.org/Public/14.0.0/ucd/auxiliary/GraphemeBreakTest.txt > that would enforce this however. Or is this just a case of my misreading > the spec and there is an implicit ? after each pair of RI characters? (if > the latter, it might be helpful for future implementors to have a note to > that effect). > > -dh > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at kli.org Fri Apr 1 16:43:04 2022 From: mark at kli.org (Mark E. Shoulson) Date: Fri, 1 Apr 2022 17:43:04 -0400 Subject: Setting Numeric_Value for well-known constants In-Reply-To: References: <8ebad448-5c00-64e4-e454-226cdbef11de@shoulson.com> Message-ID: <6dbd5d3a-4c5a-a81f-5f64-6e238f6408e5@shoulson.com> On 4/1/22 13:21, Markus Scherer via Unicode wrote: > On Fri, Apr 1, 2022 at 9:54 AM Mark E. Shoulson via Unicode > wrote: > > So, wouldn't it be a good idea to set Numeric_Value fields for all > these > well-known numbers? > > (Answer:? No.? No, it would not be a good idea.? See the date.) > > > Thank you! You had me revved up there for a moment :-) > markus Heh!? Glad it was appreciated!? Of course, not sure people telling me "you had me going there" says about me, that people could think I was serious about that.? Either that or it isn't quite *that* ridiculous of an idea.? (Actually, the latter is probably true.? There's a lot of space in ridiculousness; an idea can easily be "not *that* ridiculous" and still be very bad.) ~mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From textexin at xencraft.com Mon Apr 4 18:23:08 2022 From: textexin at xencraft.com (Tex) Date: Mon, 4 Apr 2022 16:23:08 -0700 Subject: global password strategies Message-ID: <004c01d8487a$f244ba30$d6ce2e90$@xencraft.com> What is the modern recommendation for globalization of passwords? 1) If your application (web, mobile, desktop, etc.) is used worldwide, which characters do you allow or restrict? 2) How do you deal with writing direction? My concerns are that confirming and displaying a password might look different depending on how well the browser or OS implements RTL writing direction or features like dir=auto. A user may then not be able to log in because they are instructed to type it in a way that is inconsistent with what they have seen on the screen. 3) Do you allow control or other invisible characters that a user may be used to typing in certain phrases? If these are allowed, how to indicate to the user that they have been used? 4) Also, should passwords be Unicode normalized? Seems damned if you do and if you don?t. Do text input methods generate test the same way or is it possible for a user to create a password on one system and then not be able to log in on another device? (Not normalization related, but I have experienced difficulty logging in to foreign systems, in hotels etc., when the keyboard is different and it takes a while to realize I have to abandon muscle memory and remember the actual password and look for the keys on the keyboard.) -------------- next part -------------- An HTML attachment was scrubbed... URL: From matthias.reitinger at gmx.de Wed Apr 6 20:33:08 2022 From: matthias.reitinger at gmx.de (Matthias Reitinger) Date: Thu, 7 Apr 2022 03:33:08 +0200 Subject: Unqualified vs. 
minimally-qualified emoji Message-ID: <08b3fded-7ae3-81a8-c223-2a878d53d929@gmx.de> Dear all, UTS #51 [1] defines the following terms: ED-17a. qualified emoji character ? An emoji character in a string that (a) has default emoji presentation or (b) is the first character in an emoji modifier sequence or (c) is not a default emoji presentation character, but is the first character in an emoji presentation sequence. ED-18. fully-qualified emoji ? A qualified emoji character, or an emoji sequence in which each emoji character is qualified. ED-18a. minimally-qualified emoji ? An emoji sequence in which the first character is qualified but the sequence is not fully qualified. ED-19. unqualified emoji ? An emoji that is neither fully-qualified nor minimally qualified. With this definitions I would expect the code point sequence 1F441 FE0F 200D 1F5E8 (EYE, VARIATION SELECTOR-16, ZERO WIDTH JOINER, LEFT SPEECH BUBBLE) to be a minimally-qualified emoji: * It is an emoji sequence (ED-17), specifically an emoji zwj sequence (ED-16). * The first character is qualified (ED-17a (c)), because it is the first character in an emoji presentation sequence (ED-9a). * The sequence is not fully qualified (ED-18), because the second emoji character U+1F5E8 is not qualified (it is not a default emoji presentation character, and is not part of an emoji presentation sequence). However, emoji-test.txt [2] lists this sequence as "unqualified". Can someone please explain why? Did I misinterpret the definitions, or is this an error in the emoji-test.txt file? --Matthias [1] https://www.unicode.org/reports/tr51/tr51-21.html [2] https://www.unicode.org/Public/emoji/14.0/emoji-test.txt From duerst at it.aoyama.ac.jp Thu Apr 7 19:37:53 2022 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J=2e_D=c3=bcrst?=) Date: Fri, 8 Apr 2022 09:37:53 +0900 Subject: global password strategies In-Reply-To: <004c01d8487a$f244ba30$d6ce2e90$@xencraft.com> References: <004c01d8487a$f244ba30$d6ce2e90$@xencraft.com> Message-ID: <88843b04-3809-dc47-0f1c-caebdbab0b28@it.aoyama.ac.jp> Hello Tex, I'm surprised I haven't seen any answers to your post yet, I think it's a very interesting and important topic. On 2022-04-05 08:23, Tex via Unicode wrote: > What is the modern recommendation for globalization of passwords? > > > > 1) If your application (web, mobile, desktop, etc.) is used worldwide, which characters do you allow or restrict? I don't have an example of an own application where I made such decisions (in most cases, such decisions are made at a framework/library level). But in Japan at least, nobody expects to use anything other than ASCII in passwords. There are two interrelated reasons for this: 1) Kanji, Hiragana, and Katakana would require conversion, which would mean users have to visually check whether they got the right character. That's not a good idea for passwords. 2) Conversion choices get stored on the user's system to make future choices easier, but that would establish a side channel. An attacker may get access to that data, and when comparing before/after, can narrow down the choices for passwords considerably. I'd expect this to at least apply for Chinese, too. I'd also guess that many password-related libraries restrict input to ASCII. But with the deep penetration of smartphones around the world, the need for non-ASCII passwords is definitely increasing. As we are working on giving people fully non-ASCII email addresses, we shouldn't ignore passwords. > 2) How do you deal with writing direction? 
> > My concerns are that confirming and displaying a password might look different depending on how well the browser or OS implements RTL writing direction or features like dir=auto. A user may then not be able to log in because they are instructed to type it in a way that is inconsistent with what they have seen on the screen. This is definitely a problem, but maybe not such a serious one. On such a system, the user may be used to such inconsistencies. The user knows what characters they intended to typed, in what order. When they do a visual check, they don't need to verify the order, they only need to verify character identity. On smartphone, there are also many password input methods that only show the last character. > > 3) Do you allow control or other invisible characters that a user may be used to typing in certain phrases? If these are allowed, how to indicate to the user that they have been used? I'd just say the less allowed, the better. > 4) Also, should passwords be Unicode normalized? Seems damned if you do and if you don?t. Do text input methods generate test the same way or is it possible for a user to create a password on one system and then not be able to log in on another device? The Mac used to do decomposition (NFD), and Windows uses composition (NFC), at least for file systems. I'm not sure this is still the case. And there are other issues. In Arabic/Persian for example, there are different forms of the letter YEH, with different encodings, for things that may look the same on screen. An Arabic keyboard and a Farsi keyboard may produce different character codes. > (Not normalization related, but I have experienced difficulty logging in to foreign systems, in hotels etc., when the keyboard is different and it takes a while to realize I have to abandon muscle memory and remember the actual password and look for the keys on the keyboard.) The most important point is not "damned if you do and damned if you don't", but "whatever you do, make sure you always do exactly the same thing". This starts way before you get into normalization. For example, do you remove leading/trailing white space? (The user may have copied the password from some text file. (That's not very good security, but some people still do it.)) Another example: Do you always have the same length restriction? I remember a case where I had set a password for a site, and on a sister site, it only worked after I tried to shorten it. What had happened was that when I set it, it got accepted but truncated without telling me, which worked well on the same site because the same truncation happened again. But the sister site didn't truncate, and this produced a mismatch. Make sure you tell people about such issues when they are setting a password, don't just 'fix' things behind the scenes. Also remember that password encryption algorithms work on binary data, not on characters. For ASCII-only, that doesn't usually cause problems, but when working with Unicode, you want to make sure you have a single encoding before the encryption. Please also note that "whatever you do, make sure you always do exactly the same thing" and using libraries or frameworks may not work well together, because different libraries/frameworks may do different things. Regards, Martin. 
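(To make "whatever you do, make sure you always do exactly the same thing" concrete, here is a minimal sketch of a verifier that pins one normalization form and one encoding before hashing. The normalization form, encoding, salt size, and KDF parameters are illustrative assumptions, not a recommendation.)

import hashlib
import os
import unicodedata

def canonicalize(password):
    # Do the same two steps at enrollment and at every later login:
    # 1. one normalization form, so NFC-producing and NFD-producing
    #    keyboards/IMEs yield the same code point sequence;
    # 2. one encoding, so the hash always sees the same bytes.
    return unicodedata.normalize("NFKC", password).encode("utf-8")

def hash_password(password, salt=None):
    salt = salt if salt is not None else os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", canonicalize(password), salt, 600_000)
    return salt, digest

# A decomposed input ("e" + combining acute) and a precomposed "é" hash identically:
salt, stored = hash_password("cafe\u0301")
assert hash_password("caf\u00e9", salt)[1] == stored

(Whatever KDF or framework is actually used, the point is only that normalization and encoding happen in exactly one place, on both the enrollment path and the login path.)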
From asmusf at ix.netcom.com Thu Apr 7 22:18:52 2022 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Thu, 7 Apr 2022 20:18:52 -0700 Subject: global password strategies In-Reply-To: <88843b04-3809-dc47-0f1c-caebdbab0b28@it.aoyama.ac.jp> References: <004c01d8487a$f244ba30$d6ce2e90$@xencraft.com> <88843b04-3809-dc47-0f1c-caebdbab0b28@it.aoyama.ac.jp> Message-ID: <596363ec-1d33-6246-fdc4-1cea4e34264b@ix.netcom.com> An HTML attachment was scrubbed... URL: From textexin at xencraft.com Thu Apr 7 22:30:18 2022 From: textexin at xencraft.com (Tex) Date: Thu, 7 Apr 2022 20:30:18 -0700 Subject: global password strategies In-Reply-To: <88843b04-3809-dc47-0f1c-caebdbab0b28@it.aoyama.ac.jp> References: <004c01d8487a$f244ba30$d6ce2e90$@xencraft.com> <88843b04-3809-dc47-0f1c-caebdbab0b28@it.aoyama.ac.jp> Message-ID: <00a901d84af8$f8cb2f80$ea618e80$@xencraft.com> Thanks for this Martin. Yes, the list is surprisingly quiet. The industry is increasingly expanding to users that have no idea about Latin characters or digits. Also, there are some apps that are serving users that are not computer literate. They may be handed a tablet to enter information or similar scenarios. So they can't be expected to know to not enter Kanji for a password. Restricting to ASCII can make it hard to remember for some users. I liken the experience to network passwords that are lengthy hex strings which are totally unmemorable. The issue of control or other invisible characters is more problematic. If a user requires them for correct spelling restricting them makes the passwords awkward. And the harder problem is to define the correct list of characters to restrict. Yes normalization is problematic too. Yes, writing direction is a lesser problem. My inclination is to do nothing on the software side, no restrictions, no normalization, when displaying passwords setting direction to LTR (for stability and consistency). I think trimming leading and trailing white space is reasonable. Having a length requirement is enforcing good practices that protect both the user and the software provider. And as you say, telling the user (and the developers) whatever you do, make sure you always do exactly the same thing is probably the best we can do. For developers this means make sure you consistently don?t apply any functions between accepting the text and encrypting it. The downside is that some users will have problems when they use a different system or go through upgrades of their input methods. Almost makes me a believer in fingerprint ID, retinal scans, embedded body chips, etc. tex -----Original Message----- From: Martin J. D?rst [mailto:duerst at it.aoyama.ac.jp] Sent: Thursday, April 7, 2022 5:38 PM To: Tex; unicode at corp.unicode.org Subject: Re: global password strategies Hello Tex, I'm surprised I haven't seen any answers to your post yet, I think it's a very interesting and important topic. On 2022-04-05 08:23, Tex via Unicode wrote: > What is the modern recommendation for globalization of passwords? > > > > 1) If your application (web, mobile, desktop, etc.) is used worldwide, which characters do you allow or restrict? I don't have an example of an own application where I made such decisions (in most cases, such decisions are made at a framework/library level). But in Japan at least, nobody expects to use anything other than ASCII in passwords. 
There are two interrelated reasons for this: 1) Kanji, Hiragana, and Katakana would require conversion, which would mean users have to visually check whether they got the right character. That's not a good idea for passwords. 2) Conversion choices get stored on the user's system to make future choices easier, but that would establish a side channel. An attacker may get access to that data, and when comparing before/after, can narrow down the choices for passwords considerably. I'd expect this to at least apply for Chinese, too. I'd also guess that many password-related libraries restrict input to ASCII. But with the deep penetration of smartphones around the world, the need for non-ASCII passwords is definitely increasing. As we are working on giving people fully non-ASCII email addresses, we shouldn't ignore passwords. > 2) How do you deal with writing direction? > > My concerns are that confirming and displaying a password might look different depending on how well the browser or OS implements RTL writing direction or features like dir=auto. A user may then not be able to log in because they are instructed to type it in a way that is inconsistent with what they have seen on the screen. This is definitely a problem, but maybe not such a serious one. On such a system, the user may be used to such inconsistencies. The user knows what characters they intended to typed, in what order. When they do a visual check, they don't need to verify the order, they only need to verify character identity. On smartphone, there are also many password input methods that only show the last character. > > 3) Do you allow control or other invisible characters that a user may be used to typing in certain phrases? If these are allowed, how to indicate to the user that they have been used? I'd just say the less allowed, the better. > 4) Also, should passwords be Unicode normalized? Seems damned if you do and if you don?t. Do text input methods generate test the same way or is it possible for a user to create a password on one system and then not be able to log in on another device? The Mac used to do decomposition (NFD), and Windows uses composition (NFC), at least for file systems. I'm not sure this is still the case. And there are other issues. In Arabic/Persian for example, there are different forms of the letter YEH, with different encodings, for things that may look the same on screen. An Arabic keyboard and a Farsi keyboard may produce different character codes. > (Not normalization related, but I have experienced difficulty logging in to foreign systems, in hotels etc., when the keyboard is different and it takes a while to realize I have to abandon muscle memory and remember the actual password and look for the keys on the keyboard.) The most important point is not "damned if you do and damned if you don't", but "whatever you do, make sure you always do exactly the same thing". This starts way before you get into normalization. For example, do you remove leading/trailing white space? (The user may have copied the password from some text file. (That's not very good security, but some people still do it.)) Another example: Do you always have the same length restriction? I remember a case where I had set a password for a site, and on a sister site, it only worked after I tried to shorten it. What had happened was that when I set it, it got accepted but truncated without telling me, which worked well on the same site because the same truncation happened again. 
But the sister site didn't truncate, and this produced a mismatch. Make sure you tell people about such issues when they are setting a password, don't just 'fix' things behind the scenes. Also remember that password encryption algorithms work on binary data, not on characters. For ASCII-only, that doesn't usually cause problems, but when working with Unicode, you want to make sure you have a single encoding before the encryption. Please also note that "whatever you do, make sure you always do exactly the same thing" and using libraries or frameworks may not work well together, because different libraries/frameworks may do different things. Regards, Martin. From textexin at xencraft.com Thu Apr 7 22:48:50 2022 From: textexin at xencraft.com (Tex) Date: Thu, 7 Apr 2022 20:48:50 -0700 Subject: global password strategies In-Reply-To: <596363ec-1d33-6246-fdc4-1cea4e34264b@ix.netcom.com> References: <004c01d8487a$f244ba30$d6ce2e90$@xencraft.com> <88843b04-3809-dc47-0f1c-caebdbab0b28@it.aoyama.ac.jp> <596363ec-1d33-6246-fdc4-1cea4e34264b@ix.netcom.com> Message-ID: <00ad01d84afb$8fa79180$aef6b480$@xencraft.com> Aren?t keystrokes device dependent, since keyboards vary, physically and virtually? We would have to restrict passwords to the minimal keys that are universal- it is the same problem with a smaller character set. From: Unicode [mailto:unicode-bounces at corp.unicode.org] On Behalf Of Asmus Freytag via Unicode Sent: Thursday, April 7, 2022 8:19 PM To: unicode at corp.unicode.org Subject: Re: global password strategies It sounds to me that a general principle ought to be that passwords should be limited to sequences of "keystrokes", not specific characters. The problem is that what that means is becoming device-dependent. But we don't really want device-dependent password rules? Do we? A./ On 4/7/2022 5:37 PM, Martin J. D?rst via Unicode wrote: Hello Tex, I'm surprised I haven't seen any answers to your post yet, I think it's a very interesting and important topic. On 2022-04-05 08:23, Tex via Unicode wrote: What is the modern recommendation for globalization of passwords? 1) If your application (web, mobile, desktop, etc.) is used worldwide, which characters do you allow or restrict? I don't have an example of an own application where I made such decisions (in most cases, such decisions are made at a framework/library level). But in Japan at least, nobody expects to use anything other than ASCII in passwords. There are two interrelated reasons for this: 1) Kanji, Hiragana, and Katakana would require conversion, which would mean users have to visually check whether they got the right character. That's not a good idea for passwords. 2) Conversion choices get stored on the user's system to make future choices easier, but that would establish a side channel. An attacker may get access to that data, and when comparing before/after, can narrow down the choices for passwords considerably. I'd expect this to at least apply for Chinese, too. I'd also guess that many password-related libraries restrict input to ASCII. But with the deep penetration of smartphones around the world, the need for non-ASCII passwords is definitely increasing. As we are working on giving people fully non-ASCII email addresses, we shouldn't ignore passwords. 2) How do you deal with writing direction? My concerns are that confirming and displaying a password might look different depending on how well the browser or OS implements RTL writing direction or features like dir=auto. 
A user may then not be able to log in because they are instructed to type it in a way that is inconsistent with what they have seen on the screen. This is definitely a problem, but maybe not such a serious one. On such a system, the user may be used to such inconsistencies. The user knows what characters they intended to typed, in what order. When they do a visual check, they don't need to verify the order, they only need to verify character identity. On smartphone, there are also many password input methods that only show the last character. 3) Do you allow control or other invisible characters that a user may be used to typing in certain phrases? If these are allowed, how to indicate to the user that they have been used? I'd just say the less allowed, the better. 4) Also, should passwords be Unicode normalized? Seems damned if you do and if you don?t. Do text input methods generate test the same way or is it possible for a user to create a password on one system and then not be able to log in on another device? The Mac used to do decomposition (NFD), and Windows uses composition (NFC), at least for file systems. I'm not sure this is still the case. And there are other issues. In Arabic/Persian for example, there are different forms of the letter YEH, with different encodings, for things that may look the same on screen. An Arabic keyboard and a Farsi keyboard may produce different character codes. (Not normalization related, but I have experienced difficulty logging in to foreign systems, in hotels etc., when the keyboard is different and it takes a while to realize I have to abandon muscle memory and remember the actual password and look for the keys on the keyboard.) The most important point is not "damned if you do and damned if you don't", but "whatever you do, make sure you always do exactly the same thing". This starts way before you get into normalization. For example, do you remove leading/trailing white space? (The user may have copied the password from some text file. (That's not very good security, but some people still do it.)) Another example: Do you always have the same length restriction? I remember a case where I had set a password for a site, and on a sister site, it only worked after I tried to shorten it. What had happened was that when I set it, it got accepted but truncated without telling me, which worked well on the same site because the same truncation happened again. But the sister site didn't truncate, and this produced a mismatch. Make sure you tell people about such issues when they are setting a password, don't just 'fix' things behind the scenes. Also remember that password encryption algorithms work on binary data, not on characters. For ASCII-only, that doesn't usually cause problems, but when working with Unicode, you want to make sure you have a single encoding before the encryption. Please also note that "whatever you do, make sure you always do exactly the same thing" and using libraries or frameworks may not work well together, because different libraries/frameworks may do different things. Regards, Martin. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From asmusf at ix.netcom.com Thu Apr 7 23:18:34 2022 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Thu, 7 Apr 2022 21:18:34 -0700 Subject: global password strategies In-Reply-To: <00ad01d84afb$8fa79180$aef6b480$@xencraft.com> References: <004c01d8487a$f244ba30$d6ce2e90$@xencraft.com> <88843b04-3809-dc47-0f1c-caebdbab0b28@it.aoyama.ac.jp> <596363ec-1d33-6246-fdc4-1cea4e34264b@ix.netcom.com> <00ad01d84afb$8fa79180$aef6b480$@xencraft.com> Message-ID: <4de421c6-9062-2e26-dca8-4ff2cebd2040@ix.netcom.com> An HTML attachment was scrubbed... URL: From abrahamgross at disroot.org Thu Apr 7 23:31:18 2022 From: abrahamgross at disroot.org (ag disroot) Date: Fri, 8 Apr 2022 04:31:18 +0000 (UTC) Subject: global password strategies In-Reply-To: <4de421c6-9062-2e26-dca8-4ff2cebd2040@ix.netcom.com> References: <004c01d8487a$f244ba30$d6ce2e90$@xencraft.com> <88843b04-3809-dc47-0f1c-caebdbab0b28@it.aoyama.ac.jp> <596363ec-1d33-6246-fdc4-1cea4e34264b@ix.netcom.com> <00ad01d84afb$8fa79180$aef6b480$@xencraft.com> <4de421c6-9062-2e26-dca8-4ff2cebd2040@ix.netcom.com> Message-ID: <6660dd09-cc04-4080-b373-dde769b5d0f0@disroot.org> Since passwords are meant to be typed and not viewed (hence the "?????"), then you can strip /all/ control characters when you process a password. since this control-character-removal function will run on password creation and on login it should be fine -------------- next part -------------- An HTML attachment was scrubbed... URL: From jr at qsm.co.il Fri Apr 8 00:57:25 2022 From: jr at qsm.co.il (Jonathan Rosenne) Date: Fri, 8 Apr 2022 05:57:25 +0000 Subject: global password strategies In-Reply-To: <6660dd09-cc04-4080-b373-dde769b5d0f0@disroot.org> References: <004c01d8487a$f244ba30$d6ce2e90$@xencraft.com> <88843b04-3809-dc47-0f1c-caebdbab0b28@it.aoyama.ac.jp> <596363ec-1d33-6246-fdc4-1cea4e34264b@ix.netcom.com> <00ad01d84afb$8fa79180$aef6b480$@xencraft.com> <4de421c6-9062-2e26-dca8-4ff2cebd2040@ix.netcom.com> <6660dd09-cc04-4080-b373-dde769b5d0f0@disroot.org> Message-ID: The issue has been addressed by NIST, in NIST SP 800-63B DIGITAL IDENTITY GUIDELINES: AUTHENTICATION & LIFECYCLE MANAGEMENT: https://nvlpubs.nist.gov/nistpubs/specialpublications/nist.sp.800-63b.pdf Dated June 2017 5.1.1.2 Memorized Secret Verifiers Verifiers SHALL require subscriber-chosen memorized secrets to be at least 8 characters in length. Verifiers SHOULD permit subscriber-chosen memorized secrets at least 64 characters in length. All printing ASCII [RFC 20] characters as well as the space character SHOULD be acceptable in memorized secrets. Unicode [ISO/ISC 10646] characters SHOULD be accepted as well. To make allowances for likely mistyping, verifiers MAY replace multiple consecutive space characters with a single space character prior to verification, provided that the result is at least 8 characters in length. Truncation of the secret SHALL NOT be performed. For purposes of the above length requirements, each Unicode code point SHALL be counted as a single character. If Unicode characters are accepted in memorized secrets, the verifier SHOULD apply the Normalization Process for Stabilized Strings using either the NFKC or NFKD normalization defined in Section 12.1 of Unicode Standard Annex 15 [UAX 15]. This process is applied before hashing the byte string representing the memorized secret. 
Subscribers choosing memorized secrets containing Unicode characters SHOULD be advised that some characters may be represented differently by some endpoints, which can affect their ability to authenticate successfully. NIST guidelines are widely accepted worldwide, although theoretically ?NIST is responsible for developing information security standards and guidelines, including minimum requirements for federal systems, ?? Personally, I use Hebrew passwords in systems that allow it. Since my passwords are all Hebrew I don?t have directionality concerns. Best Regards, Jonathan Rosenne From: Unicode On Behalf Of ag disroot via Unicode Sent: Friday, April 8, 2022 7:31 AM To: unicode at corp.unicode.org Subject: Re: global password strategies Since passwords are meant to be typed and not viewed (hence the "?????"), then you can strip all control characters when you process a password. since this control-character-removal function will run on password creation and on login it should be fine -------------- next part -------------- An HTML attachment was scrubbed... URL: From textexin at xencraft.com Fri Apr 8 01:42:03 2022 From: textexin at xencraft.com (Tex) Date: Thu, 7 Apr 2022 23:42:03 -0700 Subject: global password strategies In-Reply-To: References: <004c01d8487a$f244ba30$d6ce2e90$@xencraft.com> <88843b04-3809-dc47-0f1c-caebdbab0b28@it.aoyama.ac.jp> <596363ec-1d33-6246-fdc4-1cea4e34264b@ix.netcom.com> <00ad01d84afb$8fa79180$aef6b480$@xencraft.com> <4de421c6-9062-2e26-dca8-4ff2cebd2040@ix.netcom.com> <6660dd09-cc04-4080-b373-dde769b5d0f0@disroot.org> Message-ID: <000601d84b13$c226bc90$467435b0$@xencraft.com> Thanks very much for this, Jonathon. Some of the recommendations in the doc are counterintuitive or surprising to me. I will give it a closer read. tex From: Unicode [mailto:unicode-bounces at corp.unicode.org] On Behalf Of Jonathan Rosenne via Unicode Sent: Thursday, April 7, 2022 10:57 PM To: unicode at corp.unicode.org Subject: RE: global password strategies The issue has been addressed by NIST, in NIST SP 800-63B DIGITAL IDENTITY GUIDELINES: AUTHENTICATION & LIFECYCLE MANAGEMENT: https://nvlpubs.nist.gov/nistpubs/specialpublications/nist.sp.800-63b.pdf Dated June 2017 5.1.1.2 Memorized Secret Verifiers Verifiers SHALL require subscriber-chosen memorized secrets to be at least 8 characters in length. Verifiers SHOULD permit subscriber-chosen memorized secrets at least 64 characters in length. All printing ASCII [RFC 20] characters as well as the space character SHOULD be acceptable in memorized secrets. Unicode [ISO/ISC 10646] characters SHOULD be accepted as well. To make allowances for likely mistyping, verifiers MAY replace multiple consecutive space characters with a single space character prior to verification, provided that the result is at least 8 characters in length. Truncation of the secret SHALL NOT be performed. For purposes of the above length requirements, each Unicode code point SHALL be counted as a single character. If Unicode characters are accepted in memorized secrets, the verifier SHOULD apply the Normalization Process for Stabilized Strings using either the NFKC or NFKD normalization defined in Section 12.1 of Unicode Standard Annex 15 [UAX 15]. This process is applied before hashing the byte string representing the memorized secret. Subscribers choosing memorized secrets containing Unicode characters SHOULD be advised that some characters may be represented differently by some endpoints, which can affect their ability to authenticate successfully. 
NIST guidelines are widely accepted worldwide, although theoretically ?NIST is responsible for developing information security standards and guidelines, including minimum requirements for federal systems, ?? Personally, I use Hebrew passwords in systems that allow it. Since my passwords are all Hebrew I don?t have directionality concerns. Best Regards, Jonathan Rosenne From: Unicode On Behalf Of ag disroot via Unicode Sent: Friday, April 8, 2022 7:31 AM To: unicode at corp.unicode.org Subject: Re: global password strategies Since passwords are meant to be typed and not viewed (hence the "?????"), then you can strip all control characters when you process a password. since this control-character-removal function will run on password creation and on login it should be fine -------------- next part -------------- An HTML attachment was scrubbed... URL: From costello at mitre.org Fri Apr 8 06:22:42 2022 From: costello at mitre.org (Roger L Costello) Date: Fri, 8 Apr 2022 11:22:42 +0000 Subject: Why is pattern-matching of NULs slow? Message-ID: Hi Folks, "Flex" is a tool for tokenizing a string. The Flex manual says this: Pattern-matching of NULs is substantially slower than matching other characters. Is this peculiar to Flex or is pattern-matching NULs slow in all pattern-matching tools? Why would pattern-matching NULs be slower than pattern-matching other characters? /Roger From lyratelle at gmx.de Fri Apr 8 07:23:33 2022 From: lyratelle at gmx.de (Dominikus Dittes Scherkl) Date: Fri, 8 Apr 2022 14:23:33 +0200 Subject: global password strategies In-Reply-To: References: <004c01d8487a$f244ba30$d6ce2e90$@xencraft.com> <88843b04-3809-dc47-0f1c-caebdbab0b28@it.aoyama.ac.jp> <596363ec-1d33-6246-fdc4-1cea4e34264b@ix.netcom.com> <00ad01d84afb$8fa79180$aef6b480$@xencraft.com> <4de421c6-9062-2e26-dca8-4ff2cebd2040@ix.netcom.com> <6660dd09-cc04-4080-b373-dde769b5d0f0@disroot.org> Message-ID: Am 08.04.22 um 07:57 schrieb Jonathan Rosenne via Unicode: > Personally, I use Hebrew passwords in systems that allow it. Since my > passwords are all Hebrew I don?t have directionality concerns. I won't disclose what scripts I use in my passwords, as that would reduce the security level - I well could only use digits then, which nobody would consider a good idea. In fact, I would recommend to use characters from different scripts in a password. But directionality should be no problem at all, as I remember which keys to press in which order. It's not displayed, so why should I care? But I hope that whatever processing is done to the password to come to its hashed value, it should be always the very same processing, so that the same character sequence will result in the same hash on any device. -- Dominikus Dittes Scherkl From jr at qsm.co.il Fri Apr 8 07:46:41 2022 From: jr at qsm.co.il (Jonathan Rosenne) Date: Fri, 8 Apr 2022 12:46:41 +0000 Subject: global password strategies In-Reply-To: References: <004c01d8487a$f244ba30$d6ce2e90$@xencraft.com> <88843b04-3809-dc47-0f1c-caebdbab0b28@it.aoyama.ac.jp> <596363ec-1d33-6246-fdc4-1cea4e34264b@ix.netcom.com> <00ad01d84afb$8fa79180$aef6b480$@xencraft.com> <4de421c6-9062-2e26-dca8-4ff2cebd2040@ix.netcom.com> <6660dd09-cc04-4080-b373-dde769b5d0f0@disroot.org> Message-ID: I doubt a 64 character Hebrew passphrase is easy to hack. Certainly it would not be included in the several million common passwords database that hackers use. Mixed directionality could possibly be a problem, depending on how the application handles text. 
And many applications do allow one to optionally display the password they typed. Best Regards, Jonathan Rosenne -----Original Message----- From: Unicode On Behalf Of Dominikus Dittes Scherkl via Unicode Sent: Friday, April 8, 2022 3:24 PM To: unicode at corp.unicode.org Cc: Dominikus Dittes Scherkl Subject: Re: global password strategies Am 08.04.22 um 07:57 schrieb Jonathan Rosenne via Unicode: > Personally, I use Hebrew passwords in systems that allow it. Since my > passwords are all Hebrew I don?t have directionality concerns. I won't disclose what scripts I use in my passwords, as that would reduce the security level - I well could only use digits then, which nobody would consider a good idea. In fact, I would recommend to use characters from different scripts in a password. But directionality should be no problem at all, as I remember which keys to press in which order. It's not displayed, so why should I care? But I hope that whatever processing is done to the password to come to its hashed value, it should be always the very same processing, so that the same character sequence will result in the same hash on any device. -- Dominikus Dittes Scherkl From haberg-1 at telia.com Fri Apr 8 08:21:43 2022 From: haberg-1 at telia.com (=?utf-8?Q?Hans_=C3=85berg?=) Date: Fri, 8 Apr 2022 15:21:43 +0200 Subject: Why is pattern-matching of NULs slow? In-Reply-To: References: Message-ID: <7D6DB83A-44A9-445D-822C-3CF98405946E@telia.com> > On 8 Apr 2022, at 13:22, Roger L Costello via Unicode wrote: > > "Flex" is a tool for tokenizing a string. The Flex manual says this: > > Pattern-matching of NULs is substantially slower > than matching other characters. > > Is this peculiar to Flex or is pattern-matching NULs slow in all pattern-matching tools? The underlying DFA algorithm treats all symbols equally. So it must have something to do with its implementation. > Why would pattern-matching NULs be slower than pattern-matching other characters? One chooses ones set of symbols, for example, for Unicode one can chose to convert to UTF-8 byte classes instead of using code points. So perhaps in Flex, NUL is not a part of that symbol set and treated specially. Just a wild guess. From wjgo_10009 at btinternet.com Fri Apr 8 05:48:58 2022 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Fri, 8 Apr 2022 11:48:58 +0100 (BST) Subject: global password strategies In-Reply-To: <00a901d84af8$f8cb2f80$ea618e80$@xencraft.com> References: <004c01d8487a$f244ba30$d6ce2e90$@xencraft.com> <88843b04-3809-dc47-0f1c-caebdbab0b28@it.aoyama.ac.jp> <00a901d84af8$f8cb2f80$ea618e80$@xencraft.com> Message-ID: <5b43faa6.3100f.18008ca9e0b.Webtop.102@btinternet.com> Tex wrote: > Also, there are some apps that are serving users that are not computer > literate. They may be handed a tablet to enter information or similar > scenarios. There could be a password input format that is both script-independent and language-independent and platform-independent where the end user is presented with a display of a, say, 8 by 8 grid of emoji and the end user needs to have a password by clicking on of at least eight of them in a sequence, with the only restriction being that no emoji shall be used twice in immediate succession, this to avoid accidental double clicking causing problems. 
The emoji chosen to be used in the format would need to be chosen carefully so as to be clearly distinct from each other whether the display is in colour or in monochrome and taking account of colour vision issues (so not both RED APPLE and GREEN APPLE), non-violent, respectful to cultures and religions, and so that the format could be used both in left to right situations and in right to left situations without ambiguity if characters are mirrored horizontally. If the format were published by Unicode Inc. it could become widely used around the world in some situations. William Overington Friday 8 April 2022 -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathanhwchan at gmail.com Fri Apr 8 13:26:14 2022 From: jonathanhwchan at gmail.com (Jonathan Chan) Date: Fri, 8 Apr 2022 11:26:14 -0700 Subject: Dentistry notation symbols Message-ID: What are code points U+23BE..U+23CC in Miscellaneous Technical used for? ??????????????? They're all named DENTISTRY SYMBOL LIGHT..., and the Standard only says they're for dental notation: *Dental Symbols.* The set of symbols from U+23BE to U+23CC form a set of > symbols from JIS X 0213 for use in dental notation. > According to Wikipedia the first two and the last two are used in Palmer notation , but it doesn't explain what the rest of them are used for. The only historical document I could find with some sort of explanation is document N2195 , but it only explains how they're used and not what they're meant to represent, why they need to exist, or what the circle, triangle, and tilde mean. Based on some cursory searching it doesn't seem like those symbols are standard in modern dental notation either. Jonathan -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at kli.org Fri Apr 8 15:58:40 2022 From: mark at kli.org (Mark E. Shoulson) Date: Fri, 8 Apr 2022 16:58:40 -0400 Subject: global password strategies In-Reply-To: <5b43faa6.3100f.18008ca9e0b.Webtop.102@btinternet.com> References: <004c01d8487a$f244ba30$d6ce2e90$@xencraft.com> <88843b04-3809-dc47-0f1c-caebdbab0b28@it.aoyama.ac.jp> <00a901d84af8$f8cb2f80$ea618e80$@xencraft.com> <5b43faa6.3100f.18008ca9e0b.Webtop.102@btinternet.com> Message-ID: <52f3e650-1c67-5167-44e3-2e7cc5c54005@shoulson.com> That's not a format, it's a user interface ("the user is presented...")? Unicode doesn't standardize user interfaces. Restricting the permissible alphabet to emoji is just about as bad/annoying (for many users) as restricting it to ASCII or Cyrillic or whatever, except that it's more evenly hard on everyone. ~mark On 4/8/22 06:48, William_J_G Overington via Unicode wrote: > Tex wrote: > > > > Also, there are some apps that are serving users that are not > computer literate. They may be handed a tablet to enter information or > similar scenarios. > > There could be a password input format that is both script-independent > and language-independent and platform-independent where the end user > is presented with a display of a, say, 8 by 8 grid of emoji and the > end user needs to have a password by clicking on of at least eight of > them in a sequence, with the only restriction being that no emoji > shall be used twice in immediate succession, this to avoid accidental > double clicking causing problems. 
> > The emoji chosen to be used in the format would need to be chosen > carefully so as to be clearly distinct from each other whether the > display is in colour or in monochrome and taking account of colour > vision issues (so not both RED APPLE and GREEN APPLE), non-violent, > respectful to cultures and religions, and so that the format could be > used both in left to right situations and in right to left situations > without ambiguity if characters are mirrored horizontally. > > If the format were published by Unicode Inc. it could become widely > used around the world in some situations. > > William Overington > > Friday 8 April 2022 > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From philip_chastney at yahoo.com Fri Apr 8 18:24:05 2022 From: philip_chastney at yahoo.com (philip chastney) Date: Fri, 8 Apr 2022 23:24:05 +0000 (UTC) Subject: Dentistry notation symbols In-Reply-To: References: Message-ID: <1696506833.45390.1649460245263@mail.yahoo.com> the symbols in your message are all APL symbols,not dentistry symbols at all /phil On Friday, 8 April 2022, 19:39:06 UTC, Jonathan Chan via Unicode wrote: What are code points U+23BE..U+23CC in Miscellaneous Technical used for? ??????????????? They're all named DENTISTRY SYMBOL LIGHT..., and the Standard only says they're for dental notation: Dental Symbols. The set of symbols from U+23BE to U+23CC form a set of symbols from JIS X 0213 for use in dental notation. According to Wikipedia the first two and the last two are used in Palmer notation, but it doesn't explain what the rest of them are used for. The only historical document I could find with some sort of explanation is?document N2195, but it only explains how they're used and not what they're meant to represent, why they need to exist, or what the circle, triangle, and tilde mean. Based on some cursory searching it doesn't seem like those symbols are standard in modern dental notation either. Jonathan -------------- next part -------------- An HTML attachment was scrubbed... URL: From prosfilaes at gmail.com Fri Apr 8 18:27:33 2022 From: prosfilaes at gmail.com (David Starner) Date: Fri, 8 Apr 2022 18:27:33 -0500 Subject: Why is pattern-matching of NULs slow? In-Reply-To: References: Message-ID: On Fri, Apr 8, 2022 at 6:25 AM Roger L Costello via Unicode wrote: > Why would pattern-matching NULs be slower than pattern-matching other characters? Flex is written in C, and C strings use NUL as a terminator, and can't include NUL. The demand for Flex to handle NULs would be pretty minimal, it's mostly used on text documents that don't have NUL, and so I suspect someone tossed in a hack to make it work with NUL when it had to, and nobody has been back to fix it. It's mature software, without a release in five years, so I don't see that changing. -- The standard is written in English . If you have trouble understanding a particular section, read it again and again and again . . . Sit up straight. Eat your vegetables. Do not mumble. 
-- _Pascal_, ISO 7185 (1991) From prosfilaes at gmail.com Fri Apr 8 18:29:18 2022 From: prosfilaes at gmail.com (David Starner) Date: Fri, 8 Apr 2022 18:29:18 -0500 Subject: Dentistry notation symbols In-Reply-To: <1696506833.45390.1649460245263@mail.yahoo.com> References: <1696506833.45390.1649460245263@mail.yahoo.com> Message-ID: On Fri, Apr 8, 2022 at 6:27 PM philip chastney via Unicode wrote: > > the symbols in your message are all APL symbols, > not dentistry symbols at all They look like APL symbols, but they're not from that block at all, and they're all named "Dentistry Symbol ..." -- The standard is written in English . If you have trouble understanding a particular section, read it again and again and again . . . Sit up straight. Eat your vegetables. Do not mumble. -- _Pascal_, ISO 7185 (1991) From jason.dusek at gmail.com Fri Apr 8 18:32:10 2022 From: jason.dusek at gmail.com (Jason Dusek) Date: Fri, 8 Apr 2022 18:32:10 -0500 Subject: Dentistry notation symbols In-Reply-To: <1696506833.45390.1649460245263@mail.yahoo.com> References: <1696506833.45390.1649460245263@mail.yahoo.com> Message-ID: Consider U+23BE: it is labelled as "Dentistry Symbol Light Vertical and Right". https://www.compart.com/en/unicode/U+23BE Now, I have no idea what these are about; but it is clear that the mystery of the dentistry symbols is not about APL. philip chastney via Unicode schrieb am Fr. 8. Apr. 2022 um 18:26: > the symbols in your message are all APL symbols, > not dentistry symbols at all > > /phil > > On Friday, 8 April 2022, 19:39:06 UTC, Jonathan Chan via Unicode < > unicode at corp.unicode.org> wrote: > > > What are code points U+23BE..U+23CC in Miscellaneous Technical used for? > > ??????????????? > > They're all named DENTISTRY SYMBOL LIGHT..., and the Standard only says > they're for dental notation: > > *Dental Symbols.* The set of symbols from U+23BE to U+23CC form a set of > symbols from JIS X 0213 for use in dental notation. > > > According to Wikipedia the first two and the last two are used in Palmer > notation , but it doesn't > explain what the rest of them are used for. The only historical document I > could find with some sort of explanation is document N2195 > , but it only explains > how they're used and not what they're meant to represent, why they need to > exist, or what the circle, triangle, and tilde mean. Based on some cursory > searching it doesn't seem like those symbols are standard in modern dental > notation either. > > Jonathan > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mandel59 at gmail.com Fri Apr 8 19:35:37 2022 From: mandel59 at gmail.com (Ryusei Yamaguchi) Date: Sat, 9 Apr 2022 09:35:37 +0900 Subject: Dentistry notation symbols In-Reply-To: References: Message-ID: <85EBD71B-43B0-4B17-85A4-A7E13DCFA2FC@gmail.com> I've found some documents about dental formula used in Japan. (I'm not a dentist, so the translation of technical terms may be incorrect.) Dentists in Japan use modified Zsigmondy-Palmer notation. [1] The notation is used in medical receipts or academic papers and preferred because of its intuitiveness. [2] shows a modified notations used in electronic receipts: ?: ??? abutment tooth ???: ??????? ?: ? diastema ? and ? are used for ellipsis across ?? (center) [3][4]. [5] is an app to input dental formula, which supports Unicode format output. [1] ??????????????????????????????? 
(Standardization in Dentistry in Japan?Situation of Overseas and Consideration of Japanese Standard Masters form the Position of International Trends?) https://www.jstage.jst.go.jp/article/jami/34/4/34_183/_article/-char/ja/ [2] ???????????? (Guide to Creating Electronic Receipts) https://www.ssk.or.jp/seikyushiharai/rezept/iryokikan/iryokikan_02.files/jiki_d01.pdf [3] ????????? (About Notation of Dental Formula) http://endai.umin.ac.jp/endai/jss57/jss_sisiki.htm [4] ?????? (About Dental Formula) https://ehiro-shika.com/blog_articles/1604298267.html [5] ???????/?????? https://www.kartemaker.com/shishiki/ Ryusei > 2022/04/09 3:26?Jonathan Chan via Unicode ????: > > What are code points U+23BE..U+23CC in Miscellaneous Technical used for? > > ??????????????? > > They're all named DENTISTRY SYMBOL LIGHT..., and the Standard only says they're for dental notation: > > Dental Symbols. The set of symbols from U+23BE to U+23CC form a set of symbols from JIS X 0213 for use in dental notation. > > According to Wikipedia the first two and the last two are used in Palmer notation , but it doesn't explain what the rest of them are used for. The only historical document I could find with some sort of explanation is document N2195 , but it only explains how they're used and not what they're meant to represent, why they need to exist, or what the circle, triangle, and tilde mean. Based on some cursory searching it doesn't seem like those symbols are standard in modern dental notation either. > > Jonathan -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenwhistler at sonic.net Fri Apr 8 20:07:20 2022 From: kenwhistler at sonic.net (Ken Whistler) Date: Fri, 8 Apr 2022 18:07:20 -0700 Subject: Dentistry notation symbols In-Reply-To: References: Message-ID: <16f3ce10-a144-9f9a-ad9c-9287ba3ab7a4@sonic.net> Jonathan, What you are looking for is WG2 N2093: https://www.unicode.org/wg2/docs/n2093.pdf The proximal cause for the encoding of these characters was compatibility with JIS X 0213. And in WG2 N2093 you can see them cited in the chart "Enufour Gaiji for Dentists". Then on p. 11 of the pdf you can see citations of the corner angle notation claimed as part of the Palmer (1870) method of recording teeth. Then you can see some other citations in Japanese dentistry documents on pp. 13 through 16 of the pdf. The scribbled Japanese next to each of the circled examples on those pages says "example". The little triangle between two teeth seems to refer to a bridge. Many of the characters in the range 23BE..23CC are not actually exemplified in these four pages, but rather only in that Enufour Gaiji for Dentists listing. And no, I can't provide any more interpretation of what each of them is intended to mean. By the way, the two medical records shown on pp. 15 - 16 of the pdf are dated 1997, only a couple years earlier than the 1999 date of WG2 N2093. So this usage of Palmer notation with these symbol extensions was not some completely obsolete convention at the time. --Ken On 4/8/2022 11:26 AM, Jonathan Chan via Unicode wrote: > They're all named DENTISTRY SYMBOL LIGHT..., and the Standard only > says they're for dental notation: > > *Dental Symbols.* The set of symbols from U+23BE to U+23CC form a > set of symbols from JIS X 0213 for use in dental notation. > > > According to Wikipedia the first two and the last two are used in > Palmer notation , but > it doesn't explain what the rest of them are used for. 
The only > historical document I could find with some sort of explanation is > document N2195 , but > it only explains how they're used and not what they're meant to > represent, why they need to exist, or what the circle, triangle, and > tilde mean. Based on some cursory searching it doesn't seem like those > symbols are standard in modern dental notation either. -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Sat Apr 9 03:28:51 2022 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 9 Apr 2022 09:28:51 +0100 Subject: global password strategies In-Reply-To: <00a901d84af8$f8cb2f80$ea618e80$@xencraft.com> References: <004c01d8487a$f244ba30$d6ce2e90$@xencraft.com> <88843b04-3809-dc47-0f1c-caebdbab0b28@it.aoyama.ac.jp> <00a901d84af8$f8cb2f80$ea618e80$@xencraft.com> Message-ID: <20220409092851.043e3b6e@JRWUBU2> On Thu, 7 Apr 2022 20:30:18 -0700 Tex via Unicode wrote: > And as you say, telling the user (and the developers) whatever you > do, make sure you always do exactly the same thing is probably the > best we can do. For developers this means make sure you consistently > don?t apply any functions between accepting the text and encrypting > it. > > The downside is that some users will have problems when they use a > different system or go through upgrades of their input methods. One problem is indeed that some smart keyboards will alter text themselves. I had a keyboard that enabled one to enter much NFC Latin text. I switched keyboard interpreter (from corrected fcitx to ibus) when 32-bit operating systems were discontinued, and now for the same M17N keyboard definition it generates NFD at least some of the time. Richard. From public at khwilliamson.com Sat Apr 9 11:26:34 2022 From: public at khwilliamson.com (Karl Williamson) Date: Sat, 9 Apr 2022 10:26:34 -0600 Subject: Why is pattern-matching of NULs slow? In-Reply-To: References: Message-ID: On 4/8/22 17:27, David Starner via Unicode wrote: > On Fri, Apr 8, 2022 at 6:25 AM Roger L Costello via Unicode > wrote: >> Why would pattern-matching NULs be slower than pattern-matching other characters? > > Flex is written in C, and C strings use NUL as a terminator, and can't > include NUL. The demand for Flex to handle NULs would be pretty > minimal, it's mostly used on text documents that don't have NUL, and > so I suspect someone tossed in a hack to make it work with NUL when it > had to, and nobody has been back to fix it. It's mature software, > without a release in five years, so I don't see that changing. > I can take Perl as a teaching example. Right off the bat it was used for parsing binary data, so had to accept embedded NULs. Things had to be written by hand to duplicate libc functions but allow those NULs. Over the years various libc functions have been added such as memchr(), memmem() that did allow for embedded NULs, and Perl converted to use them on platforms where provided. But there remain many functions that accept only NUL-terminated strings, and so workarounds are used. In some cases, that means re-implementing the libc function in C code. Often the libc version will be implemented in assembly language, making it faster than Perl's C version. A particularly flagrant example is strxfrm() for collating text. Perl did not want to re-implement the complex locale handling that this function handles. 
So, there is a wrapper for it that splits the string into NUL-terminated segments, and plays some shenanigans, all of which take extra cycles. From harjitmoe at outlook.com Sat Apr 9 16:14:38 2022 From: harjitmoe at outlook.com (Harriet Riddle) Date: Sat, 9 Apr 2022 22:14:38 +0100 Subject: global password strategies In-Reply-To: <20220409092851.043e3b6e@JRWUBU2> References: <004c01d8487a$f244ba30$d6ce2e90$@xencraft.com> <88843b04-3809-dc47-0f1c-caebdbab0b28@it.aoyama.ac.jp> <00a901d84af8$f8cb2f80$ea618e80$@xencraft.com> <20220409092851.043e3b6e@JRWUBU2> Message-ID: > One problem is indeed that some smart keyboards will alter text > themselves. I had a keyboard that enabled one to enter much NFC Latin > text. I switched keyboard interpreter (from corrected fcitx to > ibus) when 32-bit operating systems were discontinued, and now for the > same M17N keyboard definition it generates NFD at least some of the > time. Of course, then you come to CJK input methods. My favourite go-to example is (depending on both vendor and locale) that some will generate U+301C 〜 while others will generate U+FF5E ～ for what is supposed to be the same character. Normalisation doesn't help here: U+FF5E NFKCs to U+007E, while U+301C NFKCs to itself. Another example is U+2212 − and U+FF0D －, which was causing enough problems that they had to be explicitly written into W3C/WHATWG standards as converting to the same JIS, even though the WHATWG Encoding Standard doesn't normally include one-way mappings like that, since Apple Japanese IMEs were entering U+2212 and Microsoft ones were entering U+FF0D. Although there are multiple other examples where the two vendors map the same Shift JIS character to different Unicode, and this also affects codepoint choice by their respective IMEs, apparently this particular example was causing problems with systems expecting Shift JIS postal codes or something like that. https://www.w3.org/Bugs/Public/show_bug.cgi?id=28661 Again, normalisation doesn't help there, since U+FF0D NFKCs to U+002D and U+2212 NFKCs to itself. These inter-vendor issues are not unique to Japanese either. Suffice it to say, you don't want your password to fail to match because you typed it on a different operating system. -- Har. From wjgo_10009 at btinternet.com Sat Apr 9 14:47:52 2022 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Sat, 9 Apr 2022 20:47:52 +0100 (BST) Subject: global password strategies In-Reply-To: <52f3e650-1c67-5167-44e3-2e7cc5c54005@shoulson.com> References: <004c01d8487a$f244ba30$d6ce2e90$@xencraft.com> <88843b04-3809-dc47-0f1c-caebdbab0b28@it.aoyama.ac.jp> <00a901d84af8$f8cb2f80$ea618e80$@xencraft.com> <5b43faa6.3100f.18008ca9e0b.Webtop.102@btinternet.com> <52f3e650-1c67-5167-44e3-2e7cc5c54005@shoulson.com> Message-ID: <8fe18c9.32c5a.1800fde5cfa.Webtop.102@btinternet.com> Mark E. Shoulson via Unicode wrote: > That's not a format, it's a user interface ("the user is presented...") Fine, it can be described as a user interface. Better still, as a script-independent and language-independent user interface. > Unicode doesn't standardize user interfaces. I don't know, as I am no expert on categorizing all of the standardization content that Unicode Inc. has published, nor am I aware of any policies that may exist that would prevent Unicode Inc. choosing sixty-four emoji to populate an 8 by 8 grid of emoji and Unicode Inc. publishing it in some form.
> Restricting the permissible alphabet to emoji is just about as > bad/annoying (for many users) as restricting it to ASCII or Cyrillic > or whatever, except that it's more evenly hard on everyone. Quite possibly it would be, yet that is not what I am suggesting. I am suggesting an additional facility that could be a very useful facility to have available for use in some circumstances. There is absolutely nothing in my suggestion that would restrict other user interface systems for passwords from being used. Indeed, on a tablet computer end users could be presented with a choice of how to enter a password, with two or more choices presented each in its own tile, with the tiles presented side by side so that the end user can choose which one to use to set up and use a password. Certainly any individual or organization could choose to select sixty-four emoji to populate an 8 by 8 grid of emoji and publish it, make it open source in a document, or even produce it as a pop art style poster. The poster might end up as an exhibit at MoMA in New York as an example of emoji being applied for a practical purpose. However, for the particular layout to become used in practice by lots of independent app producers, the particular layout needs to have the provenance of being published by a widely-respected standardization body. So I am hoping that Unicode Inc. will take up this idea and publish a particular layout of emoji and some notes about how to use it, doing that either within one of the existing projects or as a stand-alone project as Unicode Inc. decides. William Overington Saturday 9 April 2022 -------------- next part -------------- An HTML attachment was scrubbed... URL: From textexin at xencraft.com Sun Apr 10 01:05:21 2022 From: textexin at xencraft.com (Tex) Date: Sat, 9 Apr 2022 23:05:21 -0700 Subject: global password strategies In-Reply-To: <8fe18c9.32c5a.1800fde5cfa.Webtop.102@btinternet.com> References: <004c01d8487a$f244ba30$d6ce2e90$@xencraft.com> <88843b04-3809-dc47-0f1c-caebdbab0b28@it.aoyama.ac.jp> <00a901d84af8$f8cb2f80$ea618e80$@xencraft.com> <5b43faa6.3100f.18008ca9e0b.Webtop.102@btinternet.com> <52f3e650-1c67-5167-44e3-2e7cc5c54005@shoulson.com> <8fe18c9.32c5a.1800fde5cfa.Webtop.102@btinternet.com> Message-ID: <002801d84ca0$f65c1da0$e31458e0$@xencraft.com> William, There are many problems that having a standard would resolve. Simply stating an incomplete idea and then expecting Unicode Consortium or any other standards body to implement it is an arrogant and unreasonable proposition. To become a standard the idea has to have support from many communities, and it has to be a fit for the organization?s responsibilities. It isn?t clear the grid idea meets the needs of password entry, and isn?t specified in detail. It isn?t clear emoji are needed or optimal for this purpose, compared to just using shapes (triangle up, triangle down, etc.) or for that matter that any images are needed, since it could be select row3 column 4. Ultimately, the password this generates does not need Unicode since the output reduces to a series of row and column pairs. (Which is why this is just an interface.) So if you think this should be a standard, establish the requirements for password entry, show that the proposal satisfies the requirements, find communities that agree and support the idea, and find a standards body that will make it a standard. You might try NIST for example as a standards body that might support a solution. 
You have not acknowledged the requirements for password entry (see the NIST document). Hth Tex From: Unicode [mailto:unicode-bounces at corp.unicode.org] On Behalf Of William_J_G Overington via Unicode Sent: Saturday, April 9, 2022 12:48 PM To: unicode at corp.unicode.org Subject: Re: global password strategies Mark E. Shoulson via Unicode wrote: > That's not a format, it's a user interface ("the user is presented...") Fine, it can be described as a user interface. Better still, as a script-independent and language-independent user interface. > Unicode doesn't standardize user interfaces. I don't know, as I am no expert on categorizing all of the standardization content that Unicode Inc. has published, nor am I am aware of any policies that may exist that would prevent Unicode Inc. choosing sixty-four emoji to populate an 8 by 8 grid of emoji and Unicode Inc. publishing it is some form. > Restricting the permissible alphabet to emoji is just about as bad/annoying (for many users) as restricting it to ASCII or Cyrillic or whatever, except that it's more evenly hard on everyone. Quite possibly it would be, yet that is not what I am suggesting. I am suggesting an additional facility that could be a very useful facility to have available for use in some circumstances. There is absolutely nothing in my suggestion that would restrict other user interface systems for passwords from being used. Indeed, on a tablet computer end users could be presented with a choice of how to enter a password, with two or more choices presented each in its own tile, with the tiles presented side by side so that the end user can choose which one to use to set up and use a password. Certainly any individual or organization could choose to select sixty-four emoji to populate an 8 by 8 grid of emoji and publish it, make it open source in a document, or even produce it as a pop art style poster. The poster might end up as an exhibit at MoMA in New York as an example of emoji being applied for a practical purpose. However, for the particular layout to become used in practice by lots of independent app producers, the particular layout needs to have the provenance of being published by a widely-respected standardization body. So I am hoping that Unicode Inc. will take up this idea and publish a particular layout of emoji and some notes about how to use it, doing that either within one of the existing projects or as a stand-alone project as Unicode Inc. decides. William Overington Saturday 9 April 2022 -------------- next part -------------- An HTML attachment was scrubbed... URL: From calvin at wardbox.co.uk Sun Apr 10 12:03:32 2022 From: calvin at wardbox.co.uk (Calvin Southwood) Date: Sun, 10 Apr 2022 18:03:32 +0100 Subject: Addition of Latin theta as a separate codepoint Message-ID: I'm curious about the current state of adding a Latin theta to the Unicode standard as a separate codepoint. I've seen various proposals that include adding such a character as part of support for certain European languages, but I'm not sure as to the progress of these. My personal interest in a Latin theta stems from the fact that it's the only Greek character used as part of the IPA that doesn't have a separate codepoint for a Latin version of the character. 
From a linguistic perspective this is a barrier to implementing a general abugida that can be represented using the IPA, because many current operating systems and browsers (including Windows and Chrome) will parse a Greek theta next to adjacent Latin characters as separate text runs, and therefore major font rendering engines such as HarfBuzz will never be able to render a ligature between a Greek theta and a Latin vowel as a single character. After an exchange with the maintainers of Harfbuzz, the proposed solutions I've seen are either adding a Latin theta as a separate codepoint, or adding Script_Extensions "Latn Grek" to the Greek theta so that parsing engines that use these extensions for text segmentation can parse a string of Latin characters including Greek theta as a single string. While I am somewhat aware of arguments against this involving disunification, I feel there is a strong argument in favour of such a character from a linguistic and accessibility perspective due to its nature as an IPA character. Can anyone give me some insight as to the current thinking on this issue? Thanks very much, Calvin Southwood -------------- next part -------------- An HTML attachment was scrubbed... URL: From abrahamgross at disroot.org Sun Apr 10 13:34:50 2022 From: abrahamgross at disroot.org (ag disroot) Date: Sun, 10 Apr 2022 18:34:50 +0000 (UTC) Subject: Addition of Latin theta as a separate codepoint In-Reply-To: References: Message-ID: <30c65106-a3ab-41a1-9c22-82e4d1b63470@disroot.org> Speaking of which, katakana, hiragana, and the japanese kanji should all also get a script extension of japanese. because as it stands, harfbuzz seems to think that japanese is a mix of 3 languages. which means that I can't do any sort of ligature or contextual things on japanese bc of that -------------- next part -------------- An HTML attachment was scrubbed... URL: From gwalla at gmail.com Sun Apr 10 14:52:29 2022 From: gwalla at gmail.com (Garth Wallace) Date: Sun, 10 Apr 2022 12:52:29 -0700 Subject: Addition of Latin theta as a separate codepoint In-Reply-To: <30c65106-a3ab-41a1-9c22-82e4d1b63470@disroot.org> References: <30c65106-a3ab-41a1-9c22-82e4d1b63470@disroot.org> Message-ID: What ligature or contextual things are you trying to do? The only cross-script ligature I can think of in Japanese (???) is basically obsolete, and even intra-script ligatures are rare in modern writing. On Sun, Apr 10, 2022 at 11:37 AM ag disroot via Unicode < unicode at corp.unicode.org> wrote: > Speaking of which, katakana, hiragana, and the japanese kanji should all > also get a script extension of japanese. because as it stands, harfbuzz > seems to think that japanese is a mix of 3 languages. which means that I > can't do any sort of ligature or contextual things on japanese bc of that > -------------- next part -------------- An HTML attachment was scrubbed... URL: From abrahamgross at disroot.org Sun Apr 10 15:21:45 2022 From: abrahamgross at disroot.org (ag disroot) Date: Sun, 10 Apr 2022 20:21:45 +0000 (UTC) Subject: Addition of Latin theta as a separate codepoint In-Reply-To: References: <30c65106-a3ab-41a1-9c22-82e4d1b63470@disroot.org> Message-ID: <25f9fb19-0c94-491f-bf75-96a08cbf7953@disroot.org> Thats a good example Heres another example: https://gitlab.com/ekaunt/auto-kyuujitai it was supposed to turn ??? into ???, but the contextual scanning can't go between kanji and hiragana. 
see this issue https://github.com/fontforge/fontforge/issues/4545 -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Mon Apr 11 09:46:25 2022 From: doug at ewellic.org (Doug Ewell) Date: Mon, 11 Apr 2022 08:46:25 -0600 Subject: Addition of Latin theta as a separate codepoint In-Reply-To: <30c65106-a3ab-41a1-9c22-82e4d1b63470@disroot.org> References: <30c65106-a3ab-41a1-9c22-82e4d1b63470@disroot.org> Message-ID: <000a01d84db2$ebb980f0$c32c82d0$@ewellic.org> ag disroot wrote: > Speaking of which, katakana, hiragana, and the japanese kanji should > all also get a script extension of japanese. because as it stands, > harfbuzz seems to think that japanese is a mix of 3 languages. which > means that I can't do any sort of ligature or contextual things on > japanese bc of that Hiragana and Katakana are both used to write only one language, Japanese. If HarfBuzz tries to correlate scripts (which Unicode encodes) to languages (which it doesn't) for rendering purposes, then out of all of the world's scripts, it ought to get these two right. The ISO 15924 code element [Jpan] is defined as "Japanese (alias for Han + Hiragana + Katakana)". It is intended for Japanese texts written in a combination of these three scripts, as most Japanese texts are, not for characters in isolation. It seems wrong for Unicode to identify kana as "kanji plus kana" and also identify kanji as "kanji plus kana" to try to fix one broken implementation. There are decades of precedent that Unicode does not do that. -- Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org From andy.heninger at gmail.com Mon Apr 11 16:31:55 2022 From: andy.heninger at gmail.com (Andy Heninger) Date: Mon, 11 Apr 2022 14:31:55 -0700 Subject: Line-breaking algorithm: Unexpected break in multiple consecutive numeric prefixes In-Reply-To: References: Message-ID: I tried the sequences you identified against ICU line breaking, ? 2212 MINUS SIGN (line-breaking class PR) ?$ 0024 DOLLAR SIGN (line-breaking class PR) ?4 0034 DIGIT FOUR (line-breaking class NU) ?5 0035 DIGIT FIVE (line-breaking class NU) and + 002B PLUS SIGN (line-breaking class PR) ?$ 0024 DOLLAR SIGN (line-breaking class PR) ?4 0034 DIGIT FOUR (line-breaking class NU) ?5 0035 DIGIT FIVE (line-breaking class NU) In both cases there was a boundary after the first character (? or +), which is consistent with the UAX-14 rules. Whether this is desirable or not is a separate question. Perhaps Safari has done some additional tailoring of the rules in question. For what it's worth, for Numbers, ICU uses the full regular expression ( PR | PO ) ? ( OP | HY ) ? NU (NU | SY | IS ) * (CL | CP ) ? ( PR | PO ) ? instead of the short fragments of rules from LB24 and LB25. The main difference is that a "number" sequence must contain at least one NU character. -- Andy On Fri, Apr 1, 2022 at 8:38 AM Ophir Lifshitz via Unicode < unicode at corp.unicode.org> wrote: > Hello again, > > I hope it's not an issue to re-ask this question I had from a while back. > > Thanks! > > On Sun, Sep 19, 2021 at 5:13 AM Ophir Lifshitz wrote: > >> I have a question about the line-breaking algorithm. Apologies if it >> is uninformed or if this is the wrong venue. >> >> I recently experienced an unexpected line break[1] after the first >> character in the following sequence[2]: >> >> ?? 
2212 MINUS SIGN (line-breaking class PR) >> ?$ 0024 DOLLAR SIGN (line-breaking class PR) >> ?4 0034 DIGIT FOUR (line-breaking class NU) >> ?5 0035 DIGIT FIVE (line-breaking class NU) >> >> (However, if the first character is replaced by 002B PLUS SIGN (also >> class PR), a line break does not occur.) >> >> I also noticed that there is no "PR ? PR" rule in (e.g.) LB25. >> >> Is this intended, perhaps an oversight, or is it up to implementation >> discretion i.e. "tailored"? >> >> If it is an oversight, what is the process for correcting it or filing >> a bug? It is hard to find that information on the Unicode website. >> >> Thank you. >> >> >> [1] The line break appeared in Chrome 93 and Safari 13.1 on Mac 10.13, >> but not in Firefox 85. >> I tested by navigating in my browser to the following data URIs: >> >> data:text/html;charset=utf-8,%E2%88%92$45

>> data:text/html;charset=utf-8,%2B$45

>> >> [2] This sequence is intended to behave as a single unit (word), and >> refers to a price discount in the original text. >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Mon Apr 11 14:33:36 2022 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Mon, 11 Apr 2022 20:33:36 +0100 (BST) Subject: global password strategies In-Reply-To: <002801d84ca0$f65c1da0$e31458e0$@xencraft.com> References: <004c01d8487a$f244ba30$d6ce2e90$@xencraft.com> <88843b04-3809-dc47-0f1c-caebdbab0b28@it.aoyama.ac.jp> <00a901d84af8$f8cb2f80$ea618e80$@xencraft.com> <5b43faa6.3100f.18008ca9e0b.Webtop.102@btinternet.com> <52f3e650-1c67-5167-44e3-2e7cc5c54005@shoulson.com> <8fe18c9.32c5a.1800fde5cfa.Webtop.102@btinternet.com> <002801d84ca0$f65c1da0$e31458e0$@xencraft.com> Message-ID: <4bf6e9fc.3568d.1801a1e0419.Webtop.102@btinternet.com> Tex wrote: > There are many problems that having a standard would resolve. Yes. > Simply stating an incomplete idea and then expecting Unicode > Consortium or any other standards body to implement it is an arrogant > and unreasonable proposition. Well, I never wrote that I expected anything. I wrote "So I am hoping ...", I simply put forward what seems to me a good idea that could be very useful in some circumstances, > To become a standard the idea has to have support from many > communities, and it has to be a fit for the organization?s > responsibilities. If Unicode Inc. were to specify a specific choice of 64 emoji set out in an 8 by 8 array, then it would be a de facto standard which people could use or not use as they chose, with no concern that the specific layout were proprietary and that someone or some organization might come along later and request royalties for using that particular layout. > It isn?t clear emoji are needed or optimal for this purpose, compared > to just using shapes (triangle up, triangle down, etc.) or for that > matter that any images are needed, since it could be select row3 > column 4. I am not suggesting emoji to the exclusion of other possibilities. For me, using emoji has the advantage that the pictures are mostly of everyday things, so someone would possibly or even probably know for each picture the word to describe the picture in the language that he or she uses. > Ultimately, the password this generates does not need Unicode since > the output reduces to a series of row and column pairs. (Which is why > this is just an interface.) Well, I was not thinking of the output being a series of row and column pairs, I have, and am, thinking of the output being a sequence of Unicode characters, the 8 by 8 array of emoji being just as a way for an end user to enter a sequence of Unicode characters as if an end user enters a sequence of Unicode characters as a password in a text box. Indeed perhaps there could be a text box like display below the 8 by 8 array and as the emoji are clicked the text box fills up, either with dots or an emoji display, depending whether the text box is in Hide mode or Show mode. The 8 by 8 array method of password entry would just work in parallel with the conventional text box method of password entry. It is sort of like how a built in keyboard on a laptop computer can work in parallel with an external keyboard. 
> So if you think this should be a standard, establish the requirements > for password entry, show that the proposal satisfies the requirements, > find communities that agree and support the idea, and find a standards > body that will make it a standard. Well, I opine that it could be helpful in some circumstances if a particular layout of 64 emoji in an 8 by 8 array, designed to facilitate password entry in a manner not linked to any particular script or any particular language, were to be published by Unicode Inc.: an app developer would then have a list available to use if so desired, and if various producers of apps were to use the same particular layout, that could be helpful to end users. What I am suggesting is just a simple sort of gadget to metaphorically bolt on to an existing password entry system to give the existing method an extra way for an end user to set up a password and to enter a previously set up password. > You have not acknowledged the requirements for password entry (see the > NIST document). Well, I did to the extent that I mentioned a minimum of eight characters in a password. This method of entry would produce the possibility of 64 times 63 to the power of 7 possible passwords for eight character passwords alone. For longer passwords there would be many more possibilities. I have put forward an idea that I opine could be very useful in some circumstances. I hope that it gets implemented. Although I could publish a particular suggested layout of 64 emoji in an 8 by 8 array myself, I consider that such a layout may never be taken up by app producers, yet if a suggested layout of 64 emoji in an 8 by 8 array were published by Unicode Inc. then it might well be taken up by many app producers and be of practical benefit to some end users. William Overington Monday 11 April 2022 -------------- next part -------------- An HTML attachment was scrubbed... URL: From jr at qsm.co.il Tue Apr 12 11:59:24 2022 From: jr at qsm.co.il (Jonathan Rosenne) Date: Tue, 12 Apr 2022 16:59:24 +0000 Subject: global password strategies In-Reply-To: <4bf6e9fc.3568d.1801a1e0419.Webtop.102@btinternet.com> References: <004c01d8487a$f244ba30$d6ce2e90$@xencraft.com> <88843b04-3809-dc47-0f1c-caebdbab0b28@it.aoyama.ac.jp> <00a901d84af8$f8cb2f80$ea618e80$@xencraft.com> <5b43faa6.3100f.18008ca9e0b.Webtop.102@btinternet.com> <52f3e650-1c67-5167-44e3-2e7cc5c54005@shoulson.com> <8fe18c9.32c5a.1800fde5cfa.Webtop.102@btinternet.com> <002801d84ca0$f65c1da0$e31458e0$@xencraft.com> <4bf6e9fc.3568d.1801a1e0419.Webtop.102@btinternet.com> Message-ID: Designing password strategies is a science and expertise in itself. It is not easy. I am involved in implementing such designs but not in the design. The emoji suggestion does not meet the NIST recommendations at least in the following points: * An emoji is a character just as A, B or C. 8 emojis are 8 characters, which is too short. It is strange that I would feel the need to say so in this forum. * It would be difficult to remember a long non-trivial sequence of emojis. The recommendation is a phrase. My personal preference is a long phrase I can easily remember into which I introduce an error in order to baffle dictionary attacks. For example: “Lorem ipsum dolor sit amet, consequetur adipiscing elit” * NIST recommends allowing and using the whole range of Unicode rather than any subset.
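To put rough numbers on the first point: eight presses on a 64-symbol grid, with the proposed rule that no symbol is repeated twice in a row, allow 64 × 63^7 sequences, about 2.5 × 10^14 or roughly 48 bits. The arithmetic is easy to check (illustrative only; the figures are not taken from the NIST guidance):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* 8 selections from a 64-symbol grid, no immediate repeats. */
        double sequences = 64.0 * pow(63.0, 7.0);
        printf("%.3g sequences, about %.0f bits\n",
               sequences, log2(sequences));   /* prints ~2.52e+14 and 48 */
        return 0;
    }

That is essentially the strength of an eight-character alphanumeric password, which is the sense in which eight emoji are too short.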
Best Regards, Jonathan Rosenne From: Unicode On Behalf Of William_J_G Overington via Unicode Sent: Monday, April 11, 2022 10:34 PM To: unicode at corp.unicode.org Subject: RE: global password strategies Tex wrote: > There are many problems that having a standard would resolve. Yes. > Simply stating an incomplete idea and then expecting Unicode Consortium or any other standards body to implement it is an arrogant and unreasonable proposition. Well, I never wrote that I expected anything. I wrote "So I am hoping ...", I simply put forward what seems to me a good idea that could be very useful in some circumstances, > To become a standard the idea has to have support from many communities, and it has to be a fit for the organization?s responsibilities. If Unicode Inc. were to specify a specific choice of 64 emoji set out in an 8 by 8 array, then it would be a de facto standard which people could use or not use as they chose, with no concern that the specific layout were proprietary and that someone or some organization might come along later and request royalties for using that particular layout. > It isn?t clear emoji are needed or optimal for this purpose, compared to just using shapes (triangle up, triangle down, etc.) or for that matter that any images are needed, since it could be select row3 column 4. I am not suggesting emoji to the exclusion of other possibilities. For me, using emoji has the advantage that the pictures are mostly of everyday things, so someone would possibly or even probably know for each picture the word to describe the picture in the language that he or she uses. > Ultimately, the password this generates does not need Unicode since the output reduces to a series of row and column pairs. (Which is why this is just an interface.) Well, I was not thinking of the output being a series of row and column pairs, I have, and am, thinking of the output being a sequence of Unicode characters, the 8 by 8 array of emoji being just as a way for an end user to enter a sequence of Unicode characters as if an end user enters a sequence of Unicode characters as a password in a text box. Indeed perhaps there could be a text box like display below the 8 by 8 array and as the emoji are clicked the text box fills up, either with dots or an emoji display, depending whether the text box is in Hide mode or Show mode. The 8 by 8 array method of password entry would just work in parallel with the conventional text box method of password entry. It is sort of like how a built in keyboard on a laptop computer can work in parallel with an external keyboard. > So if you think this should be a standard, establish the requirements for password entry, show that the proposal satisfies the requirements, find communities that agree and support the idea, and find a standards body that will make it a standard. Well, I opine that it could be helpful in some circumstances if a particular layout of 64 emoji in an 8 by 8 array so as to facilitate password entry in a manner not linked to any particular script or an particular language were to become published by Unicode Inc. as an app developer would have a list available to use if so desired and if various producers of apps were to use the same particular layout that that could be helpful to end users. What I am suggesting is just a simple sort of gadget to metaphorically bolt on to an existing password entry system to give the existing method an extra way for an end user to set up a password and to enter a previously set up password. 
> You have not acknowledged the requirements for password entry (see the NIST document). Well, I did to the extent in that I mentioned a minimum of eight characters in a password. This method of entry would produce the possibility of 64 times 63 to the power of 7 possible passwords for eight character passwords alone. For longer passwords there would be many more possibilities. I have put forward an idea that I opine could be very useful in some circumstances. I hope that it gets implemented. Although I could publish a particular suggested layout of 64 emoji in an 8 by 8 array myself, I consider that such a layout may never be taken up by app producers, yet if a suggested layout of 64 emoji in an 8 by 8 array were published by Unicode Inc. then it might well be taken up by many app producers and be of practical benefit to some end users. William Overington Monday 11 April 2022 -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmusf at ix.netcom.com Tue Apr 12 12:49:49 2022 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Tue, 12 Apr 2022 10:49:49 -0700 Subject: global password strategies In-Reply-To: References: <004c01d8487a$f244ba30$d6ce2e90$@xencraft.com> <88843b04-3809-dc47-0f1c-caebdbab0b28@it.aoyama.ac.jp> <00a901d84af8$f8cb2f80$ea618e80$@xencraft.com> <5b43faa6.3100f.18008ca9e0b.Webtop.102@btinternet.com> <52f3e650-1c67-5167-44e3-2e7cc5c54005@shoulson.com> <8fe18c9.32c5a.1800fde5cfa.Webtop.102@btinternet.com> <002801d84ca0$f65c1da0$e31458e0$@xencraft.com> <4bf6e9fc.3568d.1801a1e0419.Webtop.102@btinternet.com> Message-ID: <9bd09728-361c-0bff-f03d-88db46e6afd2@ix.netcom.com> An HTML attachment was scrubbed... URL: From pgcon6 at msn.com Wed Apr 13 18:18:20 2022 From: pgcon6 at msn.com (Peter Constable) Date: Wed, 13 Apr 2022 23:18:20 +0000 Subject: Weird Unicode math symbols (XKCD) Message-ID: https://xkcd.com/2606 Peter -------------- next part -------------- An HTML attachment was scrubbed... URL: From pgcon6 at msn.com Wed Apr 13 18:33:51 2022 From: pgcon6 at msn.com (Peter Constable) Date: Wed, 13 Apr 2022 23:33:51 +0000 Subject: global password strategies In-Reply-To: <52f3e650-1c67-5167-44e3-2e7cc5c54005@shoulson.com> References: <004c01d8487a$f244ba30$d6ce2e90$@xencraft.com> <88843b04-3809-dc47-0f1c-caebdbab0b28@it.aoyama.ac.jp> <00a901d84af8$f8cb2f80$ea618e80$@xencraft.com> <5b43faa6.3100f.18008ca9e0b.Webtop.102@btinternet.com> <52f3e650-1c67-5167-44e3-2e7cc5c54005@shoulson.com> Message-ID: Mark?s right, it?s a user interface? but one specifically intended for creating something equivalent to a password: a sequence of user actions that is able to provide high entropy relative to the set of possible sequences. And Windows has had this type of authentication UI since Windows 8: it?s called ?picture password?. [cid:image001.png at 01D84F54.374D4E50] Peter From: Unicode On Behalf Of Mark E. Shoulson via Unicode Sent: Friday, April 8, 2022 1:59 PM To: unicode at corp.unicode.org Subject: Re: global password strategies That's not a format, it's a user interface ("the user is presented...") Unicode doesn't standardize user interfaces. Restricting the permissible alphabet to emoji is just about as bad/annoying (for many users) as restricting it to ASCII or Cyrillic or whatever, except that it's more evenly hard on everyone. ~mark On 4/8/22 06:48, William_J_G Overington via Unicode wrote: Tex wrote: > Also, there are some apps that are serving users that are not computer literate. 
They may be handed a tablet to enter information or similar scenarios. There could be a password input format that is both script-independent and language-independent and platform-independent where the end user is presented with a display of a, say, 8 by 8 grid of emoji and the end user needs to have a password by clicking on at least eight of them in a sequence, with the only restriction being that no emoji shall be used twice in immediate succession, this to avoid accidental double clicking causing problems. The emoji chosen to be used in the format would need to be chosen carefully so as to be clearly distinct from each other whether the display is in colour or in monochrome and taking account of colour vision issues (so not both RED APPLE and GREEN APPLE), non-violent, respectful to cultures and religions, and so that the format could be used both in left to right situations and in right to left situations without ambiguity if characters are mirrored horizontally. If the format were published by Unicode Inc. it could become widely used around the world in some situations. William Overington Friday 8 April 2022 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 5776 bytes Desc: image001.png URL: From haberg-1 at telia.com Thu Apr 14 03:24:50 2022 From: haberg-1 at telia.com (=?utf-8?Q?Hans_=C3=85berg?=) Date: Thu, 14 Apr 2022 10:24:50 +0200 Subject: Weird Unicode math symbols (XKCD) In-Reply-To: References: Message-ID: <2B6312C6-56C9-43C9-BC57-A535F54CEC2F@telia.com> Here are the correct Unicode names:
U+29CD ⧍ TRIANGLE WITH SERIFS AT BOTTOM
U+23E7 ⏧ ELECTRICAL INTERSECTION
U+2A33 ⨳ SMASH PRODUCT
U+2A7C ⩼ GREATER-THAN WITH QUESTION MARK ABOVE
U+299E ⦞ ANGLE WITH S INSIDE
U+2A04 ⨄ N-ARY UNION OPERATOR WITH PLUS
U+2B48 ⭈ RIGHTWARDS ARROW ABOVE REVERSE ALMOST EQUAL TO
U+225D ≝ EQUAL TO BY DEFINITION
U+237C ⍼ RIGHT ANGLE WITH DOWNWARDS ZIGZAG ARROW
U+2A50 ⩐ CLOSED UNION WITH SERIFS AND SMASH PRODUCT
U+2A69 ⩩ TRIPLE HORIZONTAL BAR WITH TRIPLE VERTICAL STROKE
U+2368 ⍨ APL FUNCTIONAL SYMBOL TILDE DIAERESIS
U+2118 ℘ SCRIPT CAPITAL P
U+2AC1 ⫁ SUBSET WITH MULTIPLICATION SIGN BELOW
U+232D ⌭ CYLINDRICITY
U+2A13 ⨓ LINE INTEGRATION WITH SEMICIRCULAR PATH AROUND POLE
U+2A0B ⨋ SUMMATION WITH INTEGRAL
> On 14 Apr 2022, at 01:18, Peter Constable via Unicode wrote: > > https://xkcd.com/2606 > > > Peter From wjgo_10009 at btinternet.com Thu Apr 14 07:31:24 2022 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Thu, 14 Apr 2022 13:31:24 +0100 (BST) Subject: global password strategies In-Reply-To: References: <004c01d8487a$f244ba30$d6ce2e90$@xencraft.com> <88843b04-3809-dc47-0f1c-caebdbab0b28@it.aoyama.ac.jp> <00a901d84af8$f8cb2f80$ea618e80$@xencraft.com> <5b43faa6.3100f.18008ca9e0b.Webtop.102@btinternet.com> <52f3e650-1c67-5167-44e3-2e7cc5c54005@shoulson.com> Message-ID: <1dac4f7f.3a073.180280e8f6d.Webtop.102@btinternet.com> Peter Constable wrote: > And Windows has had this type of authentication UI since Windows 8: > it's called “picture password”. Until I read that sentence I had never known of "picture password". As far as I can tell the nearest presently existing Microsoft facility on my laptop computer that is running Windows 10 to what I am suggesting is the on-screen keyboard facility. 
The facility that I am suggesting would work in a similar way to the Microsoft on-screen keyboard in that the response to a click on a glyph would output a Unicode character. The difference being that each of the characters available would be an emoji character, thus a script-independent and language independent character. Maybe a copy of the source code of the Microsoft on-screen keyboard could be adapted to produce what I am suggesting, just changing the layout and changing the set of characters that can be keyed. As I cannot do that myself I am not going to speculate upon how difficult that would be to do by a person expert in that type of programming. William Overington Thursday 14 April 2022 -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at kli.org Thu Apr 14 17:18:27 2022 From: mark at kli.org (Mark E. Shoulson) Date: Thu, 14 Apr 2022 18:18:27 -0400 Subject: global password strategies In-Reply-To: <1dac4f7f.3a073.180280e8f6d.Webtop.102@btinternet.com> References: <004c01d8487a$f244ba30$d6ce2e90$@xencraft.com> <88843b04-3809-dc47-0f1c-caebdbab0b28@it.aoyama.ac.jp> <00a901d84af8$f8cb2f80$ea618e80$@xencraft.com> <5b43faa6.3100f.18008ca9e0b.Webtop.102@btinternet.com> <52f3e650-1c67-5167-44e3-2e7cc5c54005@shoulson.com> <1dac4f7f.3a073.180280e8f6d.Webtop.102@btinternet.com> Message-ID: There might be something to this.? Some semi-standardized set of keyboard layouts and input methods that can be immediately and temporarily activated in nearly any state of the computer, by a well-known keystroke or menu or whatever, so you need to know how to type your password in *one* of them and how to select it, and then you can just click it out on an emulated keyboard onscreen (or better, have the keyboard actually remap itself.) Kind of tricky to get the details right and decide on what and how and all; the market and vendors will probably have to converge (slowly) on some consensus.? No, Unicode can't dictate this as a standard, it's out of scope for Unicode, and pretty much every other standards organization too: there is no standard dictating user interface.? Just some popular conventions that have become fairly universal. ~mark On 4/14/22 08:31, William_J_G Overington via Unicode wrote: > Peter Constable wrote: > > > And Windows has had this type of authentication UI since Windows 8: > it?s called ?picture password?. > > Until I read that sentence I had never known of "picture password". > > As far as I can tell the nearest presently existing Microsoft facility > on my laptop computer that is running Windows 10 to what I am > suggesting is the on-screen keyboard facility. > > The facility that I am suggesting would work in a similar way to the > Microsoft on-screen keyboard in that the response to a click on a > glyph would output a Unicode character. The difference being that each > of the characters available would be an emoji character, thus a > script-independent and language independent character. > > Maybe a copy of the source code of the Microsoft on-screen keyboard > could be adapted to produce what I am suggesting, just changing the > layout and changing the set of characters that can be keyed. > > As I cannot do that myself I am not going to speculate upon how > difficult that would be to do by a person expert in that type of > programming. > > William Overington > > Thursday 14 April 2022 > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From costello at mitre.org Fri Apr 15 07:46:51 2022 From: costello at mitre.org (Roger L Costello) Date: Fri, 15 Apr 2022 12:46:51 +0000 Subject: How the C programming language bridges the man-machine gap Message-ID: Hi Folks, I learned two neat things this week. First, I learned how C bridges the gap between man and machine. The C programming language allows you to create a char variable: char ch; and assign it a character: ch = 'a'; Interestingly, you can do addition and subtraction on the char variable. That always puzzled me. Until today. Humans interact with the computer (mostly) using characters. The computer, however, (mostly) uses numbers. How does the C programming language bridge the gap between human-friendly characters and machine-friendly numbers? Answer: by allowing addition and subtraction on char variables. For example, what is this: 1 You might answer: It is the number 1. And you would be wrong. It is a character not a number. Programs are written using characters. To instruct the computer that we really mean the number 1 not the character 1, we must convert the character 1 to the number 1. That's where addition and subtraction of characters enters. (As you know) Every character is represented inside the computer as a number. To convert the character 1 to the number 1 we subtract the numeric representation of the character 1 from the numeric representation of the character 0: int val = ch - '0'; To convert the sequence of characters 123 to the number 123 we iterate through each character, converting the character to a number, and then multiplying by 1 or 10 or 100, etc. depending on the position in the string: while ( ch = getchar() ) val = val * 10 + ch - '0'; That little loop bridges the gap between the human-friendly string 123 and the machine-friendly number 123. Neat! The second thing I learned this week is to liberally sprinkle assertions throughout a program. Oddly enough, this is why mastery of Boolean logic is essential. In his book The Science of Programming, David Gries has a great example of converting degrees Fahrenheit to degrees Celsius. He shows the value of adding assertions throughout the code to ensure that the code behaves as expected. He says that using assertions provide a way to "develop a program and its proof hand-in-hand, with the proof ideas leading the way!" When I went to school the computer science department emphasized learning Boolean logic. While I enjoyed learning Boolean logic, it puzzled me why there would be such emphasis on it. As I was reading David Gries' book the answer dawned on me: assertions are Boolean expressions; master Boolean expressions and it will take you a long way toward writing provably correct programs. Neat! Well, those are the two revelations for me this week. I realize they are pretty basic. Sometimes I don't fully internalize a concept until I see the right use case. This week I saw the right use case. 
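A self-contained version of that loop, with an end-of-input check and a digit test added (the fragment above stops only when a NUL byte is read and does no digit checking), might look like the following sketch:

    #include <ctype.h>
    #include <stdio.h>

    int main(void)
    {
        int ch;
        long val = 0;

        /* Accumulate decimal digits from standard input; stop at EOF or
           at the first non-digit. No overflow check is attempted here. */
        while ((ch = getchar()) != EOF && isdigit(ch))
            val = val * 10 + (ch - '0');   /* relies on '0'..'9' being consecutive */

        printf("%ld\n", val);
        return 0;
    }

The subtraction ch - '0' works because the C standard requires the digit characters to be consecutive in the execution character set.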
/Roger From abrahamgross at disroot.org Fri Apr 15 08:15:39 2022 From: abrahamgross at disroot.org (ag disroot) Date: Fri, 15 Apr 2022 13:15:39 +0000 (UTC) Subject: How the C programming language bridges the man-machine gap In-Reply-To: References: Message-ID: This sounds like either an essay for a school assignment or an AI written essay From doug at ewellic.org Fri Apr 15 11:20:49 2022 From: doug at ewellic.org (Doug Ewell) Date: Fri, 15 Apr 2022 10:20:49 -0600 Subject: How the C programming language bridges the man-machine gap In-Reply-To: References: Message-ID: <000201d850e4$c5be7ab0$513b7010$@ewellic.org> Roger L Costello wrote: > (As you know) Every character is represented inside the computer as a > number. To convert the character 1 to the number 1 we subtract the > numeric representation of the character 1 from the numeric > representation of the character 0: > > int val = ch - '0'; > > To convert the sequence of characters 123 to the number 123 we iterate > through each character, converting the character to a number, and then > multiplying by 1 or 10 or 100, etc. depending on the position in the > string: > > while ( ch = getchar() ) > val = val * 10 + ch - '0'; > > That little loop bridges the gap between the human-friendly string 123 > and the machine-friendly number 123. Neat! It seems you had already discovered this about sixteen months earlier: https://corp.unicode.org/pipermail/unicode/2020-December/009200.html The Unicode list is just about the last place you want to promote these ASCII-constrained code snippets. There are plenty of examples of integer-to-string and string-to-integer conversion functions that are much more Unicode-aware. In fact, Fr?d?ric Grosshans addressed exactly that in his posts dated 2020-12-16 on the thread mentioned above. -- Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org From marius.spix at web.de Fri Apr 15 12:26:06 2022 From: marius.spix at web.de (Marius Spix) Date: Fri, 15 Apr 2022 19:26:06 +0200 Subject: Aw: How the C programming language bridges the man-machine gap In-Reply-To: References: Message-ID: char literals are not reliable for arithmetic expressions. '1' - '0' = 1 may be true for Windows-1252 or EBCDIC systems, but you cannot expect that this works in all character sets. > Gesendet: Freitag, den 15.04.2022 um 14:46 Uhr > Von: "Roger L Costello via Unicode" > An: "unicode at unicode.org" > Betreff: How the C programming language bridges the man-machine gap > > Hi Folks, > > I learned two neat things this week. > > First, I learned how C bridges the gap between man and machine. > > The C programming language allows you to create a char variable: > > char ch; > > and assign it a character: > > ch = 'a'; > > Interestingly, you can do addition and subtraction on the char variable. That always puzzled me. Until today. > > Humans interact with the computer (mostly) using characters. The computer, however, (mostly) uses numbers. How does the C programming language bridge the gap between human-friendly characters and machine-friendly numbers? Answer: by allowing addition and subtraction on char variables. > > For example, what is this: 1 > > You might answer: It is the number 1. > > And you would be wrong. It is a character not a number. > > Programs are written using characters. To instruct the computer that we really mean the number 1 not the character 1, we must convert the character 1 to the number 1. That's where addition and subtraction of characters enters. 
> > (As you know) Every character is represented inside the computer as a number. To convert the character 1 to the number 1 we subtract the numeric representation of the character 1 from the numeric representation of the character 0: > > int val = ch - '0'; > > To convert the sequence of characters 123 to the number 123 we iterate through each character, converting the character to a number, and then multiplying by 1 or 10 or 100, etc. depending on the position in the string: > > while ( ch = getchar() ) > val = val * 10 + ch - '0'; > > That little loop bridges the gap between the human-friendly string 123 and the machine-friendly number 123. Neat! > > The second thing I learned this week is to liberally sprinkle assertions throughout a program. Oddly enough, this is why mastery of Boolean logic is essential. > > In his book The Science of Programming, David Gries has a great example of converting degrees Fahrenheit to degrees Celsius. He shows the value of adding assertions throughout the code to ensure that the code behaves as expected. He says that using assertions provide a way to "develop a program and its proof hand-in-hand, with the proof ideas leading the way!" > > When I went to school the computer science department emphasized learning Boolean logic. While I enjoyed learning Boolean logic, it puzzled me why there would be such emphasis on it. > > As I was reading David Gries' book the answer dawned on me: assertions are Boolean expressions; master Boolean expressions and it will take you a long way toward writing provably correct programs. Neat! > > Well, those are the two revelations for me this week. I realize they are pretty basic. Sometimes I don't fully internalize a concept until I see the right use case. This week I saw the right use case. > > /Roger > From doug at ewellic.org Fri Apr 15 13:02:03 2022 From: doug at ewellic.org (Doug Ewell) Date: Fri, 15 Apr 2022 12:02:03 -0600 Subject: How the C programming language bridges the man-machine gap In-Reply-To: References: Message-ID: <000701d850f2$e9f73670$bde5a350$@ewellic.org> Marius Spix wrote: > char literals are not reliable for arithmetic expressions. > '1' - '0' = 1 may be true for Windows-1252 or EBCDIC systems, but you > cannot expect that this works in all character sets. Modern C language specifications (at least C99) ensure that you can: > 5.2.1 Character sets > > Both the basic source and basic execution character sets shall have > the following members: the 26 uppercase letters of the Latin alphabet > > A B C D E F G H I J K L M > N O P Q R S T U V W X Y Z > > the 26 lowercase letters of the Latin alphabet > > a b c d e f g h i j k l m > n o p q r s t u v w x y z > > the 10 decimal digits > > 0 1 2 3 4 5 6 7 8 9 > > the following 29 graphic characters > > ! " # % & ' ( ) * + , - . / : > ; < = > ? [ \ ] ^ _ { | } ~ > > [...] > > In both the source and execution basic character sets, the value of > each character after 0 in the above list of decimal digits shall be > one greater than the value of the previous. I'm not aware of any character set that meets the repertoire requirement but not the digit-sequencing requirement. 
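For what it's worth, the digit-sequencing guarantee can be checked mechanically at compile time; a small C11 sketch (static_assert is available through <assert.h> from C11 on, and the helper function is only illustrative):

    #include <assert.h>

    /* 5.2.1 requires '0'..'9' to be consecutive in both the source and
       execution character sets, so this cannot fail on a conforming compiler. */
    static_assert('9' - '0' == 9, "decimal digits must be contiguous");

    /* A conversion that relies only on that guarantee. */
    static int digit_value(char c)
    {
        return (c >= '0' && c <= '9') ? (c - '0') : -1;
    }

No such ordering is promised for the letters, which is why the same trick is, strictly speaking, not guaranteed to be portable for the hexadecimal digits 'a' through 'f'.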
-- Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org From haberg-1 at telia.com Sat Apr 16 08:17:07 2022 From: haberg-1 at telia.com (=?utf-8?Q?Hans_=C3=85berg?=) Date: Sat, 16 Apr 2022 15:17:07 +0200 Subject: How the C programming language bridges the man-machine gap In-Reply-To: <000701d850f2$e9f73670$bde5a350$@ewellic.org> References: <000701d850f2$e9f73670$bde5a350$@ewellic.org> Message-ID: > On 15 Apr 2022, at 20:02, Doug Ewell via Unicode wrote: > > Marius Spix wrote: > >> char literals are not reliable for arithmetic expressions. >> '1' - '0' = 1 may be true for Windows-1252 or EBCDIC systems, but you >> cannot expect that this works in all character sets. > > Modern C language specifications (at least C99) ensure that you can: > >> 5.2.1 Character sets >> >> Both the basic source and basic execution character sets shall have >> the following members: the 26 uppercase letters of the Latin alphabet >> >> A B C D E F G H I J K L M >> N O P Q R S T U V W X Y Z >> >> the 26 lowercase letters of the Latin alphabet >> >> a b c d e f g h i j k l m >> n o p q r s t u v w x y z >> >> the 10 decimal digits >> >> 0 1 2 3 4 5 6 7 8 9 >> >> the following 29 graphic characters >> >> ! " # % & ' ( ) * + , - . / : >> ; < = > ? [ \ ] ^ _ { | } ~ >> >> [...] >> >> In both the source and execution basic character sets, the value of >> each character after 0 in the above list of decimal digits shall be >> one greater than the value of the previous. > > I'm not aware of any character set that meets the repertoire requirement but not the digit-sequencing requirement. One can't use say the Unicode superscript numbers and their code points directly for C. One way to use such superscript integers is to first pick the string up, say using a lexer like Flex, and then convert the string to ordinary digits, which then can be used in C/C++ functions. From john.w.kennedy at gmail.com Sat Apr 16 08:31:03 2022 From: john.w.kennedy at gmail.com (John W Kennedy) Date: Sat, 16 Apr 2022 09:31:03 -0400 Subject: Fwd: How the C programming language bridges the man-machine gap References: Message-ID: There were some pre-360 systems, certainly (such as the IBM 650 and 7070), and perhaps, too, the Soviet machines that used trits instead of bits, where it is not correct to assume that '1' + 1 == '2', but I doubt those machines ever ran C. But in this age when there are already major languages (e.g., Java and Swift) that support only Unicode, this entire discussion is rather moot. -- John W. Kennedy "Compact is becoming contract, Man only earns and pays." -- Charles Williams. "Bors to Elayne: On the King's Coins" >>> On Apr 15, 2022, at 2:05 PM, Doug Ewell via Unicode wrote: >>> >>> ?Marius Spix wrote: >>> >>> char literals are not reliable for arithmetic expressions. >>> '1' - '0' = 1 may be true for Windows-1252 or EBCDIC systems, but you >>> cannot expect that this works in all character sets. >> >> Modern C language specifications (at least C99) ensure that you can: >> >>> 5.2.1 Character sets >>> >>> Both the basic source and basic execution character sets shall have >>> the following members: the 26 uppercase letters of the Latin alphabet >>> >>> A B C D E F G H I J K L M >>> N O P Q R S T U V W X Y Z >>> >>> the 26 lowercase letters of the Latin alphabet >>> >>> a b c d e f g h i j k l m >>> n o p q r s t u v w x y z >>> >>> the 10 decimal digits >>> >>> 0 1 2 3 4 5 6 7 8 9 >>> >>> the following 29 graphic characters >>> >>> ! " # % & ' ( ) * + , - . / : >>> ; < = > ? [ \ ] ^ _ { | } ~ >>> >>> [...] 
>>> >>> In both the source and execution basic character sets, the value of >>> each character after 0 in the above list of decimal digits shall be >>> one greater than the value of the previous. >> >> I'm not aware of any character set that meets the repertoire requirement but not the digit-sequencing requirement. >> >> -- >> Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org >> >> >> From doug at ewellic.org Sun Apr 17 13:33:17 2022 From: doug at ewellic.org (Doug Ewell) Date: Sun, 17 Apr 2022 12:33:17 -0600 Subject: How the C programming language bridges the man-machine gap In-Reply-To: References: <000701d850f2$e9f73670$bde5a350$@ewellic.org> Message-ID: <006901d85289$9bb9c570$d32d5050$@ewellic.org> Hans ?berg wrote: > One can't use say the Unicode superscript numbers and their code > points directly for C. Was that part of the use case? -- Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org From haberg-1 at telia.com Sun Apr 17 13:58:23 2022 From: haberg-1 at telia.com (=?utf-8?Q?Hans_=C3=85berg?=) Date: Sun, 17 Apr 2022 20:58:23 +0200 Subject: How the C programming language bridges the man-machine gap In-Reply-To: <006901d85289$9bb9c570$d32d5050$@ewellic.org> References: <000701d850f2$e9f73670$bde5a350$@ewellic.org> <006901d85289$9bb9c570$d32d5050$@ewellic.org> Message-ID: <355FC218-079C-46F1-8FA9-6D76A47EBE4E@telia.com> > On 17 Apr 2022, at 20:33, Doug Ewell wrote: > > Hans ?berg wrote: > >> One can't use say the Unicode superscript numbers and their code >> points directly for C. > > Was that part of the use case? 5.2.1 Character sets ? In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous. From doug at ewellic.org Mon Apr 18 00:51:05 2022 From: doug at ewellic.org (Doug Ewell) Date: Sun, 17 Apr 2022 23:51:05 -0600 Subject: Basic Latin digits, not everything else (was: RE: How the C programming language bridges the man-machine gap) Message-ID: <007801d852e8$4bea5b90$e3bf12b0$@ewellic.org> Hans ?berg wrote: >>> One can't use say the Unicode superscript numbers and their code >>> points directly for C. >> >> Was that part of the use case? > > 5.2.1 Character sets > ? > In both the source and execution basic character sets, the value of > each character after 0 in the above list of decimal digits shall be > one greater than the value of the previous. I think it's abundantly clear that the C standard, specifically "the above list of decimal digits," applies to the Basic Latin digits U+0030 through U+0039, and not to superscript digits, subscript digits, negative circled digits, mathematical sans-serif bold digits, or any other digits encoded in Unicode. The digits shown in the PDF version of ISO/IEC 9899 are, visually, the size and alignment one would expect of Basic Latin digits. The accompanying text makes no mention of superscript digits. If superscript digits had been intended, one would think they would have been explicitly mentioned along with, or in opposition to, normal digits. Although Roger Costello's original post in this thread was elementary for this list, it did clearly describe the normal process of converting integer values to and from strings of Basic Latin digits, not strings of other kinds of digits. THAT is what we are talking about here. I'm not sure exactly what Murray Sargent meant by using superscript and subscript digits in C++ programs. 
If there are standard libraries for converting those to integer values and back, I wasn't aware of them. But in any case, yes! It is true! Processing the superscript digits does require dealing with non-contiguous code point allocations and non?strictly-increasing code point order, because some of them were encoded in Latin-1 Supplement as part of ISO 8859-1 direct convertibility while the rest were added to the Superscripts and Subscripts block. But I haven't got a clue what this has to do with handling of Basic Latin digits. It seems we are having a hard time staying focused in this thread. I received another response, offline, that the C standard doesn't promise anything about order and contiguity of hex digits (i.e. decimal digits interspersed with letters), nor does it discuss how to handle non-positional number systems, such as Greek and Roman numerals. No, it doesn't, but how do these prove or disprove anything about the handling of '0' through '9' either? Perhaps I'm dense and simply need the responses to be more clear as to whether they are disputing something stated in the C standard, or just demonstrating knowledge that there are other digits and number systems for which the stated rules don't apply. Or perhaps I'm just being grumpy. -- Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org From haberg-1 at telia.com Mon Apr 18 02:46:09 2022 From: haberg-1 at telia.com (=?utf-8?Q?Hans_=C3=85berg?=) Date: Mon, 18 Apr 2022 09:46:09 +0200 Subject: Basic Latin digits, not everything else (was: RE: How the C programming language bridges the man-machine gap) In-Reply-To: <007801d852e8$4bea5b90$e3bf12b0$@ewellic.org> References: <007801d852e8$4bea5b90$e3bf12b0$@ewellic.org> Message-ID: <5206DCC9-2AF1-4607-9BD5-744117ED9833@telia.com> > On 18 Apr 2022, at 07:51, Doug Ewell via Unicode wrote: > > Hans ?berg wrote: > >>>> One can't use say the Unicode superscript numbers and their code >>>> points directly for C. >>> >>> Was that part of the use case? >> >> 5.2.1 Character sets >> ? >> In both the source and execution basic character sets, the value of >> each character after 0 in the above list of decimal digits shall be >> one greater than the value of the previous. > > I think it's abundantly clear that the C standard, specifically "the above list of decimal digits," applies to the Basic Latin digits U+0030 through U+0039, and not to superscript digits, subscript digits, negative circled digits, mathematical sans-serif bold digits, or any other digits encoded in Unicode. The standard only says that from the point of view of C that those should be available, not how they should be represented. From costello at mitre.org Fri Apr 15 10:52:30 2022 From: costello at mitre.org (Roger L Costello) Date: Fri, 15 Apr 2022 15:52:30 +0000 Subject: How the C programming language bridges the man-machine gap Message-ID: > This sounds like either an essay for a school assignment or an AI written essay Sorry if it came across like that. It's neither. Just sharing something that I found interesting/enlightening and thought it might be of interest to others. 
/Roger -----Original Message----- From: Unicode On Behalf Of ag disroot via Unicode Sent: Friday, April 15, 2022 9:16 AM To: unicode at corp.unicode.org Subject: [EXT] How the C programming language bridges the man-machine gap This sounds like either an essay for a school assignment or an AI written essay From wjgo_10009 at btinternet.com Mon Apr 18 08:34:11 2022 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Mon, 18 Apr 2022 14:34:11 +0100 (BST) Subject: global password strategies Message-ID: <20de6c11.7b3.1803ce17b77.Webtop.102@btinternet.com> Mark E. Shoulson wrote: > There might be something to this. Some semi-standardized set of > keyboard layouts and input methods that can be immediately and > temporarily activated in nearly any state of the computer, by a > well-known keystroke or menu or whatever, ... How about CONTROL 7 for activating from a keyboard, and by clicking the logo as in the attached graphics file for activating by clicking a logo on a start up screen. The use of CONTROL 7 would be capable of being extended to clicking on CONTROL (the digit for 7 in any script) I appreciate that if a language uses a different set of letters from those used for English yet uses the same set of digit glyphs as does English then something extra is needed. > so you need to know how to type your password in *one* of them and how > to select it, and then you can just click it out on an emulated > keyboard onscreen (or better, have the keyboard actually remap > itself.) Yes, always getting the emoji keyboard and another keyboard. Perhaps the other keyboard could be changed by clicking on a flag from amongst a display of flags. > Kind of tricky to get the details right and decide on what and how and > all; Yes. > the market and vendors will probably have to converge Yes. > (slowly) Well, if CONTROL 7 and my suggested logo are used to get started, either or both could be retained or replaced as consensus emerges, and using CONTROL 7 and my suggested logo is independent of each of the vendors, so a level start for trying to reach a consensus, so a possibility for a prompt start. > on some consensus. Yes. > No, Unicode can't dictate this as a standard, Well, I don't think that Unicode Inc. *dictates* anything does it. > it's out of scope for Unicode, For people to use a computer system to produce, say, stories and poems in their own language using Unicode and safely conserve them on a shared system, the people need to be able to get onto the computer system. So for me, a standardized, though optional, way to conveniently enter a password into a computer in order to be able to apply Unicode to produce, say, stories and poems, is part of the goal of helping people to use their own language on computer systems. But it is not a matter for me to decide whether it is or is not in scope for Unicode Inc. to be involved in publishing a password entry format for computer systems. But anyway, if Unicode Inc. were to take this topic on then it would probably help it become implemented much faster than it would otherwise be implemented, if indeed it ever would be implemented otherwise. > and pretty much every other standards organization too: I don't know one way or the other. > there is no standard dictating user interface. Just some popular > conventions that have become fairly universal. Well some of user interface is product styling, so not for standardizing at all. So, thank you for your input, progress is being made as we iterate towards a solution. 
William Overington Monday 18 April 2022 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: on_screen_logo_to_click_to_open_the_password_panel.svg Type: image/svg+xml Size: 823 bytes Desc: not available URL: From doug at ewellic.org Mon Apr 18 12:41:59 2022 From: doug at ewellic.org (Doug Ewell) Date: Mon, 18 Apr 2022 11:41:59 -0600 Subject: Basic Latin digits, not everything else (was: RE: How the C programming language bridges the man-machine gap) In-Reply-To: <5206DCC9-2AF1-4607-9BD5-744117ED9833@telia.com> References: <007801d852e8$4bea5b90$e3bf12b0$@ewellic.org> <5206DCC9-2AF1-4607-9BD5-744117ED9833@telia.com> Message-ID: <009f01d8534b$9b6b9040$d242b0c0$@ewellic.org> Hans ?berg wrote: >> I think it's abundantly clear that the C standard, specifically "the >> above list of decimal digits," applies to the Basic Latin digits >> U+0030 through U+0039, and not to superscript digits, subscript >> digits, negative circled digits, mathematical sans-serif bold digits, >> or any other digits encoded in Unicode. > > The standard only says that from the point of view of C that those > should be available, not how they should be represented. The superscript European digits are not the same characters as the regular, full-size European digits, by either Unicode's definition of "same" or that of any other character encoding standard. Thus the C standard is only talking about 0123456789, not ??????????. -- Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org From jr at qsm.co.il Mon Apr 18 12:57:25 2022 From: jr at qsm.co.il (Jonathan Rosenne) Date: Mon, 18 Apr 2022 17:57:25 +0000 Subject: Basic Latin digits, not everything else (was: RE: How the C programming language bridges the man-machine gap) In-Reply-To: <009f01d8534b$9b6b9040$d242b0c0$@ewellic.org> References: <007801d852e8$4bea5b90$e3bf12b0$@ewellic.org> <5206DCC9-2AF1-4607-9BD5-744117ED9833@telia.com> <009f01d8534b$9b6b9040$d242b0c0$@ewellic.org> Message-ID: The Arabic-Hindi digits, 0660 to 0669, and 06F0 to 06F9, also have this property. Maybe the C standard should be updated. Best Regards, Jonathan Rosenne -----Original Message----- From: Unicode On Behalf Of Doug Ewell via Unicode Sent: Monday, April 18, 2022 8:42 PM To: 'Hans ?berg' Cc: 'Marius Spix' ; 'Roger L Costello via Unicode' Subject: RE: Basic Latin digits, not everything else (was: RE: How the C programming language bridges the man-machine gap) Hans ?berg wrote: >> I think it's abundantly clear that the C standard, specifically "the >> above list of decimal digits," applies to the Basic Latin digits >> U+0030 through U+0039, and not to superscript digits, subscript >> digits, negative circled digits, mathematical sans-serif bold digits, >> or any other digits encoded in Unicode. > > The standard only says that from the point of view of C that those > should be available, not how they should be represented. The superscript European digits are not the same characters as the regular, full-size European digits, by either Unicode's definition of "same" or that of any other character encoding standard. Thus the C standard is only talking about 0123456789, not ??????????. 
-- Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org From haberg-1 at telia.com Mon Apr 18 13:26:50 2022 From: haberg-1 at telia.com (=?utf-8?Q?Hans_=C3=85berg?=) Date: Mon, 18 Apr 2022 20:26:50 +0200 Subject: Basic Latin digits, not everything else (was: RE: How the C programming language bridges the man-machine gap) In-Reply-To: <009f01d8534b$9b6b9040$d242b0c0$@ewellic.org> References: <007801d852e8$4bea5b90$e3bf12b0$@ewellic.org> <5206DCC9-2AF1-4607-9BD5-744117ED9833@telia.com> <009f01d8534b$9b6b9040$d242b0c0$@ewellic.org> Message-ID: > On 18 Apr 2022, at 19:41, Doug Ewell wrote: > > Hans ?berg wrote: > >>> I think it's abundantly clear that the C standard, specifically "the >>> above list of decimal digits," applies to the Basic Latin digits >>> U+0030 through U+0039, and not to superscript digits, subscript >>> digits, negative circled digits, mathematical sans-serif bold digits, >>> or any other digits encoded in Unicode. >> >> The standard only says that from the point of view of C that those >> should be available, not how they should be represented. > > The superscript European digits are not the same characters as the regular, full-size European digits, by either Unicode's definition of "same" or that of any other character encoding standard. Thus the C standard is only talking about 0123456789, not ??????????. I suggest you check in some C language standard forum. From doug at ewellic.org Mon Apr 18 13:47:19 2022 From: doug at ewellic.org (Doug Ewell) Date: Mon, 18 Apr 2022 12:47:19 -0600 Subject: Basic Latin digits, not everything else (was: RE: How the C programming language bridges the man-machine gap) In-Reply-To: References: <007801d852e8$4bea5b90$e3bf12b0$@ewellic.org> <5206DCC9-2AF1-4607-9BD5-744117ED9833@telia.com> <009f01d8534b$9b6b9040$d242b0c0$@ewellic.org> Message-ID: <00a001d85354$bc495370$34dbfa50$@ewellic.org> Hans ?berg wrote: >> The superscript European digits are not the same characters as the >> regular, full-size European digits, by either Unicode's definition of >> "same" or that of any other character encoding standard. Thus the C >> standard is only talking about 0123456789, not ??????????. > > I suggest you check in some C language standard forum. Then this constraint (consecutive and strictly increasing) cannot be met using Unicode or any other character encoding that I have ever seen. -- Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org From Jens.Maurer at gmx.net Mon Apr 18 14:10:58 2022 From: Jens.Maurer at gmx.net (Jens Maurer) Date: Mon, 18 Apr 2022 21:10:58 +0200 Subject: Basic Latin digits, not everything else (was: RE: How the C programming language bridges the man-machine gap) In-Reply-To: <00a001d85354$bc495370$34dbfa50$@ewellic.org> References: <007801d852e8$4bea5b90$e3bf12b0$@ewellic.org> <5206DCC9-2AF1-4607-9BD5-744117ED9833@telia.com> <009f01d8534b$9b6b9040$d242b0c0$@ewellic.org> <00a001d85354$bc495370$34dbfa50$@ewellic.org> Message-ID: On 18/04/2022 20.47, Doug Ewell via Unicode wrote: > Hans ?berg wrote: > >>> The superscript European digits are not the same characters as the >>> regular, full-size European digits, by either Unicode's definition of >>> "same" or that of any other character encoding standard. Thus the C >>> standard is only talking about 0123456789, not ??????????. >> >> I suggest you check in some C language standard forum. > > Then this constraint (consecutive and strictly increasing) cannot be met using Unicode or any other character encoding that I have ever seen. 
I sense some confusion here, but it's a bit hard for me to pinpoint it. I've been participating in the standardization of C++ for more than 20 years; C++ has a similar provision. The C standard (ISO 9899) says in section 5.2.1 paragraph 3: "In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous." Note the use of the term "basic character set". That is defined directly above based on the Latin alphabet and "the 10 decimal digits" 0...9. This is all understood to be subsets of the ASCII repertoire; any superscript or non-Western representation of digits is not in view here. Jens From haberg-1 at telia.com Mon Apr 18 14:36:54 2022 From: haberg-1 at telia.com (=?utf-8?Q?Hans_=C3=85berg?=) Date: Mon, 18 Apr 2022 21:36:54 +0200 Subject: Basic Latin digits, not everything else (was: RE: How the C programming language bridges the man-machine gap) In-Reply-To: References: <007801d852e8$4bea5b90$e3bf12b0$@ewellic.org> <5206DCC9-2AF1-4607-9BD5-744117ED9833@telia.com> <009f01d8534b$9b6b9040$d242b0c0$@ewellic.org> <00a001d85354$bc495370$34dbfa50$@ewellic.org> Message-ID: <7D868FBF-996C-4105-B19A-7F266072E68F@telia.com> > On 18 Apr 2022, at 21:10, Jens Maurer wrote: > > I sense some confusion here, but it's a bit hard for me > to pinpoint it. I've been participating in the standardization > of C++ for more than 20 years; C++ has a similar provision. > > The C standard (ISO 9899) says in section 5.2.1 paragraph 3: > > "In both the source and execution basic character sets, > the value of each character after 0 in the above list > of decimal digits shall be one greater than the value > of the previous." > > Note the use of the term "basic character set". > > That is defined directly above based on the Latin > alphabet and "the 10 decimal digits" 0...9. This > is all understood to be subsets of the ASCII > repertoire; any superscript or non-Western > representation of digits is not in view here. So in your interpretation, a C or a C++ compiler cannot use EBCDIC? C++ used to have trigraphs to allow for that encoding. The question is not what is a useful version of a C compiler, but what is acceptable by the C standard. The main intent, as I see it, is to allow defining C programs in a fairly portable way. So if one ensures the digits chosen are consecutive, one can write a portable C program using that feature by keeping track of the character translation. Another requirement is that the values must also fit into a C byte. So one must keep track of what a C byte is. One might compare with Unicode: it does not define what binary representation the code points should have, one only gets that by applying encodings like UTF-8 etc., but one does not have to use those standard encodings.
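To make the practical force of 5.2.1 concrete, here is a minimal C sketch (illustrative only, not taken from any of the messages in this thread) of the conversion the guarantee enables: because '0' through '9' must be contiguous and increasing in the execution character set, the classic c - '0' idiom is portable across ASCII-based encodings and every EBCDIC code page alike.

#include <stdio.h>

/* Returns the numeric value of a Latin decimal digit, or -1 otherwise.
   Portable because ISO 9899 5.2.1 guarantees that '0'..'9' are encoded
   consecutively and in increasing order in the execution character set. */
static int digit_value(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    return -1;
}

int main(void)
{
    const char *s = "45";
    int n = 0;
    for (; *s != '\0'; ++s)
        n = 10 * n + digit_value(*s);   /* builds the integer 45 */
    printf("%d\n", n);
    return 0;
}

No comparable guarantee exists for letters, so the analogous c - 'a' trick is exactly the kind of code that breaks under EBCDIC, where the lowercase letters fall into three separate ranges.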
From Jens.Maurer at gmx.net Mon Apr 18 14:51:12 2022 From: Jens.Maurer at gmx.net (Jens Maurer) Date: Mon, 18 Apr 2022 21:51:12 +0200 Subject: Basic Latin digits, not everything else (was: RE: How the C programming language bridges the man-machine gap) In-Reply-To: <7D868FBF-996C-4105-B19A-7F266072E68F@telia.com> References: <007801d852e8$4bea5b90$e3bf12b0$@ewellic.org> <5206DCC9-2AF1-4607-9BD5-744117ED9833@telia.com> <009f01d8534b$9b6b9040$d242b0c0$@ewellic.org> <00a001d85354$bc495370$34dbfa50$@ewellic.org> <7D868FBF-996C-4105-B19A-7F266072E68F@telia.com> Message-ID: On 18/04/2022 21.36, Hans ?berg wrote: > >> On 18 Apr 2022, at 21:10, Jens Maurer wrote: >> >> I sense some confusion here, but it's a bit hard for me >> to pinpoint it. I've been participating in the standardization >> of C++ for more than 20 years; C++ has a similar provision. >> >> The C standard (ISO 9899) says in section 5.2.1 paragraph 3: >> >> "In both the source and execution basic character sets, >> the value of each character after 0 in the above list >> of decimal digits shall be one greater than the value >> of the previous." >> >> Note the use of the term "basic character set". >> >> That is defined directly above based on the Latin >> alphabet and "the 10 decimal digits" 0...9. This >> is all understood to be subsets of the ASCII >> repertoire; any superscript or non-Western >> representation of digits is not in view here. > > So in your interpretation, a C or a C++ compiler cannot use EBDIC? ?C++ used to have trigraphs to allow for that encoding. Sure, a compiler can use EBCDIC, and existing compilers do. I said "ASCII repertoire", not "ASCII encoding". According to https://en.wikipedia.org/wiki/EBCDIC EBCDIC does have contiguous digits. (However, letters are not contiguous, but that is not what we're talking about here.) > The question is not what is a useful version of a C compiler, but what is acceptable by the C standard. The main intent, as I see it, is allow to define C programs in a fairly portable way. So if one ensures the digits chosen are consecutive, one can write a portable C program using that feature by keeping track of the character translation. The whole point of a programming language standard is to permit writing portable programs --- portable across compilers and hardware/operating system environments. The requirement in the C and C++ standards about contiguous digits ensures that programs relying on that property are portable to all conforming compilers. (In contrast, programs relying on contiguous Latin letters are not so portable.) > Another requirement is that the values must also fit into a C byte. So one must keep track of what a C byte is. I don't know what you mean by "one must keep track..." > One might compare with Unicode, it does not define what binary representation the code points should have, one only gets that by applying encodings like UTF-8 etc., but one does not have to use those standard encodings. Right, but it seems a programming language standard is at liberty to impose restrictions on the generality of Unicode when deemed practical. 
Jens From tom at honermann.net Mon Apr 18 14:54:15 2022 From: tom at honermann.net (Tom Honermann) Date: Mon, 18 Apr 2022 15:54:15 -0400 Subject: Basic Latin digits, not everything else (was: RE: How the C programming language bridges the man-machine gap) In-Reply-To: <7D868FBF-996C-4105-B19A-7F266072E68F@telia.com> References: <007801d852e8$4bea5b90$e3bf12b0$@ewellic.org> <5206DCC9-2AF1-4607-9BD5-744117ED9833@telia.com> <009f01d8534b$9b6b9040$d242b0c0$@ewellic.org> <00a001d85354$bc495370$34dbfa50$@ewellic.org> <7D868FBF-996C-4105-B19A-7F266072E68F@telia.com> Message-ID: <18700088-b2f1-98dc-c0e3-d70d02701ff7@honermann.net> On 4/18/22 3:36 PM, Hans ?berg via Unicode wrote: >> On 18 Apr 2022, at 21:10, Jens Maurer wrote: >> >> I sense some confusion here, but it's a bit hard for me >> to pinpoint it. I've been participating in the standardization >> of C++ for more than 20 years; C++ has a similar provision. >> >> The C standard (ISO 9899) says in section 5.2.1 paragraph 3: >> >> "In both the source and execution basic character sets, >> the value of each character after 0 in the above list >> of decimal digits shall be one greater than the value >> of the previous." >> >> Note the use of the term "basic character set". >> >> That is defined directly above based on the Latin >> alphabet and "the 10 decimal digits" 0...9. This >> is all understood to be subsets of the ASCII >> repertoire; any superscript or non-Western >> representation of digits is not in view here. > So in your interpretation, a C or a C++ compiler cannot use EBDIC? ?C++ used to have trigraphs to allow for that encoding. Strictly conforming C and C++ compilers can use EBCDIC so long as the EBCDIC code pages used for character and string literals (at compile-time) and the locale encoding of execution character sets (at run-time) is constrained to EBCDIC code pages that satisfy the property that decimal digits are encoded in sequence. Other code pages can be supported as extensions. The necessary property is probably satisfied by all EBCDIC code pages (but I don't know that for sure) since the decimal digits are encoded in sequence in the invariant subset of EBCDIC. Tom. > > The question is not what is a useful version of a C compiler, but what is acceptable by the C standard. The main intent, as I see it, is allow to define C programs in a fairly portable way. So if one ensures the digits chosen are consecutive, one can write a portable C program using that feature by keeping track of the character translation. > > Another requirement is that the values must also fit into a C byte. So one must keep track of what a C byte is. > > One might compare with Unicode, it does not define what binary representation the code points should have, one only gets that by applying encodings like UTF-8 etc., but one does not have to use those standard encodings. 
> > > From richard.wordingham at ntlworld.com Mon Apr 18 14:55:16 2022 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 18 Apr 2022 20:55:16 +0100 Subject: Basic Latin digits, not everything else (was: RE: How the C programming language bridges the man-machine gap) In-Reply-To: <7D868FBF-996C-4105-B19A-7F266072E68F@telia.com> References: <007801d852e8$4bea5b90$e3bf12b0$@ewellic.org> <5206DCC9-2AF1-4607-9BD5-744117ED9833@telia.com> <009f01d8534b$9b6b9040$d242b0c0$@ewellic.org> <00a001d85354$bc495370$34dbfa50$@ewellic.org> <7D868FBF-996C-4105-B19A-7F266072E68F@telia.com> Message-ID: <20220418205516.7d045216@JRWUBU2> On Mon, 18 Apr 2022 21:36:54 +0200 Hans ?berg via Unicode wrote: > So in your interpretation, a C or a C++ compiler cannot use EBDIC? As far as I am aware, the three ranges '0' to '9', 'a' to 'f' and 'A' to 'F' are each contiguous in EBCDIC codes. Richard. From haberg-1 at telia.com Mon Apr 18 15:32:14 2022 From: haberg-1 at telia.com (=?utf-8?Q?Hans_=C3=85berg?=) Date: Mon, 18 Apr 2022 22:32:14 +0200 Subject: Basic Latin digits, not everything else (was: RE: How the C programming language bridges the man-machine gap) In-Reply-To: References: <007801d852e8$4bea5b90$e3bf12b0$@ewellic.org> <5206DCC9-2AF1-4607-9BD5-744117ED9833@telia.com> <009f01d8534b$9b6b9040$d242b0c0$@ewellic.org> <00a001d85354$bc495370$34dbfa50$@ewellic.org> <7D868FBF-996C-4105-B19A-7F266072E68F@telia.com> Message-ID: <5BBEEEE7-3850-4FB6-81F7-0A52EC2D3569@telia.com> > On 18 Apr 2022, at 21:51, Jens Maurer via Unicode wrote: > >>> That is defined directly above based on the Latin >>> alphabet and "the 10 decimal digits" 0...9. This >>> is all understood to be subsets of the ASCII >>> repertoire; any superscript or non-Western >>> representation of digits is not in view here. >> >> So in your interpretation, a C or a C++ compiler cannot use EBDIC? ?C++ used to have trigraphs to allow for that encoding. > > Sure, a compiler can use EBCDIC, and existing compilers do. > I said "ASCII repertoire", not "ASCII encoding". The C standard does not refer to ASCII at all. >> Another requirement is that the values must also fit into a C byte. So one must keep track of what a C byte is. > > I don't know what you mean by "one must keep track..." If the values used do not fit into an octet, one must use a larger byte, and such have used been in the past, but not nowadays, I think. But large enough to carry all the Unicode values in a byte might be a possibility. An expert on C might tune in. But this is carrying too far away into tangents: I just wanted to illustrate one aspect of the C standard, the requirement that the digit values must consecutive. From mark at kli.org Mon Apr 18 16:32:32 2022 From: mark at kli.org (Mark E. Shoulson) Date: Mon, 18 Apr 2022 17:32:32 -0400 Subject: global password strategies In-Reply-To: <20de6c11.7b3.1803ce17b77.Webtop.102@btinternet.com> References: <20de6c11.7b3.1803ce17b77.Webtop.102@btinternet.com> Message-ID: <6b7fe0d6-1148-cc65-dcd7-d2c1703aaa8a@shoulson.com> I'm willing to go out on a limb and assert this is out of scope for Unicode, and even for every other standards body I've heard of.? The way these things get standardized is by adoption by the industry as good ideas (and deep pockets) compete.? So if you want something like this standardized, go for it:? market your products with it (or convince someone who makes products.)? It's not really for here.? Have fun. ~mark On 4/18/22 09:34, William_J_G Overington via Unicode wrote: > Mark E. 
Shoulson wrote: > > > > There might be something to this. Some semi-standardized set of > keyboard layouts and input methods that can be immediately and > temporarily activated in nearly any state of the computer, by a > well-known keystroke or menu or whatever, ... > > > How about > > > CONTROL 7 > > > for activating from a keyboard, > > > and by clicking the logo as in the attached graphics file for > activating by clicking a logo on a start up screen. > > > The use of CONTROL 7 would be capable of being extended to clicking on > > > CONTROL (the digit for 7 in any script) > > > I appreciate that if a language uses a different set of letters from > those used for English yet uses the same set of digit glyphs as does > English then something extra is needed. > > > > so you need to know how to type your password in *one* of them and how to select it, and then > you can just click it out on an emulated keyboard onscreen (or better, > have the keyboard actually remap itself.) > > > Yes, always getting the emoji keyboard and another keyboard. Perhaps > the other keyboard could be changed by clicking on a flag from amongst > a display of flags. > > > > Kind of tricky to get the details right and decide on what and how and > all; > > > Yes. > > > >the market and vendors will probably have to converge > > > Yes. > > > > (slowly) > > > Well, if CONTROL 7 and my suggested logo are used to get started, > either or both could be retained or replaced as consensus emerges, and > using CONTROL 7 and my suggested logo is independent of each of the > vendors, so a level start for trying to reach a consensus, so a > possibility for a prompt start. > > > > on some consensus. > > > Yes. > > > > No, Unicode can't dictate this as a standard, > > > Well, I don't think that Unicode Inc. *dictates* anything does it. > > > > it's out of scope for Unicode, > > > For people to use a computer system to produce, say, stories and poems > in their own language using Unicode and safely conserve them on a > shared system, the people need to be able to get onto the computer > system. So for me, a standardized, though optional, way to > conveniently enter a password into a computer in order to be able to > apply Unicode to produce, say, stories and poems, is part of the goal > of helping people to use their own language on computer systems. > > > But it is not a matter for me to decide whether it is or is not in > scope for Unicode Inc. to be involved in publishing a password entry > format for computer systems. > > > But anyway, if Unicode Inc. were to take this topic on then it would > probably help it become implemented much faster than it would > otherwise be implemented, if indeed it ever would be implemented > otherwise. > > > > and pretty much every other standards organization too: > > > I don't know one way or the other. > > > > there is no standard dictating user interface. Just some popular conventions that have > become fairly universal. > > > Well some of user interface is product styling, so not for > standardizing at all. > > > So, thank you for your input, progress is being made as we iterate > towards a solution. > > > William Overington > > > Monday 18 April 2022 > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From doug at ewellic.org Mon Apr 18 16:42:16 2022 From: doug at ewellic.org (Doug Ewell) Date: Mon, 18 Apr 2022 15:42:16 -0600 Subject: Basic Latin digits, not everything else (was: RE: How the C programming language bridges the man-machine gap) In-Reply-To: <5BBEEEE7-3850-4FB6-81F7-0A52EC2D3569@telia.com> References: <007801d852e8$4bea5b90$e3bf12b0$@ewellic.org> <5206DCC9-2AF1-4607-9BD5-744117ED9833@telia.com> <009f01d8534b$9b6b9040$d242b0c0$@ewellic.org> <00a001d85354$bc495370$34dbfa50$@ewellic.org> <7D868FBF-996C-4105-B19A-7F266072E68F@telia.com> <5BBEEEE7-3850-4FB6-81F7-0A52EC2D3569@telia.com> Message-ID: <00a701d8536d$2ca85680$85f90380$@ewellic.org> Hans Åberg wrote: >> Sure, a compiler can use EBCDIC, and existing compilers do. >> I said "ASCII repertoire", not "ASCII encoding". > > The C standard does not refer to ASCII at all. Of course it doesn't. The digits in question are those EQUIVALENT to Unicode U+0030 through U+0039, or ASCII 0x30 through 0x39, or EBCDIC (yes, any EBCDIC) 0xF0 through 0xF9, using some character encoding. > If the values used do not fit into an octet, one must use a larger > byte, and such have used been in the past, but not nowadays, I think. > But large enough to carry all the Unicode values in a byte might be a > possibility. An expert on C might tune in. Is this related in some way to the topic that was under discussion? Thanks to Jens Maurer for setting the record straight on superscripts from a position of authority. -- Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org From marius.spix at web.de Mon Apr 18 17:24:59 2022 From: marius.spix at web.de (Marius Spix) Date: Tue, 19 Apr 2022 00:24:59 +0200 Subject: Basic Latin digits, not everything else (was: RE: How the C programming language bridges the man-machine gap) In-Reply-To: References: <007801d852e8$4bea5b90$e3bf12b0$@ewellic.org> <5206DCC9-2AF1-4607-9BD5-744117ED9833@telia.com> <009f01d8534b$9b6b9040$d242b0c0$@ewellic.org> <00a001d85354$bc495370$34dbfa50$@ewellic.org> Message-ID: <20220419002459.70c87d3a@spixxi> Also note > On Mon, 18 Apr 2022 21:10:58 +0200 Jens Maurer via Unicode wrote: > On 18/04/2022 20.47, Doug Ewell via Unicode wrote: > > Hans Åberg wrote: > "In both the source and execution basic character sets, > the value of each character after 0 in the above list > of decimal digits shall be one greater than the value > of the previous." > > Note the use of the term "basic character set". Also note that SHALL be does not mean MUST be. For example, the basic character set SHALL include certain characters like "[", "]", "{" or "}", but whenever they do not exist in the current character set, C allows replacing them with digraphs and trigraphs. C++ also adds alternative tokens (like "and" or "xor" instead of "&&" or "^"). Trigraphs are not supported in C++17 anymore, which breaks downwards-compatibility. C also expects that the backslash (\, ASCII codepoint 0x5C) is used for escape sequences in string literals, but some users of Shift JIS encoding use the Yen sign (¥), which shares the same codepoint 0x5C.
Regards, Marius From Jens.Maurer at gmx.net Tue Apr 19 02:25:57 2022 From: Jens.Maurer at gmx.net (Jens Maurer) Date: Tue, 19 Apr 2022 09:25:57 +0200 Subject: Basic Latin digits, not everything else (was: RE: How the C programming language bridges the man-machine gap) In-Reply-To: <20220419002459.70c87d3a@spixxi> References: <007801d852e8$4bea5b90$e3bf12b0$@ewellic.org> <5206DCC9-2AF1-4607-9BD5-744117ED9833@telia.com> <009f01d8534b$9b6b9040$d242b0c0$@ewellic.org> <00a001d85354$bc495370$34dbfa50$@ewellic.org> <20220419002459.70c87d3a@spixxi> Message-ID: <00b471dd-d5a6-d88d-5cea-3aca4fb57db9@gmx.net> On 19/04/2022 00.24, Marius Spix via Unicode wrote: > Also note > > > On Mon, 18 Apr 2022 21:10:58 +0200 > Jens Maurer via Unicode wrote: > >> On 18/04/2022 20.47, Doug Ewell via Unicode wrote: >>> Hans ?berg wrote: >> "In both the source and execution basic character sets, >> the value of each character after 0 in the above list >> of decimal digits shall be one greater than the value >> of the previous." >> >> Note the use of the term "basic character set". > > Also note that SHALL be does not mean MUST be. Let me take exception to that statement. Any standard describes things that may conform to that standard or not. Anyone can do whatever they want, but if you want to claim conformance to some standard for your thing, you need to satisfy all its SHALL prescriptions. So, if your implementation of C does not satisfy the rule about contiguous encoding of (Latin) digits, yours is simply not an implementation of C conforming to ISO 9899. (Whether you're allowed to even call it "C" in that case is a related, but different question.) > For example, the basic > character set SHALL include certain characters like ?[?, ?]?, ?{? or > ?}?, but whenever they do not exist in the current character set, C > allows to replace them by digraphs and trigraphs. The "basic character set" that the C and C++ standards talk about is an abstract set of characters, unrelated to any specific encoding. In order to ease programming in C and C++, character sequences to name characters that might not be easily accessible on some keyboards have been introduced. Conceptually, trigraphs are replaced while reading the individual characters of your source file, while digraphs are just alternative tokens, recognized during lexing. (The different treatment makes a difference in string literals, for instance: trigraphs are replaced in string literals, digraphs are not.) A programmer can use trigraphs and digraphs even if the current character set fully supports all the basic characters of C. > C++ also adds and > alternative tokens (like ?and? or ?xor? instead of ?&&? or ?^?). > Trigraphs are not supported in C++17 anymore, which breaks > downwards-compatibility. Those audiences who care about trigraphs have been assured that their compilers can continue to support trigraphs as a conforming extension, to limit the breakage in practice. > C also expect that the backslash (\, ASCII codepoint 0x5C) is used for > escape sequences in string literals, but some users of Shift JIS > encoding use the Yen sign (?), with shares the same codepoint 0x5C. That sounds like the encoding expected by the compiler is different from the encoding used for screen display. 
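Since trigraphs, digraphs and the <iso646.h> spellings keep coming up, a small self-contained C example may help. It is only a sketch of the alternative spellings, assuming a C99-or-later compiler, and is not taken from any of the messages here; the comments record the string-literal distinction described above.

%:include <iso646.h>            /* %: is the digraph for #             */
%:include <stdio.h>

int main(void)
<%                              /* <% and %> are digraphs for { and }  */
    int v<:2:> = <% 1, 0 %>;    /* <: and :> are digraphs for [ and ]  */
    if (v<:0:> and not v<:1:>)  /* <iso646.h> macros for && and !      */
        puts("alternative spellings accepted");
    /* Digraphs are ordinary tokens, so the characters <% in the next
       string literal stay exactly as written.  Trigraphs such as ??(
       were substituted even inside string literals, which is the
       behavioural difference noted above; they were removed in C++17
       and later dropped from C as well. */
    puts("a literal <% is not replaced");
    return 0;
%>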
Jens From haberg-1 at telia.com Tue Apr 19 02:43:18 2022 From: haberg-1 at telia.com (=?utf-8?Q?Hans_=C3=85berg?=) Date: Tue, 19 Apr 2022 09:43:18 +0200 Subject: Basic Latin digits, not everything else (was: RE: How the C programming language bridges the man-machine gap) In-Reply-To: <00a701d8536d$2ca85680$85f90380$@ewellic.org> References: <007801d852e8$4bea5b90$e3bf12b0$@ewellic.org> <5206DCC9-2AF1-4607-9BD5-744117ED9833@telia.com> <009f01d8534b$9b6b9040$d242b0c0$@ewellic.org> <00a001d85354$bc495370$34dbfa50$@ewellic.org> <7D868FBF-996C-4105-B19A-7F266072E68F@telia.com> <5BBEEEE7-3850-4FB6-81F7-0A52EC2D3569@telia.com> <00a701d8536d$2ca85680$85f90380$@ewellic.org> Message-ID: <5E951F84-1B4A-47F9-B14F-9A02B42ABB28@telia.com> > On 18 Apr 2022, at 23:42, Doug Ewell via Unicode wrote: > >> If the values used do not fit into an octet, one must use a larger >> byte, and such have used been in the past, but not nowadays, I think. >> But large enough to carry all the Unicode values in a byte might be a >> possibility. An expert on C might tune in. > > Is this related in some way to the topic that was under discussion? If one wants to use Unicode values that do not fit into an octet, the C byte must be enlarged. From haberg-1 at telia.com Tue Apr 19 02:49:08 2022 From: haberg-1 at telia.com (=?utf-8?Q?Hans_=C3=85berg?=) Date: Tue, 19 Apr 2022 09:49:08 +0200 Subject: Basic Latin digits, not everything else (was: RE: How the C programming language bridges the man-machine gap) In-Reply-To: <20220419002459.70c87d3a@spixxi> References: <007801d852e8$4bea5b90$e3bf12b0$@ewellic.org> <5206DCC9-2AF1-4607-9BD5-744117ED9833@telia.com> <009f01d8534b$9b6b9040$d242b0c0$@ewellic.org> <00a001d85354$bc495370$34dbfa50$@ewellic.org> <20220419002459.70c87d3a@spixxi> Message-ID: <8B2ABEB3-0E97-46EF-BEFE-8022263DC4F8@telia.com> > On 19 Apr 2022, at 00:24, Marius Spix via Unicode wrote: > > Also note > > > On Mon, 18 Apr 2022 21:10:58 +0200 > Jens Maurer via Unicode wrote: > >> On 18/04/2022 20.47, Doug Ewell via Unicode wrote: >>> Hans Åberg wrote: >> "In both the source and execution basic character sets, >> the value of each character after 0 in the above list >> of decimal digits shall be one greater than the value >> of the previous." >> >> Note the use of the term "basic character set". > > Also note that SHALL be does not mean MUST be. For example, the basic > character set SHALL include certain characters like "[", "]", "{" or > "}", but whenever they do not exist in the current character set, C > allows replacing them with digraphs and trigraphs. C++ also adds > alternative tokens (like "and" or "xor" instead of "&&" or "^"). > Trigraphs are not supported in C++17 anymore, which breaks > downwards-compatibility. > > C also expects that the backslash (\, ASCII codepoint 0x5C) is used for > escape sequences in string literals, but some users of Shift JIS > encoding use the Yen sign (¥), which shares the same codepoint 0x5C. In math, it is common that the symbols vary. For example, in logic, one might use for conjunction "∧" or "&", or implication "→", "⇒", or "⊃". If one defines the logic using a specific set of symbols, it is also valid for other such sets.
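On the octet point just above: in practice implementations do not enlarge the byte. CHAR_BIT stays 8 on mainstream hosts, and code points beyond Latin-1 go either into the wider character types described in the next message or into multibyte char strings such as UTF-8. A minimal C11 sketch, purely illustrative:

#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <uchar.h>                      /* char16_t, char32_t (C11) */

int main(void)
{
    printf("CHAR_BIT = %d\n", CHAR_BIT);    /* 8 on mainstream hosts */

    char32_t cp   = U'\u2070';          /* U+2070 SUPERSCRIPT ZERO */
    const char *u = "\xE2\x81\xB0";     /* the same character in UTF-8 */

    printf("U+%04lX takes %zu UTF-8 bytes of CHAR_BIT bits each\n",
           (unsigned long)cp, strlen(u));
    return 0;
}

wchar_t is deliberately avoided in the sketch for the reason given in the next message: on Windows it is only 16 bits wide and so cannot hold every code point in a single unit.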
From sdowney at gmail.com Tue Apr 19 08:25:58 2022 From: sdowney at gmail.com (Steve Downey) Date: Tue, 19 Apr 2022 09:25:58 -0400 Subject: Basic Latin digits, not everything else (was: RE: How the C programming language bridges the man-machine gap) In-Reply-To: <5E951F84-1B4A-47F9-B14F-9A02B42ABB28@telia.com> References: <007801d852e8$4bea5b90$e3bf12b0$@ewellic.org> <5206DCC9-2AF1-4607-9BD5-744117ED9833@telia.com> <009f01d8534b$9b6b9040$d242b0c0$@ewellic.org> <00a001d85354$bc495370$34dbfa50$@ewellic.org> <7D868FBF-996C-4105-B19A-7F266072E68F@telia.com> <5BBEEEE7-3850-4FB6-81F7-0A52EC2D3569@telia.com> <00a701d8536d$2ca85680$85f90380$@ewellic.org> <5E951F84-1B4A-47F9-B14F-9A02B42ABB28@telia.com> Message-ID: In C and C++, byte is primarily about the smallest unit of addressable memory. The property being described is that the abstract characters in the basic sets have single byte encodings, and can therefore be encoded in strings of consecutive chars. There are also other character types. Wchar_t was introduced to support Unicode, and is specified to be able to encode all code points in a single unit. Unfortunately, it does not, because Microsoft introduced it when 16 bits was sufficient, and that is now baked in into to many places. We have since introduced char8_t, char16_t, and char32_t, which encode UTF 8, 16, 32 respectively, and are at least 8, 16, and 32 bits, but might be larger due to memory architecture requirements. C and C++ also support multi-byte character encodings in char strings. UTF-8 matches the requirements, unsurprisingly, as that was one of the design goals for it. The various CJKV encodings are also supported through multi-byte encodings. The requirement for consecutive encodings for 0-9 strictly applies only to the basic character set, and in the C++23 standard we will be making it clear that's the latin digits that are encoded that way. That's for portability. If the literal encoding placed some other digits in the single byte range, and the latin digits elsewhere, the transliteration between source and results would produce broken code. On Tue, Apr 19, 2022, 03:46 Hans ?berg via Unicode wrote: > > > On 18 Apr 2022, at 23:42, Doug Ewell via Unicode < > unicode at corp.unicode.org> wrote: > > > >> If the values used do not fit into an octet, one must use a larger > >> byte, and such have used been in the past, but not nowadays, I think. > >> But large enough to carry all the Unicode values in a byte might be a > >> possibility. An expert on C might tune in. > > > > Is this related in some way to the topic that was under discussion? > > If one wants to use Unicode values that do not fit into an octet, the C > byte must be enlarged. > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From kent.b.karlsson at bahnhof.se Wed Apr 20 12:48:49 2022 From: kent.b.karlsson at bahnhof.se (Kent Karlsson) Date: Wed, 20 Apr 2022 19:48:49 +0200 Subject: Non-egalitarian treatment of mathematical operators (was Re: Basic Latin digits, not everything else (was: RE: How the C programming language bridges the man-machine gap) In-Reply-To: <8B2ABEB3-0E97-46EF-BEFE-8022263DC4F8@telia.com> References: <007801d852e8$4bea5b90$e3bf12b0$@ewellic.org> <5206DCC9-2AF1-4607-9BD5-744117ED9833@telia.com> <009f01d8534b$9b6b9040$d242b0c0$@ewellic.org> <00a001d85354$bc495370$34dbfa50$@ewellic.org> <20220419002459.70c87d3a@spixxi> <8B2ABEB3-0E97-46EF-BEFE-8022263DC4F8@telia.com> Message-ID: <5323B8CC-1298-49CD-9C5E-0D5305B25FCA@bahnhof.se> > In math, it is common that the symbols vary. For example, in logic, one might use for conjunction ? ? ? or ? & ?, or implication > ? ? ?, ? ? ?, or ? ? ?. If defines the logic using a specific set of symbols, it is also valid for other such sets. That is fine. But? The symbol ? is ?bidi mirrored?: 2283;SUPERSET OF;Sm;0;ON;;;;;Y;;;;; and has an entry in BidiMirroring.txt: 2283; 2282 # SUPERSET OF But neither ? nor ? are bidi mirrored, just because they are ?arrows?. /Kent K From haberg-1 at telia.com Fri Apr 22 07:31:16 2022 From: haberg-1 at telia.com (=?utf-8?Q?Hans_=C3=85berg?=) Date: Fri, 22 Apr 2022 14:31:16 +0200 Subject: Non-egalitarian treatment of mathematical operators In-Reply-To: <5323B8CC-1298-49CD-9C5E-0D5305B25FCA@bahnhof.se> References: <007801d852e8$4bea5b90$e3bf12b0$@ewellic.org> <5206DCC9-2AF1-4607-9BD5-744117ED9833@telia.com> <009f01d8534b$9b6b9040$d242b0c0$@ewellic.org> <00a001d85354$bc495370$34dbfa50$@ewellic.org> <20220419002459.70c87d3a@spixxi> <8B2ABEB3-0E97-46EF-BEFE-8022263DC4F8@telia.com> <5323B8CC-1298-49CD-9C5E-0D5305B25FCA@bahnhof.se> Message-ID: <4213A7C7-4F73-4827-8EC8-3EF3DCE0D375@telia.com> > On 20 Apr 2022, at 19:48, Kent Karlsson wrote: > >> In math, it is common that the symbols vary. For example, in logic, one might use for conjunction ? ? ? or ? & ?, or implication >> ? ? ?, ? ? ?, or ? ? ?. If defines the logic using a specific set of symbols, it is also valid for other such sets. > > That is fine. But? > > The symbol ? is ?bidi mirrored?: > > 2283;SUPERSET OF;Sm;0;ON;;;;;Y;;;;; > > and has an entry in BidiMirroring.txt: > > 2283; 2282 # SUPERSET OF > > But neither ? nor ? are bidi mirrored, just because they are ?arrows?. Perhaps they only use RTL on elementary math. From harjitmoe at outlook.com Tue Apr 26 12:48:26 2022 From: harjitmoe at outlook.com (Harriet Riddle) Date: Tue, 26 Apr 2022 18:48:26 +0100 Subject: Fwd: Errors in APL-ISO-IR-68.TXT In-Reply-To: References: Message-ID: Re-sending this since I inadvertently sent it to the bounces address. -------- Forwarded Message -------- Subject: Errors in APL-ISO-IR-68.TXT Date: Mon, 25 Apr 2022 23:48:58 +0100 From: Harriet Riddle To: Unicode Two of the lines in APL-ISO-IR-68.TXT ( https://www.unicode.org/Public/MAPPINGS/VENDORS/MISC/APL-ISO-IR-68.TXT ) read as follows: 0x5A085F??? 0x233E??? #??? APL FUNCTIONAL SYMBOL CIRCLE JOT and: 0x5F085A??? 0x233E??? #??? APL FUNCTIONAL SYMBOL CIRCLE JOT 0x5A is a left shoe (U+2282 ? SUBSET OF) and 0x5F is a minus sign (mapped in this file to 0x002D HYPHEN-MINUS).? While a ISO-IR-68 backspace (0x08) composition of these two could theoretically compose some form of element-of sign, the APL "epsilon" or element-of sign is an atomic glyph at 0x45 (mapped here to 0x220A ? 
SMALL ELEMENT OF), so these two sequences are probably unused in reality.? In any case, they do not correspond to U+233E ? APL FUNCTIONAL SYMBOL CIRCLE JOT which is, as the name suggests, the composition of the circle (0x25CB ? WHITE CIRCLE, 0x4F in ISO-IR-68) and the jot (0x2218 ? RING OPERATOR, 0x4A in ISO-IR-68). The mappings should probably be amended to: 0x4A084F??? 0x233E??? #??? APL FUNCTIONAL SYMBOL CIRCLE JOT and: 0x4F084A??? 0x233E??? #??? APL FUNCTIONAL SYMBOL CIRCLE JOT ?Har. -------------- next part -------------- An HTML attachment was scrubbed... URL: From 747.neutron at gmail.com Wed Apr 27 02:50:16 2022 From: 747.neutron at gmail.com (=?UTF-8?B?V8OhbmcgWWlmw6Fu?=) Date: Wed, 27 Apr 2022 16:50:16 +0900 Subject: Suspected error in the ISO 15924 standard Message-ID: Is the Unicode contact form a valid channel to report what seems a bug in the ISO 15924 text? Specifically: In the recently revised ISO 15924:2022, page 3, among numeric script codes it says: > 700?799 (unassigned) But the range has been used since 2014 by Duployan, and according to the notice of changes https://unicode.org/iso15924/codechanges.html: > On 2010-07-18 the range 700-799 was assigned to "Shorthands and other notations". So they shouldn't have been marked "unassigned" anymore at the point of revision. From kenwhistler at sonic.net Wed Apr 27 13:19:40 2022 From: kenwhistler at sonic.net (Ken Whistler) Date: Wed, 27 Apr 2022 11:19:40 -0700 Subject: Fwd: Suspected error in the ISO 15924 standard In-Reply-To: <4b84b605-5949-5e20-d911-b7cd1f944c0e@sonic.net> References: <4b84b605-5949-5e20-d911-b7cd1f944c0e@sonic.net> Message-ID: <73592715-32a7-ef5d-e8bc-58d6690dbc82@sonic.net> FYI for the list --Ken -------- Forwarded Message -------- Subject: Re: Suspected error in the ISO 15924 standard Date: Wed, 27 Apr 2022 08:11:51 -0700 From: Ken Whistler To: W?ng Yif?n <747.neutron at gmail.com> Yif?n, On 4/27/2022 12:50 AM, W?ng Yif?n via Unicode wrote: > Is the Unicode contact form a valid channel to report what seems a bug > in the ISO 15924 text? No, not really. The Unicode Consortium maintains the registry for ISO 15924, but TC46 is the relevant ISO committee responsible for the text of the standard. Regarding your observations below, yes, this is a defect in the text of the standard, and was already pointed out to TC46 during the balloting last fall. Those of us who pointed out the problem and who suggested an easy fix ended up overruled by the ISO Central Secretariat, who made (IMO) a bad call on this one. My advice now is just to ignore the issue in the text of the standard. It has no practical effect on the way the registry actually works. --Ken > > Specifically: > > In the recently revised ISO 15924:2022, page 3, among numeric script > codes it says: > >> 700?799 (unassigned) > But the range has been used since 2014 by Duployan, and according to > the notice of changes https://unicode.org/iso15924/codechanges.html: > >> On 2010-07-18 the range 700-799 was assigned to "Shorthands and other >> notations". > So they shouldn't have been marked "unassigned" anymore at the point > of revision. > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jiaminglimjm at gmail.com Wed Apr 27 14:43:26 2022 From: jiaminglimjm at gmail.com (Lim Jia Ming) Date: Thu, 28 Apr 2022 03:43:26 +0800 Subject: Jawi's Hamzah Three-Quarters: Providing more context and point of views Message-ID: <96f73cfd-aa64-2121-ca5a-a25f8684917c@gmail.com> Replying to document: [L2/22-068] Recommendations to UTC #171 April 2022 on Script Proposals (III) Arabic (4c) High Hamza I would like to sincerely thank the SAH and the UTC for having seriously considered the handling of Jawi's Hamzah situation. I am here just to provide more context and points of view that has not yet been given about the usage of this character. The first point I would like to establish is the distinction of Hamzah Tiga Suku (Hamzah Three-Quarters; HTQ) as a complete definition in contrast to descriptions of a slightly higher hamza. Even though in practice it is basically that, a roughly shifted up hamza, it is in principle its own precise and unique concept that cannot be replaced by the description 'High Hamza'. This is taught in schools and is widely recognised by users of the Jawi script.[note1] Although I do recognize that other similar characters in other languages have been merged to the same codepoint, I believe that HTQ deserves to have its Three-Quarter concept annotated in the Unicode standard at the very least, and that if it were to be added as a codepoint to be named something like ARABIC LETTER HAMZA THREE QUARTERS. Something important to note is that there is an ongoing movement to simply stop using Hamzah Three-Quarters (HTQ) in favour of the U+0621 Arabic Hamza. It began in July 2019 during a live radio session citing Unicode limitations as one of its reasons.[ref1] However, since the addition of HTQ as U+0674 High Hamza in the Unicode 14.0 standard, the argument has shifted to "it was never part of the original Arabic script, so why must we use it?". Some of us youngsters reject this argument for it having overly-Arabization sentiments. But to their credit, in practice, HTQ and Arabic letter Hamza really do look similar enough to the point that the latter is an effective replacement for the former (for the promotion of the Jawi script overall), despite it being the rarer form of Hamzah. Also, it should be noted that HTQ is by far the most common form of Hamzah in Jawi, such that its absence in common usage is achingly noticeable, with the other forms of Hamza mostly only found in loanwords or when combining prefixes to roots.[note2] The campaign for the eradication of HTQ has been quite successful, evidenced by the most popular Jawi twitter account @koleksijawi (created in January 2021) posting images with custom Jawi fonts,[ref2] and popular book Nirnama by Hilal Asyraf,[ref3] using only the Arabic Hamza instead of the HTQ.[note3] This also explains why there have been very few attempts to implement U+0674 High Hamza as HTQ even after the Unicode 14.0 standard came out: it is much easier to use Arabic Hamza than to support U+0647 High Hamza. The reason being that the former has been advocated by experts, and that the latter is unlikely to gain language tag support in mainstream platforms and messaging apps (via auto-detection of language?) in any reasonable amount of time, if ever. In conclusion, Arabic letter Hamza will continue to be the default 'fallback' without a counter-movement to preserve Hamzah Three-Quarters. The fact is that the Kazakh version of U+0674 High Hamza looks too different and no one will use it unless it fully works on all their platforms. 
Though we do realize that similar problems exist with the addition of its own codepoint, the difference is that it would have its own rightful name without the need of specifying language. This explains why some of us are so keen on HTQ having its own codepoint, because in name it would recognize the uniqueness of the HTQ and boost an opportunity to promote its usage properly. Therefore, I urge the SAH group to reconsider the consensus built around what to do with HTQ and I hope that it will get its own codepoint. Thank you very much for reading, we patiently await your feedback. Best regards, Jia Ming. Footnotes: [L2/22-068] https://www.unicode.org/L2/L2022/22068-script-adhoc-rept.pdf [ref1] https://www.youtube.com/watch?v=zLPTDng1XcE (in Malay with English subtitles) [ref2] https://twitter.com/koleksijawi/status/1502795097053679619 (one example of many) [ref3] https://twitter.com/No_RuLesz/status/1459435348598030341 (Jawi version, published in 2021) [note1] Hamzah Three-Quarters continues to be in school textbooks and has not been replaced with Arabic letter Hamza (yet?) [note2] Hamzah Three-Quarters is more common by up to several orders of magnitude in terms of occurrences as dictionary entries, though not necessarily so in the wild as that has not yet been properly analysed [note3] Recently, several shops with new signboards in Kelantan appear to be adopting the Arabic Hamza as well, instead of HTQ