From someonesdad1 at gmail.com Thu Jun 3 13:29:42 2021
From: someonesdad1 at gmail.com (Don Peterson)
Date: Thu, 3 Jun 2021 12:29:42 -0600
Subject: Suggestion for superscripts

For about a decade I have been wanting to be able to print Unicode superscript characters in the output of some programs. The most common use case for this is to print the exponents to physical units. An example is kg·m/s², which is a bit easier on the eyes and brain than kg*m/s**2.

Unfortunately, the current version 13 character set doesn't have enough superscript characters to support common scientific usage. From the ucd.nounihan.grouped XML file for version 13, these are the superscript and subscript characters I could find:

Superscripts: ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
Subscripts: ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ???????????????????

Superscript characters are lacking for two fairly common use cases: floating point exponents and fractional exponents. These would be possible with the addition to the superscripts of the two common radix characters '.' and ',' and a solidus character.

However, it seems to me that *the Unicode design should aim at least at putting all printable 7-bit ASCII characters and the upper and lower case Greek characters commonly used in technical work in both the subscript and superscript sets*.

I've never commented on this before because I thought it was obvious and would be fixed in the next Unicode revision. I remember looking at this pretty carefully around version 7 and being surprised by the lack. Being a lazy retired person for the last 20 years meant I didn't do anything about it, which I now regret. :^)

Because of this lack of superscript characters, one of my library functions is forced to produce syntactically-correct but ugly output such as m**0.75·Pa**-1.3·s⁻²·K⁻¹ for a units string input of "m(3/4) Pa(-1.3)/(s2*K)" (with syntax similar to the GNU units program).

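The fallback behaviour described in this message can be sketched in a few lines of Python. This is only an illustration, not Don Peterson's library code: the function names `format_unit` and `format_units` are hypothetical, the middle dot as separator is an assumption, and the sketch relies only on the superscript digits and signs that Unicode already encodes. Any exponent containing a character without a superscript form (such as '.') falls back to the `**` notation.

```python
# Superscript forms that Unicode already provides for digits and signs.
SUPERSCRIPTS = str.maketrans("0123456789+-", "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻")

def format_unit(symbol, exponent):
    """Render one unit factor, using superscripts only when every character has one."""
    if exponent == 1:
        return symbol
    text = str(exponent)
    if all(ch in "0123456789+-" for ch in text):
        return symbol + text.translate(SUPERSCRIPTS)
    # Fractional exponents contain '.', which has no superscript form,
    # so fall back to the ASCII power notation.
    return f"{symbol}**{text}"

def format_units(factors):
    """Join (symbol, exponent) pairs with a middle dot (separator is an assumption)."""
    return "·".join(format_unit(s, e) for s, e in factors)

print(format_units([("m", 0.75), ("Pa", -1.3), ("s", -2), ("K", -1)]))
# prints: m**0.75·Pa**-1.3·s⁻²·K⁻¹
```

The mixed output shows the problem the message complains about: integer exponents render cleanly, while fractional ones cannot.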
From doug at ewellic.org Thu Jun 3 15:26:55 2021
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 3 Jun 2021 14:26:55 -0600
Subject: Suggestion for superscripts
Message-ID: <002e01d758b6$cc0bb350$642319f0$@ewellic.org>

Don Peterson wrote:

> However, it seems to me that the Unicode design should aim at least at putting all printable 7-bit ASCII characters and the upper and lower case Greek characters commonly used in technical work in both the subscript and superscript sets.

https://unicode.org/faq/ligature_digraph.html#Txt5

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

From wjgo_10009 at btinternet.com Thu Jun 3 15:16:05 2021
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Thu, 3 Jun 2021 21:16:05 +0100 (BST)
Subject: Suggestion for superscripts
Message-ID: <2764e325.fa03.179d384cbd8.Webtop.83@btinternet.com>

Interestingly, many years ago Bernard Miller, in his Bytext proposal, suggested what he termed "arrow parentheses". There were eight of them. The glyphs were each either an opening or closing parenthesis character, with either one or two up arrows, or one or two down arrows upon the parenthesis. The single ones opened or closed a sequence of characters that were to be subscript or superscript, the double ones were for limits of definite integrals, summations and so on.

It seemed to me then, and does so now, to be a very good idea.

I am not an expert on Unicode and maybe there is some structural reason why this could not become implemented, even if people wanted it implemented.

Yet I put this forward in the hope that the idea will be considered seriously please.

https://www.unicode.org/mail-arch/unicode-ml/y2002-m01/0477.hl

Here is a link to The Bytext Standard document.

https://web.archive.org/web/20030317065850/http://bytext.org/The_Bytext_Standard.pdf

Arrow parentheses are on pages 33 and 34.

Oh, and notwithstanding the comments about Bytext made in the mailing list thread at the time, please have a look at pages 37 and 38 and observe what was being suggested in 2002. Hmm.

William Overington

Thursday 3 June 2021

From wjgo_10009 at btinternet.com Thu Jun 3 16:36:37 2021
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Thu, 3 Jun 2021 22:36:37 +0100 (BST)
Subject: Suggestion for superscripts
In-Reply-To: <2764e325.fa03.179d384cbd8.Webtop.83@btinternet.com>
References: <2764e325.fa03.179d384cbd8.Webtop.83@btinternet.com>
Message-ID: <30c21ada.fb7f.179d3ce8369.Webtop.83@btinternet.com>

Oops, one of the links does not work. Here is the correct version.

https://www.unicode.org/mail-arch/unicode-ml/y2002-m01/0477.html

William Overington

Thursday 3 June 2021

From someonesdad1 at gmail.com Thu Jun 3 18:14:51 2021
From: someonesdad1 at gmail.com (Don Peterson)
Date: Thu, 3 Jun 2021 17:14:51 -0600
Subject: Suggestion for superscripts
In-Reply-To: <002e01d758b6$cc0bb350$642319f0$@ewellic.org>
References: <002e01d758b6$cc0bb350$642319f0$@ewellic.org>

Alas, that's not a solution for environments like a text editor, bash window, terminal, etc.

On Thu, Jun 3, 2021 at 2:26 PM Doug Ewell wrote:

> Don Peterson wrote:
>
> > However, it seems to me that the Unicode design should aim at least at putting all printable 7-bit ASCII characters and the upper and lower case Greek characters commonly used in technical work in both the subscript and superscript sets.
>
> https://unicode.org/faq/ligature_digraph.html#Txt5
>
> --
> Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

From a.lukyanov at yspu.org Fri Jun 4 00:48:28 2021
From: a.lukyanov at yspu.org (a.lukyanov)
Date: Fri, 04 Jun 2021 08:48:28 +0300
Subject: Suggestion for superscripts
Message-ID: <60B9BEAC.10204@yspu.org>

22:59, Don Peterson wrote:

> Unfortunately, the current version 13 character set doesn't have enough superscript characters to support common scientific usage. From the ucd.nounihan.grouped XML file for version 13, these are the superscript and subscript characters I could find:
>
> Superscripts: ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?

There are more of them:

?????????????????????????

From daniel.buncic at uni-koeln.de Fri Jun 4 02:45:23 2021
From: daniel.buncic at uni-koeln.de (Daniel Buncic)
Date: Fri, 4 Jun 2021 09:45:23 +0200
Subject: Suggestion for superscripts
References: <002e01d758b6$cc0bb350$642319f0$@ewellic.org>

On 03.06.2021 at 22:16, William_J_G Overington via Unicode wrote:
> Interestingly, many years ago Bernard Miller, in his Bytext proposal, suggested what he termed "arrow parentheses".

On 04.06.2021 at 01:14, Don Peterson via Unicode wrote:
> Alas, that's not a solution for environments like a text editor, bash window, terminal, etc.

Well, an environment where real superscripts can for some reason not be implemented could display those "arrow parentheses" as control characters. Something like Pa?(-1.3)·s⁻² would in fact look better and be more unambiguous than Pa**-1.3·s⁻² or Pa^-1.3·s⁻² (where one does not really know whether the s⁻² is part of the exponent or not).

We already have lots of characters that influence the rendering of other characters, e.g. combining diacritics, variation selectors, right-to-left marks, zero-width joiner and non-joiner, etc. These "arrow parentheses"
would be some more such characters, and not very difficult to implement for most applications, which already have a way of displaying superscripts and subscripts in rich text.

On 04.06.2021 at 07:48, a.lukyanov via Unicode wrote:
> There are more of them:
>
> ?????????????????????????

Yes, and even more: capital ???????????????????, Greek ???????, Cyrillic ???, and lots of IPA and other phonetic transcription characters, all of them named "modifier letter".

It somehow seems a waste of codepoints (and of time for all the registration processes) to encode every superscript or subscript character separately that somewhere turns up as relevant instead of just getting away with a couple of control characters and being done with the registration of superscript and subscript characters forever (just as we need to register no more accented characters because we have combining diacritics).

And as to the question of whether these are just glyphs that do not deserve being encoded or actual characters: s?? ? s?2, and, in the very same manner, x to the power of 1.3??, which I currently cannot write without rich-text markup, is not the same as x?1.3??. This is a crucial difference that, in my opinion, cannot be left to rich text environments.

Daniel

--
Prof. Dr. Daniel Bunčić
===================================================
Slavisches Institut der Universität zu Köln
Weyertal 137, D-50931 Köln
Telephone: +49 (0)221 470-3355
Fax: +49 (0)221 470-5001
Office hours: http://ukoeln.de/12FE3
===================================================
Breslauer Straße 54, D-50321 Brühl
Telephone: +49 (0)2232 150 42 80
===================================================
E-Mail: daniel at buncic.de
Homepage: http://daniel.buncic.de/
Threema: https://threema.id/8M375R5K
Skype: danielbuncic
Academia: http://uni-koeln.academia.edu/buncic
===================================================

From haberg-1 at telia.com Fri Jun 4 03:27:54 2021
From: haberg-1 at telia.com (Hans Åberg)
Date: Fri, 4 Jun 2021 10:27:54 +0200
Subject: Suggestion for superscripts
References: <002e01d758b6$cc0bb350$642319f0$@ewellic.org>

> On 4 Jun 2021, at 09:45, Daniel Buncic via Unicode wrote:
>
> On 04.06.2021 at 01:14, Don Peterson via Unicode wrote:
>> Alas, that's not a solution for environments like a text editor, bash window, terminal, etc.
>
> Well, an environment where real superscripts can for some reason not be implemented could display those "arrow parentheses" as control characters. Something like Pa?(-1.3)·s⁻² would in fact look better and be more unambiguous than Pa**-1.3·s⁻² or Pa^-1.3·s⁻² (where one does not really know whether the s⁻² is part of the exponent or not).

For plain text input in a program, I use superscript and subscript parentheses, which look good and are easy to read. For example:

??????(-1.3)?????

I ditched the arrows approach, which I used first, inspired by programs like TeX, finding the rendering less appealing.

From raymond at almanach.co.uk Fri Jun 4 03:30:14 2021
From: raymond at almanach.co.uk (raymond mercier)
Date: Fri, 4 Jun 2021 09:30:14 +0100
Subject: Suggestion for superscripts

Mathematicians use TeX for superscripts.
It can be extended to include Unicode, making XeTeX.

https://www.overleaf.com/learn/latex/Articles/What's_in_a_Name:_A_Guide_to_the_Many_Flavours_of_TeX

Isn't that the way to go?

Raymond Mercier

From doug at ewellic.org Fri Jun 4 13:45:54 2021
From: doug at ewellic.org (Doug Ewell)
Date: Fri, 4 Jun 2021 12:45:54 -0600
Subject: Suggestion for superscripts
References: <002e01d758b6$cc0bb350$642319f0$@ewellic.org>
Message-ID: <000201d75971$d98c3520$8ca49f60$@ewellic.org>

"UnicodeMath," described in https://www.unicode.org/notes/tn28/ , might be an interesting solution. The plain-text representation is fairly readable, and something that is not ad hoc, but actually discussed and agreed upon by a panel of mathematicians. It could be copied and pasted from the terminal or text editor to a UnicodeMath-enabled processor (if you can find one) for real formatting.

I pointed to an FAQ entry because Unicode normally posts those for questions that are, well, frequently asked (like this one) and for which the answer is unlikely to change over time. We know from experience that over time, some such decisions have been modified, or even completely reversed. Most are not.

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

From haberg-1 at telia.com Fri Jun 4 14:49:17 2021
From: haberg-1 at telia.com (Hans Åberg)
Date: Fri, 4 Jun 2021 21:49:17 +0200
Subject: Suggestion for superscripts

> On 4 Jun 2021, at 10:30, raymond mercier via Unicode wrote:
>
> Mathematicians use TeX for superscripts. It can be extended to include Unicode, making XeTeX.
> https://www.overleaf.com/learn/latex/Articles/What's_in_a_Name:_A_Guide_to_the_Many_Flavours_of_TeX
> Isn't that the way to go?

There is a difference between merely wanting an output math rendition and wanting an input that is legible and processable for purposes other than display.
If you have a program processing math, then using TeX variants is hard enough to be a distraction, and the formulas are not copiable, not originally even the Unicode input in original TeX as it gets translated.

ConTeXt [1] is Unicode friendly, uses UTF-8 as default input, aiming at unifying all those older variants. It is available in the TeX Live distribution [2]: Just typeset using 'context ...'.

With a good text only input, a program like ConTeXt can produce a PDF with reasonably copiable formulas. That is, as long as one does not use superscript and subscripts other than what is already available in Unicode.

1. https://wiki.contextgarden.net/Main_Page
2. https://tug.org/texlive/

From eliz at gnu.org Sat Jun 5 03:30:07 2021
From: eliz at gnu.org (Eli Zaretskii)
Date: Sat, 05 Jun 2021 11:30:07 +0300
Subject: Suggestion for superscripts
In-Reply-To: <000201d75971$d98c3520$8ca49f60$@ewellic.org> (message from Doug Ewell via Unicode on Fri, 4 Jun 2021 12:45:54 -0600)
References: <002e01d758b6$cc0bb350$642319f0$@ewellic.org> <000201d75971$d98c3520$8ca49f60$@ewellic.org>
Message-ID: <8335twlnn4.fsf@gnu.org>

> Date: Fri, 4 Jun 2021 12:45:54 -0600
> From: Doug Ewell via Unicode
>
> "UnicodeMath," described in https://www.unicode.org/notes/tn28/ , might be an interesting solution. The plain-text representation is fairly readable, and something that is not ad hoc, but actually discussed and agreed upon by a panel of mathematicians. It could be copied and pasted from the terminal or text editor to a UnicodeMath-enabled processor (if you can find one) for real formatting.

Do you know any editor that implements that TN?

From doug at ewellic.org Sat Jun 5 11:17:09 2021
From: doug at ewellic.org (Doug Ewell)
Date: Sat, 5 Jun 2021 10:17:09 -0600
Subject: Suggestion for superscripts
In-Reply-To: <8335twlnn4.fsf@gnu.org>
References: <002e01d758b6$cc0bb350$642319f0$@ewellic.org> <000201d75971$d98c3520$8ca49f60$@ewellic.org> <8335twlnn4.fsf@gnu.org>
Message-ID: <003301d75a26$3c5bb860$b5132920$@ewellic.org>

Eli Zaretskii wrote:

> Do you know any editor that implements that TN [#28]?

I do not, but then I haven't looked for one. I might try checking with Murray Sargent to see if he knows any.

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

From mark at markdawson.io Sun Jun 6 22:48:22 2021
From: mark at markdawson.io (Mark Dawson)
Date: Sun, 6 Jun 2021 20:48:22 -0700
Subject: Confusables.txt might be too sensitive

Dear Unicode Mailing List,

I am a user of the metamask browser extension (which is a cryptocurrency wallet). My name always gets flagged as a potential scam simply because it contains the small Latin letter "m" (codepoint 006D). Someone contributing to the metamask project had the idea to give a warning message if someone is using a name that might contain suspicious characters. Seems like a good idea to me.

The contributor decided to use TR39's confusables.txt file to flag suspicious characters. On line 3344 of confusables.txt, it lists the small Latin letter "m" (codepoint 006D) as a source character for a confusable. Is this intentional?

No other small Latin letter is flagged as a confusable. (Not even the letter "o"). Would Unicode consider removing the small Latin letter "m" as a source in confusables.txt?

Thanks,

Mark

[image: image.png]

--
Mark Dawson
mark at markdawson.io

From asmusf at ix.netcom.com Mon Jun 7 00:43:37 2021
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Sun, 6 Jun 2021 22:43:37 -0700
Subject: Confusables.txt might be too sensitive
Message-ID: <8d2f2ee2-adf4-39d4-4b3d-03ad6d0a1571@ix.netcom.com>

An HTML attachment was scrubbed...

From marius.spix at web.de Mon Jun 7 07:16:25 2021
From: marius.spix at web.de (Marius Spix)
Date: Mon, 7 Jun 2021 14:16:25 +0200
Subject: Confusables.txt might be too sensitive
Message-ID: <20210607141625.533a3328@spixxi>

I guess the problem is that m looks similar to rn. For example, the domain "pomhub dot com" is easily confusable with a well-known website. But that also would work the other way around, e.g. "rnicrosoft dot com". Italic m also looks identical to italic т (Cyrillic t).

But I agree that m should not be considered to be a mock letter at all, especially in cases where only identifiers [a-zA-Z_] are allowed for user names. But this will be a task of the individual implementation, not for Unicode.

From sosipiuk at gmail.com Mon Jun 7 12:05:48 2021
From: sosipiuk at gmail.com (Sławomir Osipiuk)
Date: Mon, 7 Jun 2021 13:05:48 -0400
Subject: Confusables.txt might be too sensitive

On Mon, Jun 7, 2021 at 1:16 AM Mark Dawson via Unicode <unicode at corp.unicode.org> wrote:

> No other small Latin letter is flagged as a confusable. (Not even the letter "o").

All the other latin letters ARE listed as confusable. I'm curious how the implementation decides which ones to flag. The only thing unique about "m", versus the rest of the latin alphabet, seems to be that it's confusable with a two-character sequence. But surely the implementation doesn't restrict itself to only such cases, so what is happening here?

Why is "m" causing a problem, but "o" is not, when both are confusable with other characters? Does it have to do with the input being restricted to ASCII (or some other limited set) and so other characters are removed as possibilities, leaving the latin set as non-confusable (aside from "m")?

Sławomir Osipiuk

From doug at ewellic.org Mon Jun 7 13:00:28 2021
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 7 Jun 2021 12:00:28 -0600
Subject: Confusables.txt might be too sensitive
Message-ID: <000001d75bc7$0003d670$000b8350$@ewellic.org>

Sławomir Osipiuk wrote:

>> No other small Latin letter is flagged as a confusable. (Not even the letter "o").
>
> All the other latin letters ARE listed as confusable.

But not in confusables.txt. It's entirely likely, as Mark Dawson surmised, that the MetaMask people simply grabbed that one file and used it as their entire security strategy.

It would hardly be the first time that someone took a small component of the Unicode (or other) standard and used it as their implementation, instead of actually reading and understanding the standard. Look what happens when someone browses the Unicode code charts and declares that language X isn't fully supported because the contextual forms aren't there. (The same happens in BCP 47 when people look only at the Language Subtag Registry and don't read the document.)

> I'm curious how the implementation decides which ones to flag. The only thing unique about "m", versus the rest of the latin alphabet, seems to be that it's confusable with a two-character sequence. But surely the implementation doesn't restrict itself to only such cases, so what is happening here?

Actually, that is probably exactly what is happening: the implementation is taking confusables.txt out of context and using it as a sledgehammer.

> Why is "m" causing a problem, but "o" is not, when both are confusable with other characters? Does it have to do with the input being restricted to ASCII (or some other limited set) and so other characters are removed as possibilities, leaving the latin set as non-confusable (aside from "m")?

I think an interesting experiment would be to try other types of confusable scenarios, such as an ENS name wholly or partially in another script such as Greek or Cyrillic, to see if MetaMask allows those while flagging 'm'.

In any case, if MetaMask flags all ENS names that contain an 'm' (or '1' or 'I'), then a whole lot of users besides Mark are sure to run into the same problem. Gosh, even the example name at ens.domains ("Yourname.eth") would generate the warning.

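The experiment suggested above, and the contrast with flagging every confusables.txt source, can be sketched in Python. This is a rough illustration and not MetaMask's code: the function names are hypothetical, and because Python's standard library does not expose the Unicode Script property, the script is guessed from the first word of each character name, which is an approximation only. A real implementation would presumably follow the mixed-script detection described in UTS #39.

```python
import unicodedata

# Invisible format characters sometimes inserted in spoofed names: ZWSP, ZWNJ, ZWJ.
INVISIBLE = {"\u200b", "\u200c", "\u200d"}

def rough_script(ch):
    """Guess a letter's script from the first word of its Unicode name.

    Approximation only: proper script data should be used in real code.
    """
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def looks_suspicious(name):
    """Flag names that mix scripts or contain invisible characters."""
    letters = [ch for ch in name if unicodedata.category(ch).startswith("L")]
    scripts = {rough_script(ch) for ch in letters}
    return len(scripts) > 1 or any(ch in INVISIBLE for ch in name)

print(looks_suspicious("math.eth"))        # False: all-Latin name containing "m"
print(looks_suspicious("m\u043eth.eth"))   # True: Cyrillic о in place of Latin o
print(looks_suspicious("ma\u200cth.eth"))  # True: hidden ZWNJ
```

Under this kind of check a plain Latin name such as "Yourname.eth" passes, while the mixed-script and invisible-character cases are still caught.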
--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

From asmusf at ix.netcom.com Mon Jun 7 13:11:51 2021
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Mon, 7 Jun 2021 11:11:51 -0700
Subject: Confusables.txt might be too sensitive
Message-ID: <4bc565ea-000a-8225-ea94-d029393d09a7@ix.netcom.com>

An HTML attachment was scrubbed...

From sosipiuk at gmail.com Mon Jun 7 13:22:19 2021
From: sosipiuk at gmail.com (Sławomir Osipiuk)
Date: Mon, 7 Jun 2021 14:22:19 -0400
Subject: Confusables.txt might be too sensitive
In-Reply-To: <000001d75bc7$0003d670$000b8350$@ewellic.org>
References: <000001d75bc7$0003d670$000b8350$@ewellic.org>

On Mon, Jun 7, 2021 at 2:02 PM Doug Ewell via Unicode wrote:

> > All the other latin letters ARE listed as confusable.
>
> But not in confusables.txt. It's entirely likely, as Mark Dawson surmised, that the MetaMask people simply grabbed that one file and used it as their entire security strategy.

D'oh! I was looking at
http://www.unicode.org/Public/security/latest/confusablesSummary.txt
(linked in Mark Dawson's original message) rather than at
http://www.unicode.org/Public/security/latest/confusables.txt

It's definitely more understandable what is happening with the latter.

From doug at ewellic.org Mon Jun 7 14:41:15 2021
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 7 Jun 2021 13:41:15 -0600
Subject: Confusables.txt might be too sensitive
References: <000001d75bc7$0003d670$000b8350$@ewellic.org>
Message-ID: <000501d75bd5$15c4f530$414edf90$@ewellic.org>

Upon reading the MetaMask PRs and problem statement more closely, it seems they were mainly focused on mixed-script spoofing (e.g. using Greek 'ο' or Cyrillic 'о' in place of Latin 'o') and randomly inserted, invisible control characters like ZWNJ.

The author of the original PR (9129, not 9187) seemed to understand the underlying problem, and even suggested an existing library, but instead of using this presumably nuanced and tested solution, someone else applied the confusables.txt sledgehammer. That contributor even commented that his solution "might even be a little too strict because it warns on 'math.eth' being so similar to 'rnath.eth'," but nobody else complained, and so here we are.

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

From mark at markdawson.io Tue Jun 8 13:51:33 2021
From: mark at markdawson.io (Mark Dawson)
Date: Tue, 8 Jun 2021 11:51:33 -0700
Subject: Confusables.txt might be too sensitive
In-Reply-To: <000501d75bd5$15c4f530$414edf90$@ewellic.org>
References: <000001d75bc7$0003d670$000b8350$@ewellic.org> <000501d75bd5$15c4f530$414edf90$@ewellic.org>

Thanks everyone for the feedback on this, and thank you Doug for looking into the MetaMask implementation. It sounds like this is a problem that I should talk to the MetaMask maintainers about.

--
Mark Dawson
mark at markdawson.io

From wjgo_10009 at btinternet.com Tue Jun 8 13:06:46 2021
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Tue, 8 Jun 2021 19:06:46 +0100 (BST)
Subject: The rules about tag sequences
Message-ID: <1ccd60c6.19bd0.179ecce2f3a.Webtop.100@btinternet.com>

In seeking to draft a reply to the comments in the

https://www.unicode.org/L2/L2021/21099-qid-feedback.pdf

document, I have wondered about the following questions. Can we discuss them please?

Is it within the rules for tag sequences to have as a base a sequence of a character followed by one or more combining characters before the sequence of tag characters starts? The idea is that the particular sequence chosen (a base character followed by one or more combining characters) would be one very unlikely to be used otherwise, thus avoiding confusion.

Can any entity, whether a company or an individual, publish a tag sequence and it be valid, or does it need approval by the Unicode Technical Committee?

William Overington

Tuesday 8 June 2021

From wjgo_10009 at btinternet.com Tue Jun 15 09:54:29 2021
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Tue, 15 Jun 2021 15:54:29 +0100 (BST)
Subject: An artistic setting of a poem that uses language-independent glyphs
Message-ID: <569c8002.25034.17a102aaa99.Webtop.86@btinternet.com>

Recently I started the following thread in the "Share your work" section of the Serif Affinity forums.

https://forum.affinity.serif.com/index.php?/topic/143812-informal-design-workshop-idea/

Although not originally intended, I was wondering what design to devise for the workshop thread, and in the event a design for a greetings card with an artistic setting of a poem written using language-independent glyphs was produced.
Thus the finished artwork expresses a poem in glyphs of what is possibly in effect a type of pivot language.

Some readers might enjoy reading the thread, though the poem using language-independent glyphs is not introduced until the second page of the thread.

William Overington

Tuesday 15 June 2021

From public at khwilliamson.com Thu Jun 17 09:28:39 2021
From: public at khwilliamson.com (Karl Williamson)
Date: Thu, 17 Jun 2021 08:28:39 -0600
Subject: Broken links to Unicode pages in current UCD files
Message-ID: <8df1f78b-9070-a6ac-ef68-7ed9bd277887@khwilliamson.com>

Someone discovered and reported various broken links. An example is

# BidiMirroring-13.0.0.txt
# Date: 2019-09-09, 19:34:00 GMT [KW, LI, RP]
# © 2019 Unicode®, Inc.
# For terms of use, see http://www.unicode.org/terms_of_use.html

On line number 43, there is a link to
http://www.unicode.org/unicode/reports/tr9/

Following that link leads to a 404. A link that works is
http://www.unicode.org/reports/tr9/

Presumably the link in the file used to work. One of the first rules of web design is to never ever break a published link. It's fine to reorganize your site; just be sure that the original links are available as synonyms to the modern version.

In this case, the current UCD contains broken links.

From wjgo_10009 at btinternet.com Thu Jun 17 08:29:00 2021
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Thu, 17 Jun 2021 14:29:00 +0100 (BST)
Subject: Some language-independent glyphs for museum shops
Message-ID: <1e46d904.2a500.17a1a291e78.Webtop.100@btinternet.com>

Many museums and art galleries have online shops these days, and some will send items internationally, the customer paying for the items online by card.

Yet there is the language barrier.

In the 1970s, before the web, before cards that could be used internationally, I purchased some colour slides of paintings from the Uffizi in Florence and from the Louvre in Paris by mail order by writing letters in Italian and French respectively. I do not know what the quality of my writing in those languages was, yet I did communicate effectively, as I received replies in Italian and French respectively and I received the colour slides.

So what if one has symbols, precise emoji, language-independent glyphs, for the fields needed to make a card payment?

These symbols could be used either stand-alone or together with text in the language of the country in which the museum is located.

Some museums have guides and signage in several languages. Yet not in every language.

So language-independent glyphs could be a mini-pivot language to assist communication through the language barrier for a card purchase transaction.

I have produced ten glyph designs. Maybe a few more are needed, maybe the colour scheme needs changing, please discuss.

Presently I have, for each of the ten language-independent glyphs, a colourful version and a graceful fallback monochrome version.

There is an experimental colour font available, free to use.

Should such glyphs be encoded in regular Unicode, or as if they were ligatures of a sequence of characters, or as QID emoji?

There is a thread where the font is being applied in examples.

https://forum.affinity.serif.com/index.php?/topic/144236-some-language-independent-glyphs-for-museum-shops/

William Overington

Thursday 17 June 2021

From asmusf at ix.netcom.com Thu Jun 17 11:36:03 2021
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Thu, 17 Jun 2021 09:36:03 -0700
Subject: Broken links to Unicode pages in current UCD files
In-Reply-To: <8df1f78b-9070-a6ac-ef68-7ed9bd277887@khwilliamson.com>
References: <8df1f78b-9070-a6ac-ef68-7ed9bd277887@khwilliamson.com>

An HTML attachment was scrubbed...

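For the broken-link report in this thread, a small Python sketch shows how data files could be scanned for the legacy path and rewritten to the working form. Only the tr9 mapping is confirmed in the report; treating every `/unicode/reports/trN/` link the same way is an assumption, and the function name `modernize` is hypothetical.

```python
import re

# Legacy path form reported as broken. Applying the same rewrite to every
# trN report is an assumption; only tr9 is confirmed in the thread.
LEGACY = re.compile(r"http://www\.unicode\.org/unicode/reports/(tr\d+/?)")

def modernize(line):
    """Rewrite legacy report links to the form that currently resolves."""
    return LEGACY.sub(r"http://www.unicode.org/reports/\1", line)

print(modernize("# see http://www.unicode.org/unicode/reports/tr9/"))
# prints: # see http://www.unicode.org/reports/tr9/
```

A server-side redirect from the old path would address the same problem without touching published files, which is closer to what the report asks for.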
URL:

From richard.wordingham at ntlworld.com Sun Jun 20 23:43:37 2021
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 21 Jun 2021 05:43:37 +0100
Subject: Question on combining character order
In-Reply-To: <457959e2-7564-d1ea-9cbe-324e8628954e@ix.netcom.com>
References: <457959e2-7564-d1ea-9cbe-324e8628954e@ix.netcom.com>
Message-ID: <20210621054337.0df9d97a@JRWUBU2>

On Sun, 20 Jun 2021 02:10:34 -0700
Asmus Freytag via Unicode wrote:

> The short answer is "no".
>
> A longer answer is that typing order, display order and
> phonetic/semantic order do not have to agree.
>
> On 6/20/2021 1:54 AM, Phake Nick via Unicode wrote:
> > Currently, in Unicode, combining characters like U+20DD or U+20DE
> > are to be placed behind the main character to be combined.
> >
> > But sometimes, linguistically, it makes sense for a combining mark
> > to come in front.
> >
> > For example, the famous instant ramen brand, Maruchan, was
> > originally called "Maruto" in Japanese, as a spoken form of its
> > initial trade mark with the Japanese hiragana character "To"
> > (standing for the company's official name, Toyo Suisan) being
> > placed inside a circle ("Maru"). To replicate the sign using modern
> > Unicode, users would need to first input the Japanese hiragana
> > character "To", then input the combining circle mark of U+20DD
> > being the maru, and would result in reverse linguistic order
> > compared to how such marks are being pronounced in Japanese.
> >
> > Another example, in Cantonese, it is customary to create new
> > Chinese characters to express a Cantonese phoneme that doesn't have
> > an obvious connection with commonly known Chinese characters, by
> > attaching the component of a mouth (U+2F1D) onto other
> > similar-sounding Chinese characters with different meaning. For
> > example, Unicode character U+975A, meaning "beautiful", can have
> > the component of mouth attached to it, and become U+210C1, meaning
> > beautiful.
> > Although in this particular example, the modified character has
> > also been encoded, on some platforms it might not be supported by
> > input method modifier or is otherwise difficult to enter, and thus
> > people would input the deconstructed form. But due to the lack of a
> > small mouth component for combination, and combination of
> > characters through Ideographic Description Sequence also not being
> > supported on most platforms, it is common for people to use Latin
> > small letter o, U+006F, to represent the component. As the
> > component is customarily written on the left side of Chinese
> > characters, and it is customary for Chinese characters to be
> > written from left to right, it would be usual for the additional
> > component to be keyed in before entering the character itself. As
> > such, if a combining character featuring the component of mouth is
> > to be introduced, it would make the most sense if the combining
> > mouth component is to be typed in before the character to be
> > modified, instead of the other way round.
> >
> > Is there a mechanism in Unicode that can support such type of
> > combining characters?

(Resending)

Yes, in various degrees.

1. Coeng characters (i.e. most invisible stackers) convert the
input-logically following consonant into a consonant character.
Category Mn.

2. 'Buoyant' consonants that sit on the hanging baseline above the rest
of the consonant stack, such as U+0D4E MALAYALAM LETTER DOT REPH
(category Lo), and ...

3. ... and eastern U+1A58 TAI THAM SIGN MAI KANG LAI (category Mn).

There are problems with most of these:

1. Coeng characters get given a non-zero canonical combining class,
which can cause them to be separated from combining marks applied to
the previous base character. That happens in Tai Tham, and would
happen in Kharoshthi if nuktas were applied to the initial characters
of conjoined characters.

3.
The properties of U+1A58 are based on western usage, where it
functions as a final consonant, so grapheme clustering unites it with
the previous consonant. Manipulating an isolated orthographic syllable
starting with it is awkward at best.

What you want are formally format characters (Cf), like the IDS
controls, but with a mandatory graphic effect, more like the control
characters for Egyptian hieroglyphs. However, for Chinese character
composition, what you want might be better served by an Lo with
appropriate clustering and line-breaking operations. It's time to move
on to 'every script is complex'.

Richard.
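
The general categories and canonical combining classes mentioned above
can be checked directly. What follows is only a minimal sketch, not
part of the original message, using Python's standard unicodedata
module. The helper name describe and the choice of U+17D2 KHMER SIGN
COENG and U+1A60 TAI THAM SIGN SAKOT as representative stackers are
assumptions made for illustration, and the Tai Tham sequence at the end
is contrived purely to show canonical reordering, not taken from real
text.

```python
import unicodedata


def describe(ch):
    """Return (name, general category, canonical combining class)."""
    return (
        unicodedata.name(ch, "<unnamed>"),
        unicodedata.category(ch),
        unicodedata.combining(ch),
    )


# U+3068, U+20DD, U+0D4E and U+1A58 are named in the thread; U+17D2 and
# U+1A60 are illustrative stackers chosen for this sketch (assumption).
SAMPLES = ["\u3068", "\u20DD", "\u17D2", "\u1A60", "\u0D4E", "\u1A58"]

for ch in SAMPLES:
    name, category, ccc = describe(ch)
    print(f"U+{ord(ch):04X}  {category}  ccc={ccc:<3}  {name}")

# Hiragana TO followed by the enclosing circle: the mark is stored after
# its base, and normalization leaves that order untouched.
maruto = "\u3068\u20DD"
assert unicodedata.normalize("NFC", maruto) == maruto

# Contrived sequence: consonant, tone mark (ccc 230), stacker (ccc 9),
# consonant. Canonical ordering sorts the two marks by combining class,
# so the stacker ends up in front of the tone mark.
typed = "\u1A20\u1A75\u1A60\u1A20"
normalized = unicodedata.normalize("NFD", typed)
print(" ".join(f"U+{ord(c):04X}" for c in normalized))
```

The last line prints the code points after normalization, which makes
it easy to see that a stacker with a non-zero combining class does not
necessarily stay where it was typed.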