From kenwhistler at sonic.net Mon Mar 1 10:00:03 2021
From: kenwhistler at sonic.net (Ken Whistler)
Date: Mon, 1 Mar 2021 08:00:03 -0800
Subject: Unicode 14.0 Alpha Review
In-Reply-To: <79bb4f68-c841-5cd1-129b-0d2a2489d581@code2001.com>
References: <79bb4f68-c841-5cd1-129b-0d2a2489d581@code2001.com>
Message-ID: <49a211cc-05d7-3ec4-18a3-b4ad0b97e1dd@sonic.net>

James,

See the discussion under Unihan-UTC166-R07 in:

https://www.unicode.org/L2/L2021/21015-cjk-unihan-group-utc166.pdf

--Ken

On 2/26/2021 10:11 PM, James Kass via Unicode wrote:
>
> https://www.unicode.org/charts/PDF/Unicode-14.0/U140-2A700.pdf
>
> Is the Unicode 14.0 provisional CJK character slated for U+2B736
> a duplicate of existing character U+3B3F?
>
> Note that Chinese radical # 130 (肉) often takes the shape of Chinese
> radical # 74 (月).
>
>

From jameskass at code2001.com Tue Mar 2 03:55:17 2021
From: jameskass at code2001.com (James Kass)
Date: Tue, 2 Mar 2021 09:55:17 +0000
Subject: Unicode 14.0 Alpha Review
In-Reply-To: <49a211cc-05d7-3ec4-18a3-b4ad0b97e1dd@sonic.net>
References: <79bb4f68-c841-5cd1-129b-0d2a2489d581@code2001.com> <49a211cc-05d7-3ec4-18a3-b4ad0b97e1dd@sonic.net>
Message-ID: <335dbf86-2e6d-9127-8bf8-e24b89a1774d@code2001.com>

On 2021-03-01 4:00 PM, Ken Whistler via Unicode wrote:
> See the discussion under Unihan-UTC166-R07 in:
>
> https://www.unicode.org/L2/L2021/21015-cjk-unihan-group-utc166.pdf

Thank you for the link. That's quite an informative document.

From copypaste at kittens.ph Sat Mar 13 22:57:56 2021
From: copypaste at kittens.ph (Fredrick Brennan)
Date: Sat, 13 Mar 2021 23:57:56 -0500
Subject: HTML entities
Message-ID: <3163040.UKa7oIsXr7@laptop>

Is the list of HTML entities fixed, or is there a mechanism through which new
ones can be requested? I think the right group to contact is the W3C, or is it
IETF?

For example, I think "&sup2;&sup3;&sup4;" should expand to "²³⁴" and not to
"²³&sup4;" as it currently does.
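
The behaviour described in the example above can be reproduced outside a browser. The following is a minimal sketch, assuming that Python's standard `html` module (which carries its own copy of the HTML5 named character reference table) is a fair stand-in for what an HTML user agent does; that equivalence is an assumption, not something stated in this thread.

```python
import html
import html.entities

# Named references: sup2 and sup3 are in the HTML5 table, sup4 is not,
# so the third reference is left untouched.
expanded = html.unescape("&sup2;&sup3;&sup4;")
print(expanded)  # ²³&sup4;

# A numeric character reference for U+2074 SUPERSCRIPT FOUR needs no table entry.
numeric = html.unescape("&sup2;&sup3;&#x2074;")
print(numeric)  # ²³⁴

# Direct look-up in the table that the html module uses.
has_sup3 = "sup3;" in html.entities.html5
has_sup4 = "sup4;" in html.entities.html5
print(has_sup3, has_sup4)  # True False
```

The numeric form works because it encodes the code point directly, whereas a named form only works if the name is present in the parser's table.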
Best,
Fred Brennan

From mandel59 at gmail.com Sat Mar 13 23:30:51 2021
From: mandel59 at gmail.com (Ryusei)
Date: Sun, 14 Mar 2021 14:30:51 +0900
Subject: HTML entities
In-Reply-To: <3163040.UKa7oIsXr7@laptop>
References: <3163040.UKa7oIsXr7@laptop>
Message-ID: <85D51762-34AC-4528-B08D-41BFC3C6010A@gmail.com>

The Living Standard of HTML is maintained by WHATWG. HTML specs by W3C are superseded by it.

The list of named character references is specified here: >.

I found a related issue on GitHub: >.

Regards,
Ryusei

> On 2021/03/14 13:57, Fredrick Brennan via Unicode wrote:
>
> Is the list of HTML entities fixed, or is there a mechanism through which new
> ones can be requested? I think the right group to contact is the W3C, or is it
> IETF?
>
> For example, I think "&sup2;&sup3;&sup4;" should expand to "²³⁴" and not to
> "²³&sup4;" as it currently does.
>
> Best,
> Fred Brennan
>
>

From marius.spix at web.de Mon Mar 15 10:09:38 2021
From: marius.spix at web.de (Marius Spix)
Date: Mon, 15 Mar 2021 16:09:38 +0100
Subject: Aw: HTML entities
In-Reply-To: <3163040.UKa7oIsXr7@laptop>
References: <3163040.UKa7oIsXr7@laptop>
Message-ID:

An HTML attachment was scrubbed...

From jukkakk at gmail.com Mon Mar 15 12:35:30 2021
From: jukkakk at gmail.com (Jukka K. Korpela)
Date: Mon, 15 Mar 2021 19:35:30 +0200
Subject: HTML entities
In-Reply-To:
References: <3163040.UKa7oIsXr7@laptop>
Message-ID:

Marius Spix via Unicode (unicode at unicode.org) wrote:

>
> ², ³ and ⁴ are for compatibility reasons in plaintext applications. If you
> are already using HTML, you should prefer to use <sup>2</sup>, <sup>3</sup>
> and <sup>4</sup>.
>

This is a different issue, about the use of superscript characters, not
about named entity references for them.

The document “Unicode in XML and other Markup Languages”
https://www.w3.org/TR/unicode-xml/#Superscripts
suggests the use of <sup> markup for superscripting in mathematical texts.
but then says: “However, when super and sub-scripts are to reflect semantic
distinctions, it is easier to work with these meanings encoded in text
rather than markup, for example, in phonetic or phonemic transcription.
Otherwise, they would require markup in the middle of words, and they may
also be inadvertently changed to normal style text, when exporting to plain
text.”

On the practical side, using superscript digits almost always produces
better typographic quality than the use of markup like <sup>, which is
generally implemented in a simplistic manner (some vertical alignment and
reduced font size), often resulting in uneven line spacing unless you take
some precautions. This applies to text processing software, web browsers,
etc.; typesetting tools for mathematical texts are a different issue, and
the problem hardly arises there.

You can see this if you compare <sup>2</sup> and ² in any browser.

Yucca, http://jkorpela.fi

From textexin at xencraft.com Mon Mar 15 13:25:26 2021
From: textexin at xencraft.com (Tex)
Date: Mon, 15 Mar 2021 11:25:26 -0700
Subject: HTML entities
In-Reply-To:
References: <3163040.UKa7oIsXr7@laptop>
Message-ID: <001201d719c8$924b9720$b6e2c560$@xencraft.com>

Hi Jukka,

However, you are quoting a doc that has been withdrawn.

See the note at the top of the document in the “status of this document” section:

“This document has been withdrawn

Many of the materials in this document are stale and out of date; the W3C is maintaining this version solely as a historical reference. This document was originally produced as a joint publication between the W3C and the Unicode Consortium. In 2016, Unicode withdrew publication as a Unicode Technical Report.”

(Frankly I thought some of its recommendations were questionable from the get-go.)
If there are issues with how <sup> is implemented and renders, they should be fixed rather than adding what would be many stylized named entities, which would require the same code fixes.

regards
tex

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Jukka K. Korpela via Unicode
Sent: Monday, March 15, 2021 10:36 AM
Cc: via Unicode
Subject: Re: HTML entities

Marius Spix via Unicode (unicode at unicode.org) wrote:

², ³ and ⁴ are for compatibility reasons in plaintext applications. If you are already using HTML, you should prefer to use <sup>2</sup>, <sup>3</sup> and <sup>4</sup>.

This is a different issue, about the use of superscript characters, not about named entity references for them.

The document “Unicode in XML and other Markup Languages”
https://www.w3.org/TR/unicode-xml/#Superscripts
suggests the use of <sup> markup for superscripting in mathematical texts, but then says: “However, when super and sub-scripts are to reflect semantic distinctions, it is easier to work with these meanings encoded in text rather than markup, for example, in phonetic or phonemic transcription. Otherwise, they would require markup in the middle of words, and they may also be inadvertently changed to normal style text, when exporting to plain text.”

On the practical side, using superscript digits almost always produces better typographic quality than the use of markup like <sup>, which is generally implemented in a simplistic manner (some vertical alignment and reduced font size), often resulting in uneven line spacing unless you take some precautions. This applies to text processing software, web browsers, etc.; typesetting tools for mathematical texts are a different issue, and the problem hardly arises there.

You can see this if you compare <sup>2</sup> and ² in any browser.

Yucca, http://jkorpela.fi

From jameskass at code2001.com Wed Mar 17 20:46:35 2021
From: jameskass at code2001.com (James Kass)
Date: Thu, 18 Mar 2021 01:46:35 +0000
Subject: HTML entities
In-Reply-To: <001201d719c8$924b9720$b6e2c560$@xencraft.com>
References: <3163040.UKa7oIsXr7@laptop> <001201d719c8$924b9720$b6e2c560$@xencraft.com>
Message-ID:

On 2021-03-15 6:25 PM, Tex via Unicode wrote:
> “This document has been withdrawn
>
> Many of the materials in this document are stale and out of date; the W3C is maintaining this version solely as a historical reference. This document was originally produced as a joint publication between the W3C and the Unicode Consortium. In 2016, Unicode withdrew publication as a Unicode Technical Report.”
>

The document may have been withdrawn, but the portion quoted from it
remains valid.

> If there are issues with how <sup> is implemented and renders, they should be fixed rather than adding what would be many stylized named entities, which would require the same code fixes.

That is a sensible approach.

Best regards,

James Kass

From jukkakk at gmail.com Thu Mar 18 03:20:05 2021
From: jukkakk at gmail.com (Jukka K. Korpela)
Date: Thu, 18 Mar 2021 10:20:05 +0200
Subject: HTML entities
In-Reply-To: <001201d719c8$924b9720$b6e2c560$@xencraft.com>
References: <3163040.UKa7oIsXr7@laptop> <001201d719c8$924b9720$b6e2c560$@xencraft.com>
Message-ID:

Tex (textexin at xencraft.com) wrote:

>
> However, you are quoting a doc that has been withdrawn.
>

It's a pity that this well-written and useful document was withdrawn, for
reasons I don't understand.

Yet, the statement I quoted is valid and relevant on its own. To take an
even more understandable example, the use of 10<sup>4</sup> versus 10⁴
means that when an HTML document is saved as plain text, or copied and
pasted to a plain text environment, or rendered in Braille or speech, the
expression denoting the number 10,000 suddenly becomes 104.
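
The plain-text loss described above (superscript markup around the exponent versus the character U+2074 SUPERSCRIPT FOUR) can be illustrated with a small sketch. The `to_plain_text` helper is hypothetical and deliberately crude; it only assumes that "saving as plain text" amounts to dropping the tags, which real tools may do in more elaborate ways.

```python
import re


def to_plain_text(html_fragment: str) -> str:
    """Hypothetical, crude tag stripper: remove anything that looks like <...>."""
    return re.sub(r"<[^>]*>", "", html_fragment)


# Exponent expressed with markup: the superscripting is lost with the tags.
with_markup = to_plain_text("10<sup>4</sup>")
print(with_markup)  # 104

# Exponent expressed with U+2074 SUPERSCRIPT FOUR: the character survives.
with_character = to_plain_text("10\u2074")
print(with_character)  # 10⁴
```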
> If there are issues with how <sup> is implemented and renders, they should
> be fixed rather than adding what would be many stylized named entities,
> which would require the same code fixes.
>

The <sup> and <sub> elements have been in HTML well over 20 years, with no
progress in implementations. I can imagine some of the reasons for this.

But this is completely independent of the issue of named character
reference. It does not affect the rendering in the least whether SUPERSCRIPT
FOUR appears in HTML source as such (as character data), as numeric
reference &#x2074;, or as named reference &sup4;. The only differences
between the latter two are that 1) the named reference is more mnemonic and
therefore easier to write and 2) an HTML user agent needs to have an entry
for it in its mapping table from names to numbers (so the implementation is
extremely trivial, and the question would be how fast it would be made and
how fast the installed browser base would be updated).

Personally, I don't see a problem in writing &#x2074; (and &#x2075; etc.)
after I have learned to remember this. But the point is that when people
complain that &sup4; does not work, then the answer should not be “use
<sup>4</sup>”. It's something very different, and there are ways to use
SUPERSCRIPT FOUR even in circumstances where you cannot type it directly or
as a named reference.

Yucca

From public at khwilliamson.com Thu Mar 18 21:12:59 2021
From: public at khwilliamson.com (Karl Williamson)
Date: Thu, 18 Mar 2021 20:12:59 -0600
Subject: Re: Unicode is universal, so how come that universality doesn’t apply to digits?
In-Reply-To:
References: <20201216195746.37c2237b@JRWUBU2>
Message-ID:

On 12/16/20 2:32 PM, Bill Poser via Unicode wrote:
> It seems to me that, in spite of the superficial similarity of the way
> numbers are written in many languages, this is NOT, in general, a matter
> of encoding conversion or even transliteration but rather one of
> translation and therefore not part of Unicode for the same reason that
> Unicode does not handle the translation of text from, say, Japanese to
> English.
>
> There is, actually, a library, which I have written, that handles
> conversions between Unicode strings and integers for most systems of
> writing numbers. (I have yet to update it to handle some of the more
> recently encoded systems.) It is a C library which also has a TCL binding:
>
> http://billposer.org/Software/libuninum.html
>
>
> It handles a number of systems that require algorithms rather different
> from that of atoi/strtol.
>
> Bill
>

Another tool option is that recent versions of Perl come with the
function num() in the Unicode::UCD module.

If its input is a string consisting of a single character, and that
character has a defined numeric value, it will return that value,
converted to floating point if necessary; it returns undef for
characters without a numeric value.

If called with a string consisting entirely of characters with category
Nd, all from the same block of 10 consecutive code points, it will
return the value they represent, assuming left-to-right positional
notation, so that the right-most digit is the one's position, next is
the 10's, etc. It returns undef for any other string longer than one
character.
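
For readers without Perl at hand, the two rules described above can be approximated in Python. This is only a sketch under stated assumptions: `num_like` is a hypothetical name, it is built on Python's `unicodedata` module rather than on Unicode::UCD, and it has not been checked against Perl's num() for edge cases.

```python
import unicodedata
from typing import Optional, Union


def num_like(s: str) -> Optional[Union[int, float]]:
    """Hypothetical analogue of the num() behaviour described above."""
    if not s:
        return None
    if len(s) == 1:
        # Single character: its numeric value as a float, or None if it has none.
        return unicodedata.numeric(s, None)
    total = 0
    zero_points = set()
    for ch in s:
        # Longer strings must consist solely of decimal digits (category Nd).
        if unicodedata.category(ch) != "Nd":
            return None
        digit = unicodedata.decimal(ch)
        # Code point of the digit zero in this character's run of ten digits.
        zero_points.add(ord(ch) - digit)
        # Left-to-right positional notation, right-most digit is the ones place.
        total = total * 10 + digit
    # Reject strings that mix digits from different runs (e.g. ASCII + Arabic-Indic).
    if len(zero_points) != 1:
        return None
    return total
```

For instance, `num_like("\u00bd")` (VULGAR FRACTION ONE HALF) gives 0.5, `num_like("\u0663\u0664")` (ARABIC-INDIC DIGIT THREE, FOUR) gives 34, and `num_like("3\u0664")` gives None because the two digits come from different runs.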
From duerst at it.aoyama.ac.jp Fri Mar 19 01:40:32 2021
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Fri, 19 Mar 2021 15:40:32 +0900
Subject: HTML entities
In-Reply-To:
References: <3163040.UKa7oIsXr7@laptop> <001201d719c8$924b9720$b6e2c560$@xencraft.com>
Message-ID: <1866eedd-c8d4-9586-26bd-46125690d920@it.aoyama.ac.jp>

Hello Jukka, others,

On 2021/03/18 17:20, Jukka K. Korpela via Unicode wrote:
> Tex (textexin at xencraft.com) wrote:

>> However, you are quoting a doc that has been withdrawn.

> It's a pity that this well-written and useful document was withdrawn, for
> reasons I don't understand.

Here are the main reasons, as far as I understand them. Unicode gets
updated roughly once a year, and Web technology also changes over time.
There was not enough manpower to keep the document up to date.

In addition, the document was always a kind of tug-of-war between those
who pushed for more favorable descriptions of specific Unicode
characters (such as ⁴ in this discussion) or more favorable descriptions
of markup-based and style-based solutions (such as <sup>). That meant
that for each update, in addition to dealing with new characters, there
was a tendency to re-negotiate already established text. A consequence
of this tug-of-war was that the document was written in a way that made
clear that there was some choice between markup/styling and
special-purpose Unicode characters, but allowed each side to interpret
the document in the way they were seeing things.

On top of that, the document was also a joint publication of the Unicode
Consortium and W3C. So there were cases where a tug-of-war happened
inside the W3C, inside Unicode, or between the two organizations, or all
of it at the same time. Publication required approval by both sides, and
even a minor tweak from one side had to be approved by the other side.
The schedules of both sides had otherwise no reason to be in sync, so
the next version of Unicode was often around before the update for the
previous version was beginning to settle.

So at some point, some brave soul became aware of this situation and
proposed a withdrawal, and nobody else had the energy to object.

> Yet, the statement I quoted is valid and relevant on its own. To take an
> even more understandable example, the use of 10<sup>4</sup> versus 10⁴
> means that when an HTML document is saved as plain text, or copied and
> pasted to a plain text environment, or rendered in Braille or speech, the
> expression denoting the number 10,000 suddenly becomes 104.

Well, and then somebody else uses 10<sup>3.5</sup> somewhere. How are you
going to express this so that it doesn't turn into 103.5 in plain text?
The problem is that there is always a limit somewhere for plain text.

There is also always a limit somewhere for markup and styled rendering,
but it's in a quite different place.

>> If there are issues with how <sup> is implemented and renders, they should
>> be fixed rather than adding what would be many stylized named entities,
>> which would require the same code fixes.
>>
>
> The <sup> and <sub> elements have been in HTML well over 20 years, with no
> progress in implementations. I can imagine some of the reasons to this.

Out of the box rendering of <sup> and <sub> may be rather crude, but I
guess it should be possible to do a lot better with some dose of CSS and
possibly some Web fonts.

> But this is completely independent of the issue of named character
> reference. It does not affect the rendering the least whether SUPERSCRIPT
> FOUR appears in HTML source as such (as character data), as numeric
> reference &#x2074;, or as named reference &sup4;.
> The only differences
> between the latter two are that 1) the named reference is more mnemonic and
> therefore easier to write and 2) an HTML user agent needs to have an entry
> for it in its mapping table from names to numbers (so the implementation is
> extremely trivial, and the question would be how fast it would be made and
> how fast the installed browser base would be updated).

In theory, it could be made quite quickly. But it is a slippery slope.
There are always more characters for which somebody may want additional
named character entities. And so my guess would be that the browser
makers would be very cautious.

Regards, Martin.

> Personally, I don't see a problem in writing &#x2074; (and &#x2075; etc.)
> after I have learned to remember this. But the point is that when people
> complain that &sup4; does not work, then the answer should not be “use
> <sup>4</sup>”. It's something very different, and there are ways to use
> SUPERSCRIPT FOUR even in circumstances where you cannot type it directly or
> as a named reference.
>
> Yucca
>

From christoph.paeper at crissov.de Fri Mar 19 15:43:19 2021
From: christoph.paeper at crissov.de (Christoph Päper)
Date: Fri, 19 Mar 2021 21:43:19 +0100
Subject: HTML entities
In-Reply-To: <1866eedd-c8d4-9586-26bd-46125690d920@it.aoyama.ac.jp>
References: <1866eedd-c8d4-9586-26bd-46125690d920@it.aoyama.ac.jp>
Message-ID: <259EE9CC-3215-4535-91EA-29965C9884CB@crissov.de>

Martin J. Dürst via Unicode:
>
> Well, and then somebody else uses 10<sup>3.5</sup> somewhere. How are you going to express this so that it doesn't turn into 103.5 in plain text?

10<rp>^</rp><sup>3.5</sup> ;)

From jameskass at code2001.com Sat Mar 20 13:43:15 2021
From: jameskass at code2001.com (James Kass)
Date: Sat, 20 Mar 2021 18:43:15 +0000
Subject: UNIHAN update?
Message-ID: <6dd8d421-f707-951d-d057-2a684e4364a2@code2001.com>

https://www.unicode.org/Public/UCD/latest/ucd/

The file "UNIHAN.ZIP" has not been updated since 2020-02-19 and doesn't
seem to include any data for CJK Unified Ideographs Extension G. Is
there some kind of update forthcoming? Or am I looking in the wrong place?

From markus.icu at gmail.com Sat Mar 20 22:42:33 2021
From: markus.icu at gmail.com (Markus Scherer)
Date: Sat, 20 Mar 2021 20:42:33 -0700
Subject: UNIHAN update?
In-Reply-To: <6dd8d421-f707-951d-d057-2a684e4364a2@code2001.com>
References: <6dd8d421-f707-951d-d057-2a684e4364a2@code2001.com>
Message-ID:

On Sat, Mar 20, 2021 at 11:45 AM James Kass via Unicode wrote:

> https://www.unicode.org/Public/UCD/latest/ucd/
>
> The file "UNIHAN.ZIP" has not been updated since 2020-02-19 and doesn't
> seem to include any data for CJK Unified Ideographs Extension G. Is
> there some kind of update forthcoming? Or am I looking in the wrong place?
>

The "latest" is for the latest released version of Unicode, which is
version 13.

This fall, Unicode 14 will be published. The "alpha" data files for that
are here: https://www.unicode.org/Public/14.0.0/ucd/

See http://blog.unicode.org/2021/02/unicode-140-alpha-review.html and
follow the link to PRI #428 and further links from there.

markus

From jameskasskrv at gmail.com Sun Mar 21 00:15:17 2021
From: jameskasskrv at gmail.com (James Kass)
Date: Sun, 21 Mar 2021 05:15:17 +0000
Subject: UNIHAN update?
In-Reply-To:
References: <6dd8d421-f707-951d-d057-2a684e4364a2@code2001.com>
Message-ID: <1fe4b2e9-d385-4eb5-24d9-c31f4086d4bf@gmail.com>

On 2021-03-21 3:42 AM, Markus Scherer via Unicode wrote:
> The "latest" is for the latest released version of Unicode, which is
> version 13.
>
> This fall, Unicode 14 will be published. The "alpha" data files for that
> are here: https://www.unicode.org/Public/14.0.0/ucd/

Thank you.
CJK Extension G became official with Unicode 13.0.

The file "Unihan_RadicalStrokeCounts.txt" contained in UNIHAN.ZIP
(Unihan-14.0.0d3.zip) just downloaded from the 14.0.0 link is also not
updated. There's practically no data for Extension E or Extension F and
nothing at all for Plane 3 (Ext G).

But it does appear that the updated information I seek is in the 14.0.0
file "Unihan_IRGSources.txt".

From asmusf at ix.netcom.com Sun Mar 21 03:20:08 2021
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Sun, 21 Mar 2021 01:20:08 -0700
Subject: UNIHAN update?
In-Reply-To: <1fe4b2e9-d385-4eb5-24d9-c31f4086d4bf@gmail.com>
References: <6dd8d421-f707-951d-d057-2a684e4364a2@code2001.com> <1fe4b2e9-d385-4eb5-24d9-c31f4086d4bf@gmail.com>
Message-ID: <91bf5d22-080a-df1b-d13c-ff03838ecadf@ix.netcom.com>

An HTML attachment was scrubbed...

From duerst at it.aoyama.ac.jp Sun Mar 21 06:17:46 2021
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Sun, 21 Mar 2021 20:17:46 +0900
Subject: HTML entities
In-Reply-To: <259EE9CC-3215-4535-91EA-29965C9884CB@crissov.de>
References: <1866eedd-c8d4-9586-26bd-46125690d920@it.aoyama.ac.jp> <259EE9CC-3215-4535-91EA-29965C9884CB@crissov.de>
Message-ID:

On 2021/03/20 05:43, Christoph Päper wrote:
> Martin J. Dürst via Unicode:
>>
>> Well, and then somebody else uses 10<sup>3.5</sup> somewhere. How are you going to express this so that it doesn't turn into 103.5 in plain text?
>
> 10<rp>^</rp><sup>3.5</sup> ;)

Interesting idea to use the <rp> (Ruby parenthesis) element. But I'm
sure there's a better (semantically more appropriate) way to use markup
(+maybe styling) to hide the "^" but let it appear when in plain text.

Regards, Martin.

From christoph.paeper at crissov.de Sun Mar 21 07:18:34 2021
From: christoph.paeper at crissov.de (Christoph Päper)
Date: Sun, 21 Mar 2021 13:18:34 +0100
Subject: HTML entities
In-Reply-To:
References:
Message-ID: <7A9EB686-D4A3-4E8E-BD11-64E4D8447746@crissov.de>

> Martin J.
Dürst via Unicode:
>
> Interesting idea to use the <rp> (Ruby parenthesis) element. But I'm sure there's a better (semantically more appropriate) way to use markup (+maybe styling) to hide the "^" but let it appear when in plain text.

I don't think there's one in HTML.

Following the precedent set by U+2064 Invisible Plus (e.g. between integer and vulgar fraction) and U+2062 Invisible Times (e.g. between letter constants or variables), Unicode could add X+2065 Invisible Exponentiation (or Invisible Opening Parenthesis and Invisible Closing Parenthesis).

From jameskass at code2001.com Sun Mar 21 13:58:08 2021
From: jameskass at code2001.com (James Kass)
Date: Sun, 21 Mar 2021 18:58:08 +0000
Subject: UNIHAN update?
In-Reply-To: <91bf5d22-080a-df1b-d13c-ff03838ecadf@ix.netcom.com>
References: <6dd8d421-f707-951d-d057-2a684e4364a2@code2001.com> <1fe4b2e9-d385-4eb5-24d9-c31f4086d4bf@gmail.com> <91bf5d22-080a-df1b-d13c-ff03838ecadf@ix.netcom.com>
Message-ID: <6fbed239-22f1-b63e-9c51-0fb0849b8766@code2001.com>

On 2021-03-21 8:20 AM, Asmus Freytag via Unicode wrote:
> Are you saying that the zip file is incomplete?

I was looking for updated radical and stroke data for CJK ideographs
through Extension G. That tantalizing file name,
"Unihan_RadicalStrokeCounts.txt", misled me.

That file only contains two fields from the Unihan database:

#    kRSAdobe_Japan1_6
#    kRSKangXi

... and has no entries for Extension G. (And very little for E and F.)

So I *was* looking in the wrong place, apparently.

Happy ending, though. I found what I was looking for.

From markus.icu at gmail.com Sun Mar 21 17:13:09 2021
From: markus.icu at gmail.com (Markus Scherer)
Date: Sun, 21 Mar 2021 15:13:09 -0700
Subject: UNIHAN update?
In-Reply-To: <6fbed239-22f1-b63e-9c51-0fb0849b8766@code2001.com>
References: <6dd8d421-f707-951d-d057-2a684e4364a2@code2001.com> <1fe4b2e9-d385-4eb5-24d9-c31f4086d4bf@gmail.com> <91bf5d22-080a-df1b-d13c-ff03838ecadf@ix.netcom.com> <6fbed239-22f1-b63e-9c51-0fb0849b8766@code2001.com>
Message-ID:

For documentation of the Unihan database please see UAX #38.
Note that which field/property is in which file may change between versions.

There is a proposed update for Unicode 14:
https://www.unicode.org/review/pri421/
--> https://www.unicode.org/reports/tr38/proposed.html

markus

From jukkakk at gmail.com Mon Mar 22 03:53:31 2021
From: jukkakk at gmail.com (Jukka K. Korpela)
Date: Mon, 22 Mar 2021 10:53:31 +0200
Subject: HTML entities
In-Reply-To: <1866eedd-c8d4-9586-26bd-46125690d920@it.aoyama.ac.jp>
References: <3163040.UKa7oIsXr7@laptop> <001201d719c8$924b9720$b6e2c560$@xencraft.com> <1866eedd-c8d4-9586-26bd-46125690d920@it.aoyama.ac.jp>
Message-ID:

Martin J. Dürst (duerst at it.aoyama.ac.jp) wrote:

> Hello Jukka, others,
>
> On 2021/03/18 17:20, Jukka K. Korpela via Unicode wrote:
> > Tex (textexin at xencraft.com) wrote:
>
> >> However, you are quoting a doc that has been withdrawn.
>
> > It's a pity that this well-written and useful document was withdrawn, for
> > reasons I don't understand.
>
> Here are the main reasons, as far as I understand them. Unicode gets
> updated roughly once a year, and Web technology also changes over time.
> There was not enough manpower to keep the document up to date.
>
> In addition, the document was always a kind of tug-of-war between those
> who pushed for more favorable descriptions of specific Unicode
> characters (such as ⁴ in this discussion) or more favorable descriptions
> of markup-based and style-based solutions (such as <sup>).

Thank you for the description.
These opposite views surely reflected
different needs, such as the need to represent data in plain text in some
contexts and the need for more structured representation.

> Well, and then somebody else uses 10<sup>3.5</sup> somewhere. How are you
> going to express this so that it doesn't turn into 103.5 in plain text?
> The problem is that there is always a limit somewhere for plain text.

Well, in the given case, it might help if we had IMPLIED EXPONENTIATION (we
don't; we have IMPLIED TIMES, but it does not help here); at least it would
appear in text data to indicate that adjacent digits are not part of the
same number.

>
> There is also always a limit somewhere for markup and styled rendering,
> but it's in a quite different place.
>

Regarding exponents, the limit is currently set by the presence of
superscript characters for digits, plus, and minus, and (for some reason),
=, (, ), and n. This covers most of the cases where one might consider
using superscripts in general texts and in expressing values of quantities.
But when you have, say, text that contains the simple expression
*a<sup>x</sup>* with *x* as a superscript denoting exponent, there is no
satisfactory way to represent it in plain text. Using just ax would mean
using a wrong expression, and using aˣ (with U+02E3 MODIFIER LETTER SMALL X)
would be too tricky. Unicode hasn't got a repertoire of superscript Latin
letters even though they are often used as semantically different from
normal letters; it only has some of such letters, apparently meant for
special uses only (like phonetic symbols).

>
> Out of the box rendering of <sup> and <sub> may be rather crude, but I
> guess it should be possible to do a lot better with some dose of CSS and
> possibly some Web fonts.
>

In a sense, it would be straightforward to map, say, <sup>2</sup> to
SUPERSCRIPT TWO in the rendering phase, either directly at the character
level or via glyph selection when an OpenType font is used.
In another sense, it would
be complicated, since we hardly want to have <sup>2</sup> rendered
substantially different from <sup>x</sup> in style. So the mapping should
take place only when the entire document contains only such <sup> elements
where all characters have superscript counterparts in Unicode (or at the
glyph level).

Jukka

From marius.spix at web.de Mon Mar 22 08:23:59 2021
From: marius.spix at web.de (Marius Spix)
Date: Mon, 22 Mar 2021 14:23:59 +0100
Subject: Aw: Re: HTML entities
In-Reply-To:
References: <3163040.UKa7oIsXr7@laptop> <001201d719c8$924b9720$b6e2c560$@xencraft.com> <1866eedd-c8d4-9586-26bd-46125690d920@it.aoyama.ac.jp>
Message-ID:

An HTML attachment was scrubbed...

From christoph.paeper at crissov.de Mon Mar 22 12:17:24 2021
From: christoph.paeper at crissov.de (Christoph Päper)
Date: Mon, 22 Mar 2021 18:17:24 +0100
Subject: HTML entities
In-Reply-To:
References:
Message-ID: <21390B27-7E3D-4A7B-A7FD-F6EF83BED603@crissov.de>

Marius Spix via Unicode:
>
> CSS is also no solution, because <sup> and <sub> are semantic tags (like , , and ) and not just stylistic ones (like , , or ).

When HTML introduced the `b`/`strong` and `i`/`em` distinctions, it should also have added presentational/semantic pairs

- `sup`/`exp` (exponent) or `pow` (power) and
- `sub`/`idx`, `ind` (index) or `base`.

I don't think the WHATWG or W3C would be interested in adding them now.

From marius.spix at web.de Mon Mar 22 12:37:20 2021
From: marius.spix at web.de (Marius Spix)
Date: Mon, 22 Mar 2021 18:37:20 +0100
Subject: Aw: Re: HTML entities
In-Reply-To: <21390B27-7E3D-4A7B-A7FD-F6EF83BED603@crissov.de>
References: <21390B27-7E3D-4A7B-A7FD-F6EF83BED603@crissov.de>
Message-ID:

An HTML attachment was scrubbed...

From marius.spix at web.de Mon Mar 22 12:44:10 2021
From: marius.spix at web.de (Marius Spix)
Date: Mon, 22 Mar 2021 18:44:10 +0100
Subject: Fw: Aw: Re: HTML entities
References: <21390B27-7E3D-4A7B-A7FD-F6EF83BED603@crissov.de>
Message-ID:

An HTML attachment was scrubbed...

From harjitmoe at outlook.com Mon Mar 22 13:39:35 2021
From: harjitmoe at outlook.com (Harriet Riddle)
Date: Mon, 22 Mar 2021 18:39:35 +0000
Subject: Aw: Re: HTML entities
In-Reply-To:
References: <21390B27-7E3D-4A7B-A7FD-F6EF83BED603@crissov.de> ,
Message-ID:

Several originally presentational elements have been re-defined in HTML5 as having vague semantics distinct from just a styled span, but also distinct from any similarly styled semantic elements; those which could not be were deprecated. This applies to more than just sup/sub, e.g. <i> is treated as a vague differentiated, but not emphasised, voice, such as commentary or a character's thoughts, et cetera.

This has some interesting effects: <small> has been interpreted as a de-emphasis and is still valid, while the accompanying <big> is deprecated since it could not be given a consistent distinctive semantic (e.g., headings should use heading elements).

--Har.

________________________________
From: Unicode on behalf of Marius Spix via Unicode
Sent: Monday, March 22, 2021 5:44:10 PM
To: christoph.paeper at crissov.de
Cc: unicode at unicode.org
Subject: Fw: Aw: Re: HTML entities

I did some further research: The WHATWG spec differs from the Mozilla definition. It lists <sup> and <sub> in the text-level semantics section and states:

> These elements must be used only to mark up typographical conventions with specific meanings, not for typographical presentation for presentation's sake.
> The sub element can be used inside a var element, for variables that have subscripts.

See also: https://html.spec.whatwg.org/multipage/text-level-semantics.html#the-sub-and-sup-elements

Regards,

Marius Spix

Gesendet: Montag, 22.
März 2021 um 18:37 Uhr
Von: "Marius Spix"
An: christoph.paeper at crissov.de
Cc: unicode at unicode.org
Betreff: Aw: Re: HTML entities

Dear Christoph,

according to Mozilla [1],

> The <sup> element should only be used for typographical reasons--that is, to change the position of the text to comply
> with typographical conventions or standards, rather than solely for presentation or appearance purposes.

[1] https://developer.mozilla.org/en-US/docs/Web/HTML/Element/sup

Regards,

Marius Spix

Gesendet: Montag, 22. März 2021 um 18:17 Uhr
Von: "Christoph Päper via Unicode"
An: unicode at unicode.org
Betreff: Re: HTML entities

Marius Spix via Unicode:
>
> CSS is also no solution, because <sup> and <sub> are semantic tags (like , , and ) and not just stylistic ones (like , , or ).

When HTML introduced the `b`/`strong` and `i`/`em` distinctions, it should also have added presentational/semantic pairs

- `sup`/`exp` (exponent) or `pow` (power) and
- `sub`/`idx`, `ind` (index) or `base`.

I don't think the WHATWG or W3C would be interested in adding them now.

From asmusf at ix.netcom.com Mon Mar 22 14:24:04 2021
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Mon, 22 Mar 2021 12:24:04 -0700
Subject: Aw: Re: HTML entities
In-Reply-To:
References: <21390B27-7E3D-4A7B-A7FD-F6EF83BED603@crissov.de>
Message-ID:

An HTML attachment was scrubbed...
URL:

From mark at macchiato.com Mon Mar 22 14:27:44 2021
From: mark at macchiato.com (Mark Davis ☕️)
Date: Mon, 22 Mar 2021 12:27:44 -0700
Subject: Aw: Re: HTML entities
In-Reply-To:
References: <21390B27-7E3D-4A7B-A7FD-F6EF83BED603@crissov.de>
Message-ID:

+1

On Mon, Mar 22, 2021, 12:26 Asmus Freytag via Unicode wrote:

> On 3/22/2021 10:37 AM, Marius Spix via Unicode wrote:
>
> Dear Christoph,
>
> according to Mozilla [1],
>
> "The <sup> element should only be used for typographical reasons -- that
> is, to change the position of the text to comply with typographical
> conventions or standards, rather than solely for presentation or appearance
> purposes."
>
> [1] https://developer.mozilla.org/en-US/docs/Web/HTML/Element/sup
>
>
> Now, I have a hard time coming up with examples of "presentation or
> appearance" purposes that require small, raised letters or digits and are
> *not* related to some "typographical convention".
>
> The problem with <sup> seems to be more in the fact that there's more than
> one convention that might apply.
>
> A./
>
>
> Regards,
>
> Marius Spix
>
>
> *Sent:* Monday, 22 March 2021 at 18:17
> *From:* "Christoph Päper via Unicode"
> *To:* unicode at unicode.org
> *Subject:* Re: HTML entities
> Marius Spix via Unicode :
> >
> > CSS is also no solution, because <sup> and <sub> are semantic tags (like
> > [...], [...], and [...]) and not just stylistic ones (like [...], [...],
> > or [...]).
>
> When HTML introduced the `b`/`strong` and `i`/`em` distinctions, it should
> also have added presentational/semantic pairs
>
> - `sup`/`exp` (exponent) or `pow` (power) and
> - `sub`/`idx`, `ind` (index) or `base`.
>
> I don't think the WHATWG or W3C would be interested in adding them now.
>

-------------- next part -------------- An HTML attachment was scrubbed...
URL:

From richard.wordingham at ntlworld.com Mon Mar 22 17:16:24 2021
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 22 Mar 2021 22:16:24 +0000
Subject: Keyboard Suddenly Outputting in NFD
Message-ID: <20210322221624.2b4d61ed@JRWUBU2>

I'm asking here because my searches turned up nothing.

I've just noticed that when I use my handrolled keyboard, which is designed to
output NFC, the text that appears on the terminal (Gnome-terminal) or in the
browser (Firefox, typing into a Wikimedia form) is stored as NFD UTF-8. I use
an M17n definition with fcitx on Ubuntu 16.04.3 as the input method. It used
to generate NFC; I'm not sure when it changed to generating NFD text.

This change causes me grief because I am using grep to search data files
stored in NFC; grep does not respect canonical equivalence, so a typed-in
sequence in NFD does not match the NFC data in the file.

Does anyone know where this change has occurred? Are there any quick fixes?

I do have a grep-like search utility that respects canonical equivalence, but
it's a bit slow with a million-line input file.

Richard.

From duerst at it.aoyama.ac.jp Mon Mar 22 18:23:29 2021
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Tue, 23 Mar 2021 08:23:29 +0900
Subject: Aw: Re: HTML entities
In-Reply-To:
References: <21390B27-7E3D-4A7B-A7FD-F6EF83BED603@crissov.de>
Message-ID:

Hello Asmus, others,

On 2021/03/23 04:24, Asmus Freytag via Unicode wrote:
> On 3/22/2021 10:37 AM, Marius Spix via Unicode wrote:
>> Dear Christoph,
>> according to Mozilla [1],
>> "The <sup> element should only be used for typographical reasons -- that
>> is, to change the position of the text to comply with typographical
>> conventions or standards, rather than solely for presentation or
>> appearance purposes."
>> [1] https://developer.mozilla.org/en-US/docs/Web/HTML/Element/sup
>
> Now, I have a hard time coming up with examples of "presentation or
> appearance" purposes that require small, raised letters or digits and are
> *not* related to some "typographical convention".
>
> The problem with <sup> seems to be more in the fact that there's more than
> one convention that might apply.

I agree that this text from MDN is not very good. I think that what it meant
is something like "don't use <sup> if you want smaller, raised letters just
for a change or just for fun". Also, of course, MDN is not a specification.

Regards, Martin.

From duerst at it.aoyama.ac.jp Mon Mar 22 18:44:11 2021
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Tue, 23 Mar 2021 08:44:11 +0900
Subject: Aw: Re: HTML entities
In-Reply-To:
References: <3163040.UKa7oIsXr7@laptop> <001201d719c8$924b9720$b6e2c560$@xencraft.com> <1866eedd-c8d4-9586-26bd-46125690d920@it.aoyama.ac.jp>
Message-ID: <1575b882-826a-151a-26b6-dfc41503df1c@it.aoyama.ac.jp>

Hello Marius, others,

On 2021/03/22 22:23, Marius Spix via Unicode wrote:
> You cannot just map <sup>2</sup> to SUPERSCRIPT TWO, because you may have
> cases with nested <sup> or <sub> like 10<sup>(10<sup>100</sup>)</sup>,
> which is the representation of a number known as Googolplex, or
> ?CO<sub>2</sub>, which is the percentage of carbon dioxide in an air
> sample. Such cases are not and should not be handled by Unicode, because
> their interpretation requires a stack machine.
> CSS is also no solution, because <sup> and <sub> are semantic tags (like
> [...], [...], and [...]) and not just stylistic ones (like [...], [...],
> or [...]).

What I meant was not to use CSS instead of <sup> or <sub>, but to use it in
addition to one of these. That should make it possible to address the
browser's limitations in rendering superscripts and subscripts. Using CSS
(and Web Fonts) it should be possible to get as close as needed in look and
style to the built-in superscript characters without actually using these
characters.
That would also make sure that none of these characters needs character
entity references, and there is no worry about using a character that does
not have a superscript (or subscript) variant in Unicode itself.

That would avoid the slippery slope problem both for character entity
references and for Unicode superscript/subscript variants. And that's a very
good thing, because whenever somebody comes up with a request for yet another
of these, the only thing that is sure is that it won't be the last.

See an additional comment below.

> *Sent:* Monday, 22 March 2021 at 09:53
> *From:* "Jukka K. Korpela via Unicode"
> *To:* "Martin J. Dürst"
> *Cc:* "via Unicode"
> *Subject:* Re: HTML entities
> Martin J. Dürst (duerst at it.aoyama.ac.jp ) wrote:
>
> Hello Jukka, others,
>
> On 2021/03/18 17:20, Jukka K. Korpela via Unicode wrote:
> > Tex (textexin at xencraft.com ) wrote:
>
> >> However, you are quoting a doc that has been withdrawn.
>
> > It's a pity that this well-written and useful document was withdrawn, for
> > reasons I don't understand.
>
> Here are the main reasons, as far as I understand them. Unicode gets
> updated roughly once a year, and Web technology also changes over time.
> There was not enough manpower to keep the document up to date.
>
> In addition, the document was always a kind of tug-of-war between those
> who pushed for more favorable descriptions of specific Unicode
> characters (such as ? in this discussion) or more favorable descriptions
> of markup-based and style-based solutions (such as <sup>).
>
> Thank you for the description. These opposite views surely reflected
> different needs, such as the need to represent data in plain text in some
> contexts and the need for more structured representation.

Not only. They also were a front line in the discussion about how far Unicode
should go in encoding characters with typographical/stylistic distinctions,
or in other words, what should be the limits of plain text.

Regards, Martin.
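The worry about characters that lack a superscript variant can be
illustrated with a minimal Python sketch. It is not part of any message in
this thread; the helper name superscript_or_none is hypothetical, and the
table covers only the superscript digits, signs, parentheses and "n" that
Jukka lists in the quoted text.

```python
from typing import Optional

# Characters with a dedicated superscript counterpart in Unicode:
# digits, plus, minus, equals, parentheses, and the letter n.
SUPERSCRIPTS = {
    "0": "\u2070", "1": "\u00b9", "2": "\u00b2", "3": "\u00b3",
    "4": "\u2074", "5": "\u2075", "6": "\u2076", "7": "\u2077",
    "8": "\u2078", "9": "\u2079",
    "+": "\u207a", "-": "\u207b", "=": "\u207c",
    "(": "\u207d", ")": "\u207e", "n": "\u207f",
}


def superscript_or_none(exponent: str) -> Optional[str]:
    """Map a flat exponent to superscript characters.

    Returns None as soon as one character has no entry in the table,
    which is where markup such as <sup> is still needed.
    """
    if any(ch not in SUPERSCRIPTS for ch in exponent):
        return None
    return "".join(SUPERSCRIPTS[ch] for ch in exponent)


if __name__ == "__main__":
    for text in ("100", "-2", "3.5", "x"):
        print(repr(text), "->", superscript_or_none(text))
```

Under these assumptions "100" and "-2" can be mapped, while "3.5" and "x"
cannot, because the table has no superscript full stop or superscript x. A
nested exponent cannot be expressed at all in this flat scheme, since there
is no superscript of a superscript.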
> Well, and then somebody else uses 10<sup>3.5</sup> somewhere. How are you
> going to express this so that it doesn't turn into 103.5 in plain text?
> The problem is that there is always a limit somewhere for plain text.
>
> Well, in the given case, it might help if we had IMPLIED EXPONENTIATION (we
> don't; we have IMPLIED TIMES, but it does not help here); at least it would
> appear in text data to indicate that adjacent digits are not part of the
> same number.
>
>
> There is also always a limit somewhere for markup and styled rendering,
> but it's in a quite different place.
>
> Regarding exponents, the limit is currently set by the presence of
> superscript characters for digits, plus, and minus, and (for some reason),
> =, (, ), and n. This covers most of the cases where one might consider
> using superscripts in general texts and in expressing values of quantities.
>
> But when you have, say, text that contains the simple expression
> /a<sup>x</sup>/ with /x/ as a superscript denoting exponent, there is no
> satisfactory way to represent it in plain text. Using just ax would mean
> using a wrong expression, and using aˣ (with U+02E3 MODIFIER LETTER SMALL
> X) would be too tricky. Unicode hasn't got a repertoire of superscript
> Latin letters even though they are often used as semantically different
> from normal letters; it only has some of such letters, apparently meant
> for special uses only (like phonetic symbols).
>
>
> Out of the box rendering of <sup> and <sub> may be rather crude, but I
> guess it should be possible to do a lot better with some dose of CSS and
> possibly some Web fonts.
>
> In a sense, it would be straightforward to map, say, <sup>2</sup> to
> SUPERSCRIPT TWO in the rendering phase, either directly at the character
> level or via glyph selection when an OpenType font is used. In another
> sense, it would be complicated, since we hardly want to have <sup>2</sup>
> rendered substantially different from <sup>x</sup> in style. So the mapping
> should take place only when the entire document contains only such <sup>
> elements where all characters have superscript counterparts in Unicode (or
> at the glyph level).
>
> Jukka

From asmusf at ix.netcom.com Mon Mar 22 19:29:56 2021
From: asmusf at ix.netcom.com (Asmus Freytag (c))
Date: Mon, 22 Mar 2021 17:29:56 -0700
Subject: Aw: Re: HTML entities
In-Reply-To:
References: <21390B27-7E3D-4A7B-A7FD-F6EF83BED603@crissov.de>
Message-ID:

On 3/22/2021 4:23 PM, Martin J. Dürst wrote:
> Hello Asmus, others,
>
> On 2021/03/23 04:24, Asmus Freytag via Unicode wrote:
>> On 3/22/2021 10:37 AM, Marius Spix via Unicode wrote:
>>> Dear Christoph,
>>> according to Mozilla [1],
>>> "The <sup> element should only be used for typographical
>>> reasons -- that is, to change the position of the text to comply with
>>> typographical conventions or standards, rather than solely for
>>> presentation or appearance purposes."
>>> [1] https://developer.mozilla.org/en-US/docs/Web/HTML/Element/sup
>>
>> Now, I have a hard time coming up with examples of "presentation or
>> appearance" purposes that require small, raised letters or digits and
>> are *not* related to some "typographical convention".
>>
>> The problem with <sup> seems to be more in the fact that there's more
>> than one convention that might apply.
>
> I agree that this text from MDN is not very good. I think that what it
> meant is something like "don't use <sup> if you want smaller, raised
> letters just for a change or just for fun". Also, of course, MDN is
> not a specification.

Right, we get that.

In the unusual circumstance that I might want smaller, raised letters "just
for fun", I may not care about a precise appearance, so I wouldn't pay
attention to "rules" anyway.

The real issue with <sup> compared to <strong> is that language like that
makes it masquerade as "semantic", when it isn't.

A./

-------------- next part -------------- An HTML attachment was scrubbed...
URL:

From duerst at it.aoyama.ac.jp Mon Mar 22 20:18:54 2021
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Tue, 23 Mar 2021 10:18:54 +0900
Subject: Aw: Re: HTML entities
In-Reply-To:
References: <21390B27-7E3D-4A7B-A7FD-F6EF83BED603@crissov.de>
Message-ID: <3aae7bb0-8a74-ae66-7cb6-d1e4623de9f1@it.aoyama.ac.jp>

Hello Asmus, others,

On 2021/03/23 09:29, Asmus Freytag (c) wrote:
> On 3/22/2021 4:23 PM, Martin J. Dürst wrote:
>> I agree that this text from MDN is not very good. I think that what it
>> meant is something like "don't use <sup> if you want smaller, raised
>> letters just for a change or just for fun". Also, of course, MDN is
>> not a specification.
>
> Right, we get that.
>
> In the unusual circumstance that I might want smaller, raised letters
> "just for fun", I may not care about a precise appearance, so I wouldn't
> pay attention to "rules" anyway.
>
> The real issue with <sup> compared to <strong> is that language like
> that makes it masquerade as "semantic", when it isn't.

In my opinion, in these contexts, 'semantic' has to be seen as a matter of
degree. <strong> may have a higher degree of semantics than <sup>.

For <sup>, it's essentially any kind of semantics that is usually displayed
as a superscript, which could be e.g. an exponent, a superscript index in
some mathematical or physical notation, a superscript in some phonetic
notation, and so on.

For <strong>, at least if we follow the meaning of the word 'strong' itself,
it's any kind of semantics that implies some kind of strengthening, which
still could be a rather wide range.

In both cases, for finer semantics, an HTML class attribute might be used.

Regards, Martin.
> A./
>

From richard.wordingham at ntlworld.com Tue Mar 23 03:11:45 2021
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 23 Mar 2021 08:11:45 +0000
Subject: Keyboard Suddenly Outputting in NFD
In-Reply-To: <20210322221624.2b4d61ed@JRWUBU2>
References: <20210322221624.2b4d61ed@JRWUBU2>
Message-ID: <20210323081145.039fa11c@JRWUBU2>

On Mon, 22 Mar 2021 22:16:24 +0000 Richard Wordingham via Unicode wrote:
> Are there any quick fixes?

There is one off-the-shelf fix. Instead of typing

grep pattern file

one types

grep $(uconv -x any-nfc << [...]

[...] References: Message-ID:

Martin J. Dürst via Unicode :
>
> Interesting idea to use the <rp> (Ruby parenthesis) element. But I'm sure
> there's a better (semantically more appropriate) way to use markup (+maybe
> styling) to hide the "^" but let it appear when in plain text.

I've asked just now:

From ishida at w3.org Tue Mar 23 06:04:21 2021
From: ishida at w3.org (r12a)
Date: Tue, 23 Mar 2021 11:04:21 +0000
Subject: HTML entities
In-Reply-To:
References: <3163040.UKa7oIsXr7@laptop> <001201d719c8$924b9720$b6e2c560$@xencraft.com> <1866eedd-c8d4-9586-26bd-46125690d920@it.aoyama.ac.jp>
Message-ID: <7bc7d93a-da39-ea99-0d06-5df13c745425@w3.org>

fwiw, i was curious enough to check it out, and Unicode has the full ASCII
lower-case alphabet except for q available as superscripted letters.

????????????????q???????????????????

ri

Jukka K. Korpela via Unicode wrote on 22/03/2021 08:53:
> Unicode hasn't got a repertoire of superscript Latin letters even
> though they are often used as semantically different from normal
> letters; it only has some of such letters, apparently meant for
> special uses only (like phonetic symbols).

-------------- next part -------------- An HTML attachment was scrubbed...
URL:

From kilobyte at angband.pl Tue Mar 23 07:53:09 2021
From: kilobyte at angband.pl (Adam Borowski)
Date: Tue, 23 Mar 2021 13:53:09 +0100
Subject: HTML entities
In-Reply-To: <7bc7d93a-da39-ea99-0d06-5df13c745425@w3.org>
References: <3163040.UKa7oIsXr7@laptop> <001201d719c8$924b9720$b6e2c560$@xencraft.com> <1866eedd-c8d4-9586-26bd-46125690d920@it.aoyama.ac.jp> <7bc7d93a-da39-ea99-0d06-5df13c745425@w3.org>
Message-ID:

On Tue, Mar 23, 2021 at 11:04:21AM +0000, r12a via Unicode wrote:
> Jukka K. Korpela via Unicode wrote on 22/03/2021 08:53:
> > Unicode hasn't got a repertoire of superscript Latin letters even though
> > they are often used as semantically different from normal letters; it
> > only has some of such letters, apparently meant for special uses only
> > (like phonetic symbols).

> fwiw, i was curious enough to check it out, and Unicode has the full ASCII
> lower-case alphabet except for q available as superscripted letters.
>
> ????????????????q???????????????????

And for uppercase:
ᴬᴮCᴰᴱFᴳᴴᴵᴶᴷᴸᴹᴺᴼᴾQᴿSᵀᵁⱽᵂXYZ
plus look-alikes: ???

The pipeline already includes CFQ.

Thus, what about adding the stragglers, i.e., qSXYZ?

On the other hand, subscript is nowhere close:
ₐbcdₑfgₕᵢⱼₖₗₘₙₒₚqᵣₛₜᵤᵥwₓyz
with no capitals.

Meow!
--
??????? Latin: meow 4 characters, 4 columns, 4 bytes
??????? Greek: ???? 4 characters, 4 columns, 8 bytes
??????? Runes: ???? 4 characters, 4 columns, 12 bytes
??????? Chinese: ? 1 character, 2 columns, 3 bytes <-- best!

From beckiergb at gmail.com Tue Mar 23 14:24:08 2021
From: beckiergb at gmail.com (Rebecca Bettencourt)
Date: Tue, 23 Mar 2021 12:24:08 -0700
Subject: HTML entities
In-Reply-To:
References: <3163040.UKa7oIsXr7@laptop> <001201d719c8$924b9720$b6e2c560$@xencraft.com> <1866eedd-c8d4-9586-26bd-46125690d920@it.aoyama.ac.jp> <7bc7d93a-da39-ea99-0d06-5df13c745425@w3.org>
Message-ID:

The pipeline also includes lowercase q.
You're not going to convince the UTC to encode superscript SXYZ unless you
find evidence of them being used as part of a phonetic transcription system.
That's the only use case that has gotten superscripts and subscripts accepted
in recent years; all other proposals have been summarily dismissed.

-- Rebecca Bettencourt

On Tue, Mar 23, 2021 at 5:59 AM Adam Borowski via Unicode
<unicode at unicode.org> wrote:

> On Tue, Mar 23, 2021 at 11:04:21AM +0000, r12a via Unicode wrote:
> > Jukka K. Korpela via Unicode wrote on 22/03/2021 08:53:
> > > Unicode hasn't got a repertoire of superscript Latin letters even
> > > though they are often used as semantically different from normal
> > > letters; it only has some of such letters, apparently meant for
> > > special uses only (like phonetic symbols).
>
> > fwiw, i was curious enough to check it out, and Unicode has the full
> > ASCII lower-case alphabet except for q available as superscripted
> > letters.
> >
> > ????????????????q???????????????????
>
> And for uppercase:
> ᴬᴮCᴰᴱFᴳᴴᴵᴶᴷᴸᴹᴺᴼᴾQᴿSᵀᵁⱽᵂXYZ
> plus look-alikes: ???
>
> The pipeline already includes CFQ.
>
> Thus, what about adding the stragglers, i.e., qSXYZ?
>
>
> On the other hand, subscript is nowhere close:
> ₐbcdₑfgₕᵢⱼₖₗₘₙₒₚqᵣₛₜᵤᵥwₓyz
> with no capitals.
>
>
> Meow!
> --
> ??????? Latin: meow 4 characters, 4 columns, 4 bytes
> ??????? Greek: ???? 4 characters, 4 columns, 8 bytes
> ??????? Runes: ???? 4 characters, 4 columns, 12 bytes
> ??????? Chinese: ? 1 character, 2 columns, 3 bytes <-- best!
>

-------------- next part -------------- An HTML attachment was scrubbed...
URL:

From richard.wordingham at ntlworld.com Tue Mar 23 15:18:43 2021
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 23 Mar 2021 20:18:43 +0000
Subject: Keyboard Suddenly Outputting in NFD
In-Reply-To:
References: <20210322221624.2b4d61ed@JRWUBU2> <20210323081145.039fa11c@JRWUBU2>
Message-ID: <20210323201843.383d05fa@JRWUBU2>

On Tue, 23 Mar 2021 11:22:44 +0100 Marius Spix wrote:

> Logstash can be used for NFC normalization.

> Sent: Tuesday, 23 March 2021 at 09:11
> From: "Richard Wordingham via Unicode"
>> one types
>> grep $(uconv -x any-nfc << [...]
>> This won't work nicely if the pattern contains shell control
>> characters, such as spaces and dollars.

Ah, my solution is the wrong way round. It should be:

uconv -x any-nfd | grep pattern

I should make the data match the search string!

Richard.

From lyratelle at gmx.de Wed Mar 24 04:38:16 2021
From: lyratelle at gmx.de (Dominikus Dittes Scherkl)
Date: Wed, 24 Mar 2021 10:38:16 +0100
Subject: HTML entities
In-Reply-To: <7A9EB686-D4A3-4E8E-BD11-64E4D8447746@crissov.de>
References: <7A9EB686-D4A3-4E8E-BD11-64E4D8447746@crissov.de>
Message-ID: <8c84d590-e920-2b6c-d2bb-5a71665a50ed@gmx.de>

On 21.03.21 at 13:18, Christoph Päper via Unicode wrote:
>> Martin J. Dürst via Unicode :
>>
>> Interesting idea to use the <rp> (Ruby parenthesis) element. But I'm sure
>> there's a better (semantically more appropriate) way to use markup
>> (+maybe styling) to hide the "^" but let it appear when in plain text.
>
> I don't think there's one in HTML
>
> Following the precedent set by U+2064 Invisible Plus (e.g. between integer
> and vulgar fraction) and U+2062 Invisible Times (e.g. between letter
> constants or variables), Unicode could add U+2065 Invisible Exponentiation
> (or Invisible Opening Parenthesis and Invisible Closing Parenthesis).
>

Yes, I think adding an "Invisible Exponent" character to Unicode would really
help to solve this semantic distinction problem in plain text.
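The precedent cited above can be made concrete with a minimal Python
sketch. It is not part of any message in this thread. It assumes only the
standard unicodedata module; mixed_number_value is a hypothetical helper.
No invisible exponentiation character is encoded at the time of this
thread, so nothing here handles exponents.

```python
import unicodedata

INVISIBLE_TIMES = "\u2062"  # e.g. between letter constants or variables
INVISIBLE_PLUS = "\u2064"   # e.g. between an integer and a vulgar fraction


def mixed_number_value(text: str) -> float:
    """Evaluate '<integer> U+2064 <vulgar fraction character>'.

    The invisible operator tells a plain-text consumer that the two
    parts are added rather than concatenated into a single number.
    """
    whole, separator, fraction = text.partition(INVISIBLE_PLUS)
    if not separator or len(fraction) != 1:
        raise ValueError("expected integer, U+2064, one fraction character")
    return int(whole) + unicodedata.numeric(fraction)


if __name__ == "__main__":
    print(unicodedata.name(INVISIBLE_PLUS))
    print(unicodedata.category(INVISIBLE_TIMES))
    # "2", INVISIBLE PLUS, VULGAR FRACTION ONE HALF
    print(mixed_number_value("2\u2064\u00bd"))
```

An invisible exponentiation operator, if it were ever encoded, could be
consumed by the same kind of partition step, but that remains a proposal in
this thread and not something a parser can rely on.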
-- Dominikus Dittes Scherkl

From corentin.jabot at gmail.com Fri Mar 26 06:44:11 2021
From: corentin.jabot at gmail.com (Corentin)
Date: Fri, 26 Mar 2021 12:44:11 +0100
Subject: White spaces for the purpose of programming languages
Message-ID:

Hello

In UAX #44, White_Space is described as "Spaces, separator characters and
other control characters which should be treated by programming languages as
"white space" for the purpose of parsing elements."

From what I can tell, ECMAScript/JS uses White_Space (or rather
Space_Separator, which is slightly different), Rust uses Pattern_White_Space,
which is a more restricted set, while most other languages seem to support
only the ASCII spaces.

I wanted to confirm that the intent is that White_Space is recommended in
programming languages. I assumed that Pattern_White_Space would be more
suitable for that purpose, but it isn't actually clear from a reading of
UAX #31, which first states in its introduction

> A common task facing an implementer of the Unicode Standard is the
> provision of a parsing and/or lexing engine for identifiers, such as
> programming language variables or domain names.

But later:

Pattern Syntax : There are many circumstances where software interprets
patterns that are a mixture of literal characters, whitespace, and syntax
characters. Examples include regular expressions, Java collation rules, Excel
or ICU number formats, and many others.

(programming languages are not mentioned there)

Any clarification as to whether White_Space should be considered over
Pattern_White_Space for programming languages would be appreciated :)

I think that clarification might be useful for many users, as different
programming languages have made different choices!

Thanks,
Corentin

From mandel59 at gmail.com Fri Mar 26 11:43:30 2021
From: mandel59 at gmail.com (Ryusei)
Date: Sat, 27 Mar 2021 01:43:30 +0900
Subject: What is Urdu paragraph separator?
Message-ID: <2A159DCE-B621-4330-8997-821DFB613E42@gmail.com>

Hello

According to NamesList.txt, U+203B has several usages:

> 203B REFERENCE MARK
>   = Japanese kome
>   = Urdu paragraph separator
>   x (tibetan ku ru kha bzhi mig can - 0FBF)
>   x (cjk unified ideograph-200AD - 200AD)

I know the Japanese komejirushi, and a Wikipedia page shows a good real-life
usage in Japan. But I have never heard of the Urdu paragraph separator. How
is it used?

And why are the Urdu separator mark and the East Asian reference mark
unified? (I think unifying marks of different scripts is likely to cause
typographic issues, especially where font fallback is required. I don't
expect the Urdu separator mark to be rendered as a fullwidth character.)

Thanks,
Ryusei

From wjgo_10009 at btinternet.com Fri Mar 26 13:45:44 2021
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Fri, 26 Mar 2021 18:45:44 +0000 (GMT)
Subject: A poem using language-independent glyphs
Message-ID: <67cbea6c.1298.1786fdb4925.Webtop.87@btinternet.com>

Here is a link to a forum post that I produced today.

https://forum.affinity.serif.com/index.php?/topic/138654-artwork-for-greetings-cards/

I am hoping that in time these glyphs, and others, will become accessible
within regular Unicode using a mechanism related to, yet a little different
from, the mechanism used for QID emoji. The mechanism would be to use a tag
exclamation mark rather than the tag Q used for QID emoji in the original
proposal.
William Overington

Friday 26 March 2021

From richard.wordingham at ntlworld.com Sat Mar 27 14:00:22 2021
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sat, 27 Mar 2021 19:00:22 +0000
Subject: Keyboard Suddenly Outputting in NFD
In-Reply-To: <20210322221624.2b4d61ed@JRWUBU2>
References: <20210322221624.2b4d61ed@JRWUBU2>
Message-ID: <20210327190022.53c248cf@JRWUBU2>

On Mon, 22 Mar 2021 22:16:24 +0000 Richard Wordingham via Unicode wrote:

** FALSE ALARM! **

> I've just noticed that when I use my handrolled keyboard designed to
> output NFC, what appears on the terminal (Gnome-terminal) or browser
> (Firefox into a Wikimedia form), my text is being stored as NFD UTF-8.
> I use an M17n definition with fcitx on Ubuntu 16.04.3 as the input
> method. It used to generate NFC; I'm not sure when it suddenly changed
> to generating NFD text.

Sorry, it can't have been working as well as I thought it did. I seem to have
slightly broken the keyboard in October 2020.

(The keyboard converts XSAMPA input to IPA in NFC. The immediate idea was to
apply the transform for the string "_s" when there is a transform for "a_M",
without having to define a transform for "a_s". I defined a transform for
"a_" that left "_" in the pending input, but that extended transform then
fired instead of the one for "a_M".)

Richard.

From wjgo_10009 at btinternet.com Wed Mar 31 16:18:35 2021
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Wed, 31 Mar 2021 22:18:35 +0100 (BST)
Subject: A poem using language-independent glyphs
In-Reply-To: <67cbea6c.1298.1786fdb4925.Webtop.87@btinternet.com>
References: <67cbea6c.1298.1786fdb4925.Webtop.87@btinternet.com>
Message-ID: <7c2ed17b.153d.1788a270029.Webtop.108@btinternet.com>

The thread now has 28 posts in it and over 800 views. There are now three
poems using language-independent glyphs, two of the poems written today.
https://forum.affinity.serif.com/index.php?/topic/138654-artwork-for-greetings-cards

William Overington

Wednesday 31 March 2021

------ Original Message ------
From: "William_J_G Overington via Unicode"
To: unicode at unicode.org
Sent: Friday, 2021 Mar 26 At 18:45
Subject: A poem using language-independent glyphs

Here is a link to a forum post that I produced today.

https://forum.affinity.serif.com/index.php?/topic/138654-artwork-for-greetings-cards/

I am hoping that in time these glyphs, and others, will become accessible
within regular Unicode using a mechanism related to, yet a little different
from, the mechanism used for QID emoji. The mechanism would be to use a tag
exclamation mark rather than the tag Q used for QID emoji in the original
proposal.

William Overington

Friday 26 March 2021

From markus.icu at gmail.com Wed Mar 31 22:10:01 2021
From: markus.icu at gmail.com (Markus Scherer)
Date: Wed, 31 Mar 2021 20:10:01 -0700
Subject: White spaces for the purpose of programming languages
In-Reply-To:
References:
Message-ID:

On Fri, Mar 26, 2021 at 4:50 AM Corentin via Unicode wrote:

> In UAX #44, White_Space is described as "Spaces, separator characters and
> other control characters which should be treated by programming languages
> as "white space" for the purpose of parsing elements."
>
> From what I can tell, ECMAScript/JS uses White_Space (or
> rather Space_Separator which is slightly different), Rust uses
> Pattern_White_Space which is a more restricted set, while most other
> languages seem to only support the ASCII spaces.
>
> I wanted to confirm that the intent is that White_Space is recommended in
> programming languages.
> I assumed that Pattern_White_Space would be more suitable for that purpose,
> but it isn't actually clear from a reading of UAX #31
>

We came up with Pattern_White_Space for working with ICU *rule and pattern
strings* (e.g., rules to define sort orders, rules for number spellout,
date/time/number formatting patterns). This is why we included the RLM and
LRM controls: they make it easy to keep rule strings legible when there are
RTL characters. (If we were defining it now, I assume that we would also
include the newer ALM (U+061C), but the property is immutable so we can't add
anything.)

We proposed this as a Unicode property because it seemed useful. We were not
specifically thinking about whole programming languages.

I assume that existing languages are not going to want to make a change here.

When parsing *user input*, we generally look for all White_Space where
"space" is allowed.

Personally, I think that White_Space is unnecessarily broad for programming
language syntax. Pattern_White_Space might be a useful starting point.

- The bidi controls should probably not be programming "white space" on
  their own because they don't have any advance width. They should be
  allowed somewhere, maybe at token boundaries or after indenting spaces.
- U+0085 NEL is a holdover from OS/390 and the line feed confusion on IBM
  systems. (They didn't much care what LF/NEL mapped to because their text
  systems had a "record" per line and didn't need a line separator character
  like Unix-y systems.)
   - I can't tell if the EBCDIC platforms are "alive". Elsewhere I have
     tried to find out if there is a competent C++11 compiler available.
- Line & paragraph separators apparently never got much use.
- Form feed? Vertical tab?
- East Asian developers might appreciate U+3000 ideographic space because
  their IMEs tend to emit that.

So maybe just TAB, LF, CR, space (0020), and possibly wide space (3000), plus
also LRM/RLM/ALM at certain boundaries?
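The candidate sets discussed in this thread can be compared side by side
with a minimal Python sketch. It is not part of any message above. The code
point lists for White_Space and Pattern_White_Space are transcribed from
PropList.txt rather than queried from a library; MINIMAL_LANG_SPACE,
is_lang_space and split_tokens are hypothetical names for the reduced set
floated above, and the bidi marks are deliberately left out of the sketch.

```python
# Code points with the White_Space property (PropList.txt).
WHITE_SPACE = frozenset([
    *range(0x0009, 0x000E), 0x0020, 0x0085, 0x00A0, 0x1680,
    *range(0x2000, 0x200B), 0x2028, 0x2029, 0x202F, 0x205F, 0x3000,
])

# Code points with the Pattern_White_Space property (immutable set).
PATTERN_WHITE_SPACE = frozenset([
    *range(0x0009, 0x000E), 0x0020, 0x0085,
    0x200E, 0x200F,  # LRM, RLM
    0x2028, 0x2029,
])

# Hypothetical reduced set: TAB, LF, CR, SPACE, IDEOGRAPHIC SPACE.
MINIMAL_LANG_SPACE = frozenset([0x0009, 0x000A, 0x000D, 0x0020, 0x3000])


def is_lang_space(ch: str) -> bool:
    """True if ch separates tokens under the reduced set."""
    return ord(ch) in MINIMAL_LANG_SPACE


def split_tokens(source: str) -> list:
    """Split source text on the reduced white-space set only."""
    tokens, current = [], []
    for ch in source:
        if is_lang_space(ch):
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


if __name__ == "__main__":
    # Members of Pattern_White_Space that are not White_Space.
    print(sorted(hex(cp) for cp in PATTERN_WHITE_SPACE - WHITE_SPACE))
    # An ideographic space, as an East Asian IME might emit it.
    print(split_tokens("let\u3000x = 1"))
```

Under this reduced set a NO-BREAK SPACE (U+00A0) does not separate tokens,
which is one visible difference from a lexer that accepts all of
White_Space. Whether that is desirable is exactly the open question in the
message above.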
Best regards,
markus