From junicode at jcbradfield.org Sun Aug 1 08:53:29 2021
From: junicode at jcbradfield.org (Julian Bradfield)
Date: Sun, 1 Aug 2021 14:53:29 +0100 (BST)
Subject: superscript π?
References: <923fe98d-1f1b-270d-a9bb-36c9e48eda18@gmail.com>
Message-ID:

On 2021-08-01, David Chmelik via Unicode wrote:
> less-common cases there is notation for that. In the more common cases
> of just one level, scientists are tired of having to use other notation.

Do you have evidence for that? I've never interacted with a scientist who would rather remember how to type a superscript unicode character than type ^.

From doug at ewellic.org Mon Aug 2 00:09:00 2021
From: doug at ewellic.org (Doug Ewell)
Date: Sun, 1 Aug 2021 23:09:00 -0600
Subject: RE: superscript π?
In-Reply-To: <923fe98d-1f1b-270d-a9bb-36c9e48eda18@gmail.com>
References: <923fe98d-1f1b-270d-a9bb-36c9e48eda18@gmail.com>
Message-ID: <000701d7875c$81ac9070$8505b150$@ewellic.org>

Although the Unicode Technical Committee has stated many, many times that additional superscript and subscript letters will not be encoded, except to support natural-language orthographies and phonetic notation, it's impossible to predict whether that principle will be upheld forever. We do have things in Unicode today that were once "no, not ever" and some that appeared as a Notice of Non-Approval.

What is certain, however, is that no character will be added based solely on a public mailing-list thread. A formal proposal must be submitted. Even new emoji base characters require a submitted proposal, though it's quite different from that required for normal characters.

Perhaps by "I'm requesting" you meant you wanted to open discussion and build evidence and support on the public list before preparing your proposal.

More comments below.

David Chmelik wrote:

> There are several other Greek superscripts, so I'm again requesting a
> superscript π (pi, Greek letter p.)

The existence of other superscripts in Unicode is not precedent on its own for encoding others. Each must be justified individually.

> The previous argument that mathematics discussion should be in complex
> document formatted text (TeX, etc.) is false.

Simply asserting that a given argument is "false" usually isn't a great argumentative strategy and is unlikely to convince those who support that argument. "Oh, I'm wrong? Very well, then, I'm wrong. Sorry."

> There has always been, and still is, mathematics discussion in plain-
> text only, such as on NNTP/Usenet (and Gmane) and Internet Relay Chat
> (IRC) some of which still have very large science discussion areas.

Are you aware of Unicode Technical Report #25, "Unicode Support for Mathematics"? This was prepared, and revised many times, with quite a bit of input and support from the mathematical community. It provides a way to represent math notation, using the Unicode characters that currently exist, that is recognized and accepted by mathematicians.

> You should probably even have complete English and Greek superscripts
> & subscripts.

You may not realize how much this statement torpedoes your basic argument for encoding superscript π. "Let's encode all of them" is exactly the floodgate Unicode is trying to avoid opening.

> That doesn't mean it's necessary to have further levels of
> superscripts and subscripts because in such less-common cases there is
> notation for that.

But math does require arbitrary levels of superscripting and subscripting, as well as much more two-dimensional layout. So a great deal of math notation would still not be representable in plain text. Some already is, and has been long before Unicode: x² + y² = z². Adding another superscript character doesn't make math notation available in plain text in general; it just moves the have/have-not bar a little.

-- Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

From haberg-1 at telia.com Mon Aug 2 08:28:27 2021
From: haberg-1 at telia.com (Hans Åberg)
Date: Mon, 2 Aug 2021 15:28:27 +0200
Subject: Subscripts and superscripts
In-Reply-To: <000701d7875c$81ac9070$8505b150$@ewellic.org>
References: <923fe98d-1f1b-270d-a9bb-36c9e48eda18@gmail.com> <000701d7875c$81ac9070$8505b150$@ewellic.org>
Message-ID: <00CD705F-6298-4053-82E4-A2E4C26DC6DF@telia.com>

> On 2 Aug 2021, at 07:09, Doug Ewell via Unicode wrote:
>
> Although the Unicode Technical Committee has stated many, many times that additional superscript and subscript letters will not be encoded, except to support natural-language orthographies and phonetic notation, it's impossible to predict whether that principle will be upheld forever. We do have things in Unicode today that were once "no, not ever" and some that appeared as a Notice of Non-Approval.

A factor might be what modern text-only renderers might be able to handle. Today, the norm is a display where there ought to be no problem with subscripts and superscripts even when nested.

From ratmice at gmail.com Mon Aug 2 09:09:22 2021
From: ratmice at gmail.com (Matt Rice)
Date: Mon, 2 Aug 2021 14:09:22 +0000
Subject: Re: superscript π?
In-Reply-To:
References: <923fe98d-1f1b-270d-a9bb-36c9e48eda18@gmail.com>
Message-ID:

On Sun, Aug 1, 2021 at 1:56 PM Julian Bradfield via Unicode wrote:
>
> On 2021-08-01, David Chmelik via Unicode wrote:
> > less-common cases there is notation for that. In the more common cases
> > of just one level, scientists are tired of having to use other notation.
>
> Do you have evidence for that? I've never interacted with a scientist
> who would rather remember how to type a superscript unicode character
> than type ^.

I haven't looked through all the uses of superscripts and subscripts in the Lean theorem prover or mathlib, but I know they get used there somewhat frequently, and I myself at least once had to rename a variable to one which had a subscript from a character which did not.

Looking through the mathlib library in alphabetical order, the first file I opened contains many; I've highlighted some:
https://github.com/leanprover-community/mathlib/blob/master/src/algebra/algebra/basic.lean#L407-L412

As far as remembering unicode characters, when editing Lean you would typically use the abbreviations file rather than entering unicode characters directly, so \^ or \_. I don't really think editor or keymap interaction is a good argument either for or against any specific character.

Abbreviations for superscript/subscript characters in the editor highlighted (to enter the abbreviation you would prefix the character sequence with a '\'):
https://github.com/leanprover/vscode-lean/blob/master/src/abbreviation/abbreviations.json#L1150-L1316

From this perspective of picking variable names and eventually wanting superscript/subscript coverage for that variable, I'd imagine the more coverage the better, and I doubt anyone coming from Lean would argue against wider coverage...

my 2c...

From jk at koremail.com Mon Aug 2 09:31:23 2021
From: jk at koremail.com (jk at koremail.com)
Date: Mon, 02 Aug 2021 22:31:23 +0800
Subject: Subscripts and superscripts
In-Reply-To: <00CD705F-6298-4053-82E4-A2E4C26DC6DF@telia.com>
References: <923fe98d-1f1b-270d-a9bb-36c9e48eda18@gmail.com> <000701d7875c$81ac9070$8505b150$@ewellic.org> <00CD705F-6298-4053-82E4-A2E4C26DC6DF@telia.com>
Message-ID: <224b74bda49acbd6890b903b1c824565@koremail.com>

Proven of course by the fact that several superscript letters are to be added in Unicode 14, including superscript capital Q.

On 2021-08-02 21:28, Hans Åberg via Unicode wrote:
>> On 2 Aug 2021, at 07:09, Doug Ewell via Unicode
>> wrote:
>>
>> Although the Unicode Technical Committee has stated many, many times
>> that additional superscript and subscript letters will not be encoded,
>> except to support natural-language orthographies and phonetic
>> notation, it's impossible to predict whether that principle will be
>> upheld forever. We do have things in Unicode today that were once "no,
>> not ever" and some that appeared as a Notice of Non-Approval.
>
> A factor might be what modern text-only renderers might be able to
> handle. Today, the norm is a display where there ought to be no
> problem with subscripts and superscripts even when nested.

From junicode at jcbradfield.org Mon Aug 2 09:55:54 2021
From: junicode at jcbradfield.org (Julian Bradfield)
Date: Mon, 2 Aug 2021 15:55:54 +0100 (BST)
Subject: superscript π?
References: <923fe98d-1f1b-270d-a9bb-36c9e48eda18@gmail.com>
Message-ID:

On 2021-08-02, Matt Rice via Unicode wrote:
> I haven't looked through all the uses of superscripts and subscripts
> in the Lean theorem prover or mathlib,
> but I know they get used there somewhat frequently, and I myself at
> least once had to rename a variable to one which had a subscript
> from a character which did not.

> As far as remembering unicode characters, when editing Lean you would
> typically use the abbreviations file
> rather than entering unicode characters directly, so \^ or \_. I
> don't really think editor or keymap interaction is
> a good argument either for or against any specific character.

Equally, it's not a good argument for characters if tool writers can't be bothered to implement simple HTML-style display of things that must internally be structured data anyway. Though as far as I see in Lean, the scripts have no meaning; it's just that identifiers can contain subscript characters. You can just as well name your variable g_1 (or g1) as g₁.
I'm curious what forced you to use a subscript.

(Long ago, I made quite a fancy display mode for maths in Emacs. After a few years, I stopped using it. Nowadays I use Unicode, but only when writing email to students who might not yet be familiar with LaTeX.)

From prosfilaes at gmail.com Mon Aug 2 10:43:40 2021
From: prosfilaes at gmail.com (David Starner)
Date: Mon, 2 Aug 2021 08:43:40 -0700
Subject: Re: superscript π?
In-Reply-To: <000701d7875c$81ac9070$8505b150$@ewellic.org>
References: <923fe98d-1f1b-270d-a9bb-36c9e48eda18@gmail.com> <000701d7875c$81ac9070$8505b150$@ewellic.org>
Message-ID:

On Sun, Aug 1, 2021 at 10:13 PM Doug Ewell via Unicode wrote:
> You may not realize how much this statement torpedoes your basic argument for encoding superscript π. "Let's encode all of them" is exactly the floodgate Unicode is trying to avoid opening.

I'd add that not everyone uses Latin-based languages, as well. Why shouldn't a Cherokee math teacher be able to write ?^??? Or a Chinese economics teacher ?^?? Even within Latin languages, natural characters have all sorts of diacritics on them, like ?, ?, and ?, with mathematics offering its own set. CJKV ideographs alone are the majority of Unicode characters, and with a billion Chinese speakers, I'll eat my hat if there aren't published examples of ideographs being used that way.

--
The standard is written in English . If you have trouble understanding a particular section, read it again and again and again . . . Sit up straight. Eat your vegetables. Do not mumble.
-- _Pascal_, ISO 7185 (1991)

From ratmice at gmail.com Mon Aug 2 10:53:50 2021
From: ratmice at gmail.com (Matt Rice)
Date: Mon, 2 Aug 2021 15:53:50 +0000
Subject: Re: superscript π?
In-Reply-To:
References: <923fe98d-1f1b-270d-a9bb-36c9e48eda18@gmail.com>
Message-ID:

On Mon, Aug 2, 2021 at 2:58 PM Julian Bradfield via Unicode wrote:
>
> On 2021-08-02, Matt Rice via Unicode wrote:
> > I haven't looked through all the uses of superscripts and subscripts
> > in the Lean theorem prover or mathlib,
> > but I know they get used there somewhat frequently, and I myself at
> > least once had to rename a variable to one which had a subscript
> > from a character which did not.
>
> > As far as remembering unicode characters, when editing Lean you would
> > typically use the abbreviations file
> > rather than entering unicode characters directly, so \^ or \_. I
> > don't really think editor or keymap interaction is
> > a good argument either for or against any specific character.
>
> Equally, it's not a good argument for characters if tool writers can't
> be bothered to implement simple HTML-style display of things that must
> internally be structured data anyway. Though as far as I see in Lean,
> the scripts have no meaning; it's just that identifiers can contain
> subscript characters. You can just as well name your variable g_1 (or
> g1) as g₁. I'm curious what forced you to use a subscript.

I didn't write the linked code, but it is a theorem prover for machine-checked proofs, not a format intended for displaying equations, for which there are other things. These can be long and tedious and involve many variables derived from other variables...

Nothing "forced" me to use a subscript; it came about through a refactoring for clarity, in an effort to get rid of long, largely uninformative variables like hnm1 hmn1 hnm2 hmn2. But also because things such as a function's inverse are most naturally conveyed by identifiers such as f⁻¹...
Unlike _ ^ isn't a valid identifier... Keeping the core implementation of the language small and allowing Unicode identifiers gives a good balance between simplicity, while affording some more opportunity for proof clarity. (This is at least what I imagine to be the case; I didn't implement the language either)...

You asked for evidence. I offered some. There is plenty of evidence that you _can_ do without subscripts and superscripts in theorem provers that only support ASCII identifiers... yet people are clearly using sub/superscripts anyway when they are available.

From jk at koremail.com Mon Aug 2 11:22:25 2021
From: jk at koremail.com (jk at koremail.com)
Date: Tue, 03 Aug 2021 00:22:25 +0800
Subject: Re: superscript π?
In-Reply-To:
References: <923fe98d-1f1b-270d-a9bb-36c9e48eda18@gmail.com> <000701d7875c$81ac9070$8505b150$@ewellic.org>
Message-ID: <00249acb9143e73d7895af072c5b7f41@koremail.com>

On 2021-08-02 23:43, David Starner via Unicode wrote:
> Or a Chinese
> economics teacher ?^?? Even within Latin languages, natural characters
> have all sorts of diacritics on them, like ?, ?, and ?, with
> mathematics offering its own set. CJKV ideographs alone are the
> majority of Unicode characters, and with a billion Chinese speakers,
> I'll eat my hat if there aren't published examples of ideographs being
> used that way.

Fortunately there is no way to prove there are no such published cases in Chinese so your hat is safe, however mathematical formulas in Chinese are written using western conventions so use letters not characters for variables.

From abrahamgross at disroot.org Mon Aug 2 11:31:09 2021
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Mon, 02 Aug 2021 16:31:09 +0000
Subject: Re: superscript π?
In-Reply-To: <00249acb9143e73d7895af072c5b7f41@koremail.com>
References: <00249acb9143e73d7895af072c5b7f41@koremail.com> <923fe98d-1f1b-270d-a9bb-36c9e48eda18@gmail.com> <000701d7875c$81ac9070$8505b150$@ewellic.org>
Message-ID: <087c7c655be0dfbd247627bc1b58c604@disroot.org>

technically there are a couple of encoded superscript CJK ideographs: kanbun (https://www.unicode.org/charts/PDF/U3190.pdf)

There are probably more kanbun characters coming soon too. (https://www.unicode.org/L2/L2020/20232-kanbun-additions.pdf)

2021年8月2日 12:23, "John Knightley via Unicode" wrote:

> On 2021-08-02 23:43, David Starner via Unicode wrote:
>
>> Or a Chinese
>> economics teacher ?^?? Even within Latin languages, natural characters
>> have all sorts of diacritics on them, like ?, ?, and ?, with
>> mathematics offering its own set. CJKV ideographs alone are the
>> majority of Unicode characters, and with a billion Chinese speakers,
>> I'll eat my hat if there aren't published examples of ideographs being
>> used that way.
>
> Fortunately there is no way to prove there are no such published cases in Chinese so your hat is
> safe, however mathematical formulas in Chinese are written using western conventions so use letters
> not characters for variables.

From jk at koremail.com Mon Aug 2 11:50:49 2021
From: jk at koremail.com (jk at koremail.com)
Date: Tue, 03 Aug 2021 00:50:49 +0800
Subject: Re: superscript π?
In-Reply-To: <087c7c655be0dfbd247627bc1b58c604@disroot.org>
References: <00249acb9143e73d7895af072c5b7f41@koremail.com> <923fe98d-1f1b-270d-a9bb-36c9e48eda18@gmail.com> <000701d7875c$81ac9070$8505b150$@ewellic.org> <087c7c655be0dfbd247627bc1b58c604@disroot.org>
Message-ID:

Good point, though these are of course used in Japanese almost as a sort of markup language, and not used in Chinese, whether by mathematicians or not.

On 2021-08-03 00:31, Abraham Gross via Unicode wrote:
> technically there are a couple of encoded superscript CJK ideographs:
> kanbun (https://www.unicode.org/charts/PDF/U3190.pdf)
>
> There are probably more kanbun characters coming soon too.
> (https://www.unicode.org/L2/L2020/20232-kanbun-additions.pdf)
>
> 2021年8月2日 12:23, "John Knightley via Unicode"
> wrote:
>
>> On 2021-08-02 23:43, David Starner via Unicode wrote:
>>
>>> Or a Chinese
>>> economics teacher ?^?? Even within Latin languages, natural
>>> characters
>>> have all sorts of diacritics on them, like ?, ?, and ?, with
>>> mathematics offering its own set. CJKV ideographs alone are the
>>> majority of Unicode characters, and with a billion Chinese speakers,
>>> I'll eat my hat if there aren't published examples of ideographs
>>> being
>>> used that way.
>>
>> Fortunately there is no way to prove there are no such published cases
>> in Chinese so your hat is
>> safe, however mathematical formulas in Chinese are written using
>> western conventions so use letters
>> not characters for variables.

From kenwhistler at sonic.net Mon Aug 2 12:21:02 2021
From: kenwhistler at sonic.net (Ken Whistler)
Date: Mon, 2 Aug 2021 10:21:02 -0700
Subject: Re: superscript π?
In-Reply-To:
References: <923fe98d-1f1b-270d-a9bb-36c9e48eda18@gmail.com>
Message-ID: <9ffc794a-0497-424a-67c6-858dbfb1be7e@sonic.net>

On 8/2/2021 8:53 AM, Matt Rice via Unicode wrote:
> Unlike _ ^ isn't a valid identifier...

The ASCII U+005E ^ CIRCUMFLEX ACCENT is disallowed from most identifier syntax, but the corresponding modifier letter U+02C6 ˆ MODIFIER LETTER CIRCUMFLEX ACCENT is generally allowed, along with most of the letterform modifier letters from that same Spacing Modifier Letters block, e.g. U+02B7 ʷ MODIFIER LETTER SMALL W.

> Keeping the core implementation
> of the language small and allowing unicode identifiers gives a good
> balance between simplicity,
> while affording some more opportunity for proof clarity

So if you are looking for a creative balance, where e^iπ might not work because of the status of U+005E in a particular system, eˆiπ might work.

Of course, it would be nicest if e^iπ just worked. ;-)

--Ken

> (This is at
> least what I imagine to be the case, I didn't implement the language
> either)...

From prosfilaes at gmail.com Mon Aug 2 12:56:31 2021
From: prosfilaes at gmail.com (David Starner)
Date: Mon, 2 Aug 2021 10:56:31 -0700
Subject: Re: superscript π?
In-Reply-To: <00249acb9143e73d7895af072c5b7f41@koremail.com>
References: <923fe98d-1f1b-270d-a9bb-36c9e48eda18@gmail.com> <000701d7875c$81ac9070$8505b150$@ewellic.org> <00249acb9143e73d7895af072c5b7f41@koremail.com>
Message-ID:

On Mon, Aug 2, 2021 at 9:22 AM wrote:
>
> On 2021-08-02 23:43, David Starner via Unicode wrote:
> > Or a Chinese
> > economics teacher ?^??
Even within Latin languages, natural characters
> > have all sorts of diacritics on them, like ?, ?, and ?, with
> > mathematics offering its own set. CJKV ideographs alone are the
> > majority of Unicode characters, and with a billion Chinese speakers,
> > I'll eat my hat if there aren't published examples of ideographs being
> > used that way.
>
> Fortunately there is no way to prove there are no such published cases
> in Chinese so your hat is safe, however mathematical formulas in Chinese
> are written using western conventions so use letters not characters for
> variables.

You phrase that as if that's a response to what I wrote. It's not; I did not claim that that was standard in China, merely that, like Lean, it had been used.

--
The standard is written in English . If you have trouble understanding a particular section, read it again and again and again . . . Sit up straight. Eat your vegetables. Do not mumble.
-- _Pascal_, ISO 7185 (1991)

From jameskasskrv at gmail.com Mon Aug 2 16:28:15 2021
From: jameskasskrv at gmail.com (James Kass)
Date: Mon, 2 Aug 2021 21:28:15 +0000
Subject: Re: superscript π?
In-Reply-To:
References: <923fe98d-1f1b-270d-a9bb-36c9e48eda18@gmail.com> <000701d7875c$81ac9070$8505b150$@ewellic.org> <00249acb9143e73d7895af072c5b7f41@koremail.com>
Message-ID: <47154b70-f8b4-66ed-78ac-b4a5db7a7772@gmail.com>

On 2021-08-02 5:56 PM, David Starner via Unicode wrote:
> On Mon, Aug 2, 2021 at 9:22 AM wrote:
>> On 2021-08-02 23:43, David Starner via Unicode wrote:
>>> Or a Chinese
>>> economics teacher ?^?? Even within Latin languages, natural characters
>>> have all sorts of diacritics on them, like ?, ?, and ?, with
>>> mathematics offering its own set. CJKV ideographs alone are the
>>> majority of Unicode characters, and with a billion Chinese speakers,
>>> I'll eat my hat if there aren't published examples of ideographs being
>>> used that way.
>> Fortunately there is no way to prove there are no such published cases
>> in Chinese so your hat is safe, however mathematical formulas in Chinese
>> are written using western conventions so use letters not characters for
>> variables.
> You phrase that as if that's a response to what I wrote. It's not; I
> did not claim that that was standard in China, merely that, like Lean,
> it had been used.
>

John Knightley responded to what you wrote. Nobody asserted that you claimed this usage was standard in China. The part about Chinese math notation was informational.

If anyone wants to prove that CJK ideographs get used in superscript fashion, all that's needed is one published example. But it cannot be disproved because of a thing called 'negative proof'. Just because something can't be disproved doesn't make it true; it only means your hat is safe.

From arthur at reutenauer.eu Mon Aug 2 17:31:22 2021
From: arthur at reutenauer.eu (Arthur Rosendahl)
Date: Tue, 3 Aug 2021 00:31:22 +0200
Subject: superscript π?
In-Reply-To: <47154b70-f8b4-66ed-78ac-b4a5db7a7772@gmail.com>
References: <923fe98d-1f1b-270d-a9bb-36c9e48eda18@gmail.com> <000701d7875c$81ac9070$8505b150$@ewellic.org> <00249acb9143e73d7895af072c5b7f41@koremail.com> <47154b70-f8b4-66ed-78ac-b4a5db7a7772@gmail.com>
Message-ID: <20210802223122.lluisac7alfpt75c@phare.normalesup.org>

On Mon, Aug 02, 2021 at 09:28:15PM +0000, James Kass via Unicode wrote:
> it only means your hat is
> safe.

Considering the turn this thread has taken, I'd be remiss if I didn't point out that it also means that David won't have to ingest a hat and is thus safe too.

Arthur

From asmusf at ix.netcom.com Mon Aug 2 20:23:05 2021
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Mon, 2 Aug 2021 18:23:05 -0700
Subject: Re: superscript π?
In-Reply-To: <20210802223122.lluisac7alfpt75c@phare.normalesup.org>
References: <923fe98d-1f1b-270d-a9bb-36c9e48eda18@gmail.com> <000701d7875c$81ac9070$8505b150$@ewellic.org> <00249acb9143e73d7895af072c5b7f41@koremail.com> <47154b70-f8b4-66ed-78ac-b4a5db7a7772@gmail.com> <20210802223122.lluisac7alfpt75c@phare.normalesup.org>
Message-ID:

An HTML attachment was scrubbed...

From public at khwilliamson.com Wed Aug 11 11:11:44 2021
From: public at khwilliamson.com (Karl Williamson)
Date: Wed, 11 Aug 2021 10:11:44 -0600
Subject: I'm trying to grok emoji-sequences.txt
Message-ID: <4fa7bf72-407e-a8de-ef24-81fa79768a21@khwilliamson.com>

The first lines of that file are:

# emoji-sequences.txt
# Date: 2020-08-31, 01:06:24 GMT
# © 2020 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use, see http://www.unicode.org/terms_of_use.html
#
# Emoji Sequence Data for UTS #51
# Version: 13.1
#
# For documentation and usage, see http://www.unicode.org/reports/tr51
#
# Format:
#   code_point(s) ; type_field ; description # comments
# Fields:
#   code_point(s): one or more code points in hex format, separated by spaces
#   type_field, one of the following:
#       Basic_Emoji
#       Emoji_Keycap_Sequence
#       RGI_Emoji_Flag_Sequence
#       RGI_Emoji_Tag_Sequence
#       RGI_Emoji_Modifier_Sequence
#     The type_field is a convenience for parsing the emoji sequence files, and is not intended to be maintained as a property.
#   short name: CLDR short name of sequence; characters may be escaped with \x{hex}.
#
# For the purpose of regular expressions, each of the type fields defines the name of
# a binary property of strings. The short name of each property is the same as the long name.
#

My issues are:

short_name is mentioned but I don't see it appearing in the file.

It says that 'type_field' is not intended to be a property, but then it says the type fields define the name of properties of strings. I don't understand.

From markus.icu at gmail.com Wed Aug 11 11:59:25 2021
From: markus.icu at gmail.com (Markus Scherer)
Date: Wed, 11 Aug 2021 09:59:25 -0700
Subject: I'm trying to grok emoji-sequences.txt
In-Reply-To: <4fa7bf72-407e-a8de-ef24-81fa79768a21@khwilliamson.com>
References: <4fa7bf72-407e-a8de-ef24-81fa79768a21@khwilliamson.com>
Message-ID:

Hi Karl,

On Wed, Aug 11, 2021 at 9:15 AM Karl Williamson via Unicode <unicode at corp.unicode.org> wrote:

> short_name is mentioned but I don't see it appearing in the file
>

The "Format" line has a "description" field, not a "short name" field, but it looks like that's basically the same thing. For CLDR data about emoji in all languages including English, you should read the CLDR data files.

> It says that 'type_field' is not intended to be a property, but then it
> says the type fields define the name of properties of strings. I don't
> understand
>

At this time, these are not UCD properties, but for the purpose of regular expressions they are usable as properties, and they are recommended properties in UTS #18. (Two such properties were omitted and will be added in the next version.)

Hope this helps... you could submit a bug report for clarification.

Best regards,
markus

From richard.wordingham at ntlworld.com Sat Aug 28 09:13:56 2021
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sat, 28 Aug 2021 15:13:56 +0100
Subject: Tai Laing Sibilants
Message-ID: <20210828151356.2b74f505@JRWUBU2>

I thought I had seen a query from Ben Mitchell or Vinodh Rajan about this, but I can't find it, let alone a resolution, explanation or way forward. (I would have access to a discussion on Unicore.)

The anomaly appears when looking at the letters for Pali, as implicitly described in UTN 11 and as shown in L2/12-012 Figure 5:

c ၸ U+1078 MYANMAR LETTER SHAN CA
ch ꩬ U+AA6C MYANMAR LETTER KHAMTI SA
j ꧫ U+A9EB MYANMAR LETTER TAI LAING JA
jh ꧬ U+A9EC MYANMAR LETTER TAI LAING JHA
ñ ꧧ U+A9E7 MYANMAR LETTER TAI LAING NYA
s ꧬ U+A9EC MYANMAR LETTER TAI LAING JHA

The double application of U+A9EC as both <jh> and <s> is anomalous.

The voiced palatals for Pali are regularly formed by application of the overstriking voicing dot diacritic.

Now, L2/12-012 Figure 4 also gives letters for the consonant phonemes of Tai Laing. The first two rows are:

k က U+1000 MYANMAR LETTER KA
kʰ ၵ U+1075 MYANMAR LETTER SHAN KA
ŋ င U+1004 MYANMAR LETTER NGA

s ၸ U+1078 MYANMAR LETTER SHAN CA
sʰ (see below)
ɲ ꧧ U+A9E7 MYANMAR LETTER TAI LAING NYA

Comparing the letters with the reference forms, we see a large U-shaped bowl at the start of some letters, which has to be shrunk to obtain the reference forms of SHAN KA and TAI LAING NYA. If we apply the same analogy to /sʰ/ and trim that off, we find the letter သ SA.

I can think of a just-so story to explain what is going on:

1. Locally, the distinction of Pali <ch> and <s> has been lost, as seemingly in much of Northern Thailand, and in the corresponding vernacular, a merger that extends to Lao.

2. We therefore end up with Tai Laing viewed in isolation having three significant glyphs for writing /sʰ/ - (a) KHAMTI SA, (b) KHAMTI SA with a peg at the bottom of the middle, which may be interpreted as SA with a flourish at the start and is what is shown in Figure 4, and (c) the same again, but with the peg implemented as a dot in the right-hand bowl.

3. One way of writing Pali is to use Tai Laing writing phonetically, but with the extra consonants denoted by a dot - possibly borrowed from the European notation for Indic letters outside typical European repertoires. CHA and SA are distinguished by specialising the glyphs, in much the same way as IPA 'a' and 'ɑ'
are distinguished. When the chart in Figure 5 was drawn up, the wrong Tai Laing glyph was accidentally inserted for Pali <s>.

Can anyone with access to Tai Laing materials verify or disverify this story?

I think the question we have left is what should be the encoding of the character(s) for Tai Laing /sʰ/ and Pali <s> in Tai Laing orthography? The answer might simply be to use U+101E MYANMAR LETTER SA. But perhaps we need a new character rather than dismiss the bowl as mere font variation.

Richard.

From markus.icu at gmail.com Sun Aug 29 23:53:57 2021
From: markus.icu at gmail.com (Markus Scherer)
Date: Sun, 29 Aug 2021 21:53:57 -0700
Subject: happy 33rd on Unicode 88
Message-ID:

Happy 33rd anniversary of Unicode 88!
https://www.unicode.org/history/unicode88.pdf

Extending from the tagline of the 1998 reprint: 33 years of the Unicode Standard! ??

markus

From vinodh.vinodh at gmail.com Mon Aug 30 05:15:15 2021
From: vinodh.vinodh at gmail.com (Vinodh Rajan)
Date: Mon, 30 Aug 2021 12:15:15 +0200
Subject: Tai Laing Sibilants
In-Reply-To: <20210828151356.2b74f505@JRWUBU2>
References: <20210828151356.2b74f505@JRWUBU2>
Message-ID:

My original post (about 2 years ago) is here:

https://corp.unicode.org/pipermail/unicode/2019-August/008206.html

Vinodh

On Sat, Aug 28, 2021 at 4:17 PM Richard Wordingham via Unicode <unicode at corp.unicode.org> wrote:

> I thought I had seen a query from Ben Mitchell or Vinodh Rajan about
> this, but I can't find it, let alone a resolution, explanation or way
> forward. (I would have access to a discussion on Unicore.)
>
> The anomaly appears when looking at the letters for Pali, as implicitly
> described in UTN 11 and as shown in L2/12-012 Figure 5:
>
> c ၸ U+1078 MYANMAR LETTER SHAN CA
>
> ch ꩬ U+AA6C MYANMAR LETTER KHAMTI SA
>
> j ꧫ U+A9EB MYANMAR LETTER TAI LAING JA
>
> jh ꧬ U+A9EC MYANMAR LETTER TAI LAING JHA
>
> ñ ꧧ U+A9E7 MYANMAR LETTER TAI LAING NYA
>
> s ꧬ
U+A9EC MYANMAR LETTER TAI LAING JHA
>
> The double application of U+A9EC as both <jh> and <s> is anomalous.
>
> The voiced palatals for Pali are regularly formed by application of the
> overstriking voicing dot diacritic.
>
> Now, L2/12-012 Figure 4 also gives letters for the consonant phonemes
> of Tai Laing. The first two rows are:
>
> k က U+1000 MYANMAR LETTER KA
>
> kʰ ၵ U+1075 MYANMAR LETTER SHAN KA
>
> ŋ င U+1004 MYANMAR LETTER NGA
>
> s ၸ U+1078 MYANMAR LETTER SHAN CA
>
> sʰ (see below)
>
> ɲ ꧧ U+A9E7 MYANMAR LETTER TAI LAING NYA
>
>
> Comparing the letters with the reference forms, we see a large U-shaped
> bowl at the start of some letters, which has to be shrunk to obtain the
> reference forms of SHAN KA and TAI LAING NYA. If we apply the same
> analogy to /sʰ/ and trim that off, we find the letter သ SA.
>
> I can think of a just-so story to explain what is going on:
>
> 1. Locally, the distinction of Pali <ch> and <s> has been lost, as
> seemingly in much of Northern Thailand, and in the corresponding
> vernacular, a merger that extends to Lao.
>
> 2. We therefore end up with Tai Laing viewed in isolation having three
> significant glyphs for writing /sʰ/ - (a) KHAMTI SA, (b) KHAMTI SA with
> a peg at the bottom of the middle, which may be interpreted as SA with
> a flourish at the start and is what is shown in Figure 4, and (c) the
> same again, but with the peg implemented as a dot in the right-hand
> bowl.
>
> 3. One way of writing Pali is to use Tai Laing writing phonetically, but
> with the extra consonants denoted by a dot - possibly borrowed from the
> European notation for Indic letters outside typical European
> repertoires. CHA and SA are distinguished by specialising the glyphs,
> in much the same way as IPA 'a' and 'ɑ' are distinguished. When the
> chart in Figure 5 was drawn up, the wrong Tai Laing glyph was
> accidentally inserted for Pali <s>.
>
> Can anyone with access to Tai Laing materials verify or disverify this
> story?
>
> I think the question we have left is what should be the
> encoding of the character(s) for Tai Laing /sʰ/ and Pali 's' in Tai
> Laing orthography? The answer might simply be to use U+101E MYANMAR
> LETTER SA. But perhaps we need a new character rather than dismiss the
> bowl as mere font variation.
>
> Richard.
>

--
http://www.virtualvinodh.com

From richard.wordingham at ntlworld.com Mon Aug 30 11:17:43 2021
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 30 Aug 2021 17:17:43 +0100
Subject: Tai Laing Sibilants
In-Reply-To:
References: <20210828151356.2b74f505@JRWUBU2>
Message-ID: <20210830171743.7e3b150b@JRWUBU2>

On Mon, 30 Aug 2021 12:15:15 +0200 Vinodh Rajan via Unicode wrote:

> My original post (about 2 years ago) is here:
>
> https://corp.unicode.org/pipermail/unicode/2019-August/008206.html

That references L2/11-130R, which I can't find. The document register only knows L2/11-130 dated 19 April 2011. I wonder if it was lost in the great crash.

I have more to say at the end.

> On Sat, Aug 28, 2021 at 4:17 PM Richard Wordingham via Unicode <unicode at corp.unicode.org> wrote:

> > I thought I had seen a query from Ben Mitchell or Vinodh Rajan about
> > this, but I can't find it, let alone a resolution, explanation or
> > way forward. (I would have access to a discussion on Unicore.)

> > The anomaly appears when looking at the letters for Pali, as
> > implicitly described in UTN 11 and as shown in L2/12-012 Figure 5:

> > c ၸ U+1078 MYANMAR LETTER SHAN CA
> > ch ꩬ U+AA6C MYANMAR LETTER KHAMTI SA
> > j ꧫ U+A9EB MYANMAR LETTER TAI LAING JA
> > jh ꧬ U+A9EC MYANMAR LETTER TAI LAING JHA
> > ñ ꧧ U+A9E7 MYANMAR LETTER TAI LAING NYA
> > s ꧬ U+A9EC MYANMAR LETTER TAI LAING JHA

> > The double application of U+A9EC as both 'jh' and 's' is anomalous.

> > The voiced palatals for Pali are regularly formed by application of
> > the overstriking voicing dot diacritic.

> > Now, L2/12-012 Figure 4 also gives letters for the consonant
> > phonemes of Tai Laing. The first two rows are:

> > k က U+1000 MYANMAR LETTER KA
> > kʰ ၵ U+1075 MYANMAR LETTER SHAN KA
> > ŋ င U+1004 MYANMAR LETTER NGA
> > s ၸ U+1078 MYANMAR LETTER SHAN CA
> > sʰ (see below)
> > ɲ ꧧ U+A9E7 MYANMAR LETTER TAI LAING NYA
> >
> > Comparing the letters with the reference forms, we see a large
> > U-shaped bowl at the start of some letters, which has to be shrunk
> > to obtain the reference forms of SHAN KA and TAI LAING NYA. If we
> > apply the same analogy to /sʰ/ and trim that off, we find the
> > letter သ SA.

> > I can think of a just-so story to explain what is going on:

> > 1. Locally, the distinction of Pali 'ch' and 's' has been lost, as
> > seemingly in much of Northern Thailand, and in the corresponding
> > vernacular, a merger that extends to Lao.

> > 2. We therefore end up with Tai Laing viewed in isolation having
> > three significant glyphs for writing /sʰ/ - (a) KHAMTI SA, (b)
> > KHAMTI SA with a peg at the bottom of the middle, which may be
> > interpreted as SA with a flourish at the start and is what is shown
> > in Figure 4, and (c) the same again, but with the peg implemented
> > as a dot in the right-hand bowl.

> > 3. One way of writing Pali is to use Tai Laing writing
> > phonetically, but with the extra consonants denoted by a dot -
> > possibly borrowed from the European notation for Indic letters
> > outside typical European repertoires. CHA and SA are distinguished
> > by specialising the glyphs, in much the same way as IPA 'a' and 'ɑ'
> > are distinguished. When the chart in Figure 5 was drawn up, the
> > wrong Tai Laing glyph was accidentally inserted for Pali 's'.

> > Can anyone with access to Tai Laing materials verify or disverify
> > this story?

> > I think the question we have left is what should be the
> > encoding of the character(s) for Tai Laing /sʰ/ and Pali 's' in Tai
> > Laing orthography?
The answer might simply be to use U+101E MYANMAR
> > LETTER SA. But perhaps we need a new character rather than dismiss
> > the bowl as mere font variation.

Comparing L2/11-130 and the final(?) document, L2/12-012, there are some interesting changes. It was originally proposed that a new character be encoded for 'KHAMTI SA with peg' (this is where a glyph registry would be handy!), 'MYANMAR LETTER TAI LAING SHA'. By L2/12-012, it had been unified with MYANMAR LETTER KHAMTI SA.

There is a very relevant paragraph on p5 of L2/12-012; I don't know if it is in L2/11-130R:

"The proposed characters have been circled in figure 4, with all the other characters already supported in the UCS. Notice that the labelling of the sa and sʰa characters is wrong when compared with the Pali based shiksha. In addition the shape of the sʰa belies its underlying encoding of U+AA6C MYANMAR LETTER KHAMTI SA, as can be seen in figure 5".

I think the analysis was thrown off by the general Shan pattern that Proto-SW Tai *c yields Shan /s/ and Proto-SW Tai *s yields Shan /sʰ/, so I see no contradiction between the chart of native consonants and the palatal row (c, ch, ...) of the Pali chart. (Recall how 's' takes the position of 'ch' in the Lao alphabet, and the merger of the two sounds in Northern Thai and Tai Khuen.)

The material in Tai Laing in L2/12-012 clearly uses the glyph with the peg for the native /sʰ/, so I think UTN 11 should have used U+AA6C in the slot for 's' in the character lists.

As to what happens with Pali in Tai Laing, I think we're going to have to find some documents ourselves. I tried googling for various plausible Tai Laing spellings of 'gacchāmi' (all with U+A9E9 MYANMAR LETTER TAI LAING GA), but had no luck - not even Facebook.

The shiksha on p7 might even be right. Very few Pali words start with 'jh', word-internally it follows 'ñ' or 'j', and 'ñs' and 'js' are not possible Pali clusters, so the system could work.
It's just very odd to have a diacritic as part of a very common letter like 's'. And Latin 'i' usually has its tittle.

Richard.

From kenwhistler at sonic.net Mon Aug 30 11:48:40 2021
From: kenwhistler at sonic.net (Ken Whistler)
Date: Mon, 30 Aug 2021 09:48:40 -0700
Subject: Tai Laing Sibilants
In-Reply-To: <20210830171743.7e3b150b@JRWUBU2>
References: <20210828151356.2b74f505@JRWUBU2> <20210830171743.7e3b150b@JRWUBU2>
Message-ID:

Richard,

On 8/30/2021 9:17 AM, Richard Wordingham via Unicode wrote:
> That references L2/11-130R, which I can't find. The document register
> only knows L2/11-130 dated 19 April 2011. I wonder if it was lost in
> the great crash.

No. The document register from 2011 (and all the other document registers prior to 2020) was unaffected by the VM crash of April 2020. There never was a L2/11-130R filed in the document register.

--ken

From michel at suignard.com Mon Aug 30 14:41:04 2021
From: michel at suignard.com (Michel Suignard)
Date: Mon, 30 Aug 2021 19:41:04 +0000
Subject: Tai Laing Sibilants
In-Reply-To:
References: <20210828151356.2b74f505@JRWUBU2> <20210830171743.7e3b150b@JRWUBU2>
Message-ID:

It is, however, on the WG2 side as https://www.unicode.org/wg2/docs/n3976.pdf
There are quite a few of those cases out there (where L2 and WG2 are not in sync in their revision history), same date but quite different. I think L2 should add this one for reference.

Michel

-----Original Message-----
From: Unicode On Behalf Of Ken Whistler via Unicode
Sent: Monday, August 30, 2021 9:49 AM
To: Richard Wordingham
Cc: unicode at corp.unicode.org
Subject: Re: Tai Laing Sibilants

Richard,

On 8/30/2021 9:17 AM, Richard Wordingham via Unicode wrote:
> That references L2/11-130R, which I can't find. The document register
> only knows L2/11-130 dated 19 April 2011. I wonder if it was lost in
> the great crash.

No. The document register from 2011 (and all the other document registers prior to 2020) was unaffected by the VM crash of April 2020.
There never was a L2/11-130R filed in the document register.

--ken

From richard.wordingham at ntlworld.com Mon Aug 30 18:01:36 2021
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 31 Aug 2021 00:01:36 +0100
Subject: Tai Laing Sibilants
In-Reply-To:
References: <20210828151356.2b74f505@JRWUBU2> <20210830171743.7e3b150b@JRWUBU2>
Message-ID: <20210831000136.008a9119@JRWUBU2>

On Mon, 30 Aug 2021 19:41:04 +0000 Michel Suignard via Unicode wrote:

> It is however in the WG2 side as
> https://www.unicode.org/wg2/docs/n3976.pdf Quite a few of those cases
> out there (where L2 and WG2 are not in sync in their revision
> history), same date but quite different. I think L2 should add this
> one for reference. Michel

Thanks for the clarification. Unfortunately, the resolution is rather confusing.

At L2, we have L2/12-012, labelled in the top right-hand corner as replacing L2/11-130, and dated 23 May 2011 and at the first line of the top right-hand corner labelled as ISO/IEC JTC1/SC2/WG2 N3976.

Meanwhile, at WG2, the file https://www.unicode.org/wg2/docs/n3976.pdf bears in the top right-hand corner the reference numbers ISO/IEC JTC1/SC2/WG2 N3976R and L2/11-130R and the date 19 April 2011. However, in the title block, it bears two dates, 23 May 2011 and 11 February 2012.

It looks as though the sequence is:

Latest date   WG2 number   L2 number    Held in
2011-04-19    -            L2/11-130    L2
2011-05-23    N3976        L2/12-012    L2
2012-02-11    N3976R       L2/11-130R   WG2

L2/11-130R seems to differ from L2/12-012 by the addition of a section about confusables on p9. There might be other changes that I didn't spot in a quick visual scan. They seem to have the same repertoire of additional points.

Richard.