From tom at honermann.net Sun Apr 4 17:07:58 2021
From: tom at honermann.net (Tom Honermann)
Date: Sun, 4 Apr 2021 18:07:58 -0400
Subject: White spaces for the purpose of programming languages
In-Reply-To:
References:
Message-ID: <5c071b21-0fa9-b35c-f9eb-226c13ced32c@honermann.net>

On 3/31/21 11:10 PM, Markus Scherer via Unicode wrote:
>
>   o I can't tell if the EBCDIC platforms are "alive". Elsewhere I
>     have tried to find out if there is a competent C++11 compiler
>     available.
>
Yes, EBCDIC platforms continue to roam the earth. IBM's traditional xlC for z/OS compiler is effectively on life support and stuck at a pre-C++11 language level, but there are multiple options for recent C++ language standards available today. IBM has other EBCDIC-based OSs as well, but I'm not as familiar with them.

IBM started distributing a Clang-based compiler (xlclang) with XL C/C++ V2.3.1 for z/OS two years ago and has started posting patches to LLVM Clang to add z/OS support. One such patch, which enables -fexec-charset to support IBM-1047 (an EBCDIC encoding of the ISO-8859-1 character repertoire), is currently in review.

Dignus offers Systems/C++, an LLVM-based C++ compiler that, as of version 2.25 released last year, supports C++17.

Tom.

From wjgo_10009 at btinternet.com Mon Apr 5 12:00:02 2021
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Mon, 5 Apr 2021 18:00:02 +0100 (BST)
Subject: A new era of encoding
Message-ID:

One week to go for the Public Review on QID Emoji.

Then a report from the Emoji Subcommittee about what they propose.

Then consideration by the Unicode Technical Committee.

In my opinion this has implications far beyond just emoji, because what is under consideration is the introduction of an auxiliary encoding space, with its own map of code points.
So if one auxiliary encoding space is introduced, then there is the possibility of other such auxiliary encoding spaces being introduced.

I consider that the introduction of such auxiliary encoding spaces will be a good thing, with great benefits. It will allow new concepts to be implemented in a rigorous interoperable format.

William Overington

Monday 5 April 2021

From public at khwilliamson.com Mon Apr 5 16:44:56 2021
From: public at khwilliamson.com (Karl Williamson)
Date: Mon, 5 Apr 2021 15:44:56 -0600
Subject: White spaces for the purpose of programming languages
In-Reply-To: <5c071b21-0fa9-b35c-f9eb-226c13ced32c@honermann.net>
References: <5c071b21-0fa9-b35c-f9eb-226c13ced32c@honermann.net>
Message-ID: <9fbc35cd-4e7e-2eab-45eb-6085cc9bf44a@khwilliamson.com>

On 4/4/21 4:07 PM, Tom Honermann via Unicode wrote:
> On 3/31/21 11:10 PM, Markus Scherer via Unicode wrote:
>>
>>   o I can't tell if the EBCDIC platforms are "alive". Elsewhere I
>>     have tried to find out if there is a competent C++11 compiler
>>     available.
>>
> Yes, EBCDIC platforms continue to roam the earth. IBM's traditional xlC
> for z/OS compiler is effectively on life support and stuck at a
> pre-C++11 language level, but there are multiple options for recent C++
> language standards available today. IBM has other EBCDIC-based OSs as
> well, but I'm not as familiar with them.
>
> IBM started distributing a Clang-based compiler (xlclang) with XL C/C++
> V2.3.1 for z/OS two years ago and has started posting patches to LLVM
> Clang to add z/OS support. One such patch, which enables -fexec-charset
> to support IBM-1047 (an EBCDIC encoding of the ISO-8859-1 character
> repertoire), is currently in review.
>
> Dignus offers Systems/C++, an LLVM-based C++ compiler that, as of
> version 2.25 released last year, supports C++17.
>
> Tom.
>

Both modern Python and Perl run on z/OS.
Perl offers full support of UTF-EBCDIC; I don't know about Python.

From wjgo_10009 at btinternet.com Tue Apr 13 11:49:25 2021
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Tue, 13 Apr 2021 17:49:25 +0100 (BST)
Subject: Haiku Poetry Day Saturday 17 April 2021
Message-ID: <40dc874.f2b4.178cc233fdd.Webtop.93@btinternet.com>

https://www.daysoftheyear.com/days/haiku-poetry-day/

Are haiku poems written in any languages other than Japanese and English?

It occurs to me that a poem with a 5-7-5 structure in one language may well not have a 5-7-5 structure when translated into another.

Also, meaning could be lost. For example, if the word 'Elles' in French is translated to 'They' in English, the translation may lose the meaning.

Although not a haiku poem, my song lyrics about colourful fonts would lose some meaning in translation if 'Elles' were translated to 'They'.

http://www.users.globalnet.co.uk/~ngo/une_chanson.pdf

I remember that one year this mailing list featured a haiku contest. I entered but did not win a prize.

Perhaps there could be another haiku contest this year, though different in that all entries must be posted to this mailing list, so that they would be archived and publicly available. Maybe it could be more of a festival than a contest, with original haiku in many languages that can be represented in Unicode.

Actually, I sent in fourteen entries and afterwards I placed them in the mailing list archives. I have searched but cannot find them yet.

Can one have haiku expressed in emoji?
William Overington

Tuesday 13 April 2021

From jameskass at code2001.com Tue Apr 13 13:49:15 2021
From: jameskass at code2001.com (James Kass)
Date: Tue, 13 Apr 2021 18:49:15 +0000
Subject: Haiku Poetry Day Saturday 17 April 2021
In-Reply-To: <40dc874.f2b4.178cc233fdd.Webtop.93@btinternet.com>
References: <40dc874.f2b4.178cc233fdd.Webtop.93@btinternet.com>
Message-ID: <16a24830-de21-d86e-9bf9-fc540e3f3c71@code2001.com>

On 2021-04-13 4:49 PM, William_J_G Overington via Unicode wrote:
> I remember that one year this mailing list featured a haiku contest.

http://blog.unicode.org/2009/09/unicode-announcement-unicode-haiku.html

From junicode at jcbradfield.org Tue Apr 13 14:19:17 2021
From: junicode at jcbradfield.org (Julian Bradfield)
Date: Tue, 13 Apr 2021 20:19:17 +0100 (BST)
Subject: Haiku Poetry Day Saturday 17 April 2021
References: <40dc874.f2b4.178cc233fdd.Webtop.93@btinternet.com>
Message-ID:

On 2021-04-13, William_J_G Overington via Unicode wrote:
> https://www.daysoftheyear.com/days/haiku-poetry-day/
>
> Are haiku poems written in any languages other than Japanese and
> English?
>
> It occurs to me that the 5-7-5 structure in one language may well not
> have a 5-7-5 structure if translated from one language to another.
>
> Also, meaning could be lost. For example if the word 'Elles' in French
> is translated to 'They' in English, the translation may lose the
> meaning.

I have a friend who sometimes writes quadrilingual haikus: the same sentiment expressed in haiku form in each of French, German, English and Quenya. As with all translation, there are choices to be made.
From john.w.kennedy at gmail.com Tue Apr 13 14:45:08 2021
From: john.w.kennedy at gmail.com (John W Kennedy)
Date: Tue, 13 Apr 2021 15:45:08 -0400
Subject: Haiku Poetry Day Saturday 17 April 2021
In-Reply-To: <40dc874.f2b4.178cc233fdd.Webtop.93@btinternet.com>
References: <40dc874.f2b4.178cc233fdd.Webtop.93@btinternet.com>
Message-ID:

On Apr 13, 2021, at 12:49 PM, William_J_G Overington via Unicode wrote:

> https://www.daysoftheyear.com/days/haiku-poetry-day/
>
> Are haiku poems written in any languages other than Japanese and English?

I don't know of any other languages in which haiku /are/ written; I dare say French would serve better than English.

English hates haiku;
Syllables, like April snow,
Melt and flow away.

Of course, one might always question the meaning of "mora",
Returning to the springtime of English and the forge of the ancient English scops,
But I fear that a seven-footed line's too long.

> It occurs to me that the 5-7-5 structure in one language may well not have a 5-7-5 structure if translated from one language to another.

"Traduttore traditore." (And the more exquisite, the worse.)

--
John W Kennedy
"...when you're trying to build a house of cards, the last thing you should do is blow hard and wave your hands like a madman."
-- Rupert Goodwins

From eik at iki.fi Tue Apr 13 23:55:26 2021
From: eik at iki.fi (eik at iki.fi)
Date: Wed, 14 Apr 2021 07:55:26 +0300
Subject: Haiku Poetry Day Saturday 17 April 2021
Message-ID: <001001d730ea$63b9a3f0$2b2cebd0$@iki.fi>

There are also several haiku written and published in Finnish, at least.

Erkki I. Kolehmainen
Mannerheimintie 75 B 37, 00270 Helsinki, Finland
Mob: +358 400 825 943

-----Original message-----
From: Unicode, on behalf of John W Kennedy via Unicode
Sent: Tuesday, 13 April 2021 22:45
To: Unicode Discussion
Subject: Re: Haiku Poetry Day Saturday 17 April 2021

On Apr 13, 2021, at 12:49 PM, William_J_G Overington via Unicode wrote:

> https://www.daysoftheyear.com/days/haiku-poetry-day/
>
> Are haiku poems written in any languages other than Japanese and English?

I don't know of any other languages in which haiku /are/ written; I dare say French would serve better than English.

English hates haiku;
Syllables, like April snow,
Melt and flow away.

Of course, one might always question the meaning of "mora",
Returning to the springtime of English and the forge of the ancient English scops,
But I fear that a seven-footed line's too long.

> It occurs to me that the 5-7-5 structure in one language may well not have a 5-7-5 structure if translated from one language to another.

"Traduttore traditore." (And the more exquisite, the worse.)

--
John W Kennedy
"...when you're trying to build a house of cards, the last thing you should do is blow hard and wave your hands like a madman."
-- Rupert Goodwins

From marius.spix at web.de Wed Apr 14 02:11:33 2021
From: marius.spix at web.de (Marius Spix)
Date: Wed, 14 Apr 2021 09:11:33 +0200
Subject: Aw: RE: Haiku Poetry Day Saturday 17 April 2021
In-Reply-To: <001001d730ea$63b9a3f0$2b2cebd0$@iki.fi>
References: <001001d730ea$63b9a3f0$2b2cebd0$@iki.fi>
Message-ID:

In German, there is a similar poetic form called Elfchen (diminutive of "Elf", which means "eleven"), which has the structure 1 word – 2 words – 3 words – 4 words – 1 word.

> Sent: Wednesday, 14 April 2021 at 06:55
> From: "eik--- via Unicode"
> To: "'John W Kennedy'", unicode at unicode.org
> Subject: RE: Haiku Poetry Day Saturday 17 April 2021
>
> There are also several haiku written and published in Finnish, at least.
>
> Erkki I. Kolehmainen
> Mannerheimintie 75 B 37, 00270 Helsinki, Finland
> Mob: +358 400 825 943
>
> -----Original message-----
> From: Unicode, on behalf of John W Kennedy via Unicode
> Sent: Tuesday, 13 April 2021 22:45
> To: Unicode Discussion
> Subject: Re: Haiku Poetry Day Saturday 17 April 2021
>
> On Apr 13, 2021, at 12:49 PM, William_J_G Overington via Unicode wrote:
>
> > https://www.daysoftheyear.com/days/haiku-poetry-day/
> >
> > Are haiku poems written in any languages other than Japanese and English?
>
> I don't know of any other languages in which haiku /are/ written; I dare say French would serve better than English.
>
> English hates haiku;
> Syllables, like April snow,
> Melt and flow away.
>
> Of course, one might always question the meaning of "mora",
> Returning to the springtime of English and the forge of the ancient English scops,
> But I fear that a seven-footed line's too long.
>
> > It occurs to me that the 5-7-5 structure in one language may well not have a 5-7-5 structure if translated from one language to another.
>
> "Traduttore traditore." (And the more exquisite, the worse.)
>
> --
> John W Kennedy
> "...when you're trying to build a house of cards, the last thing you should do is blow hard and wave your hands like a madman."
> -- Rupert Goodwins
>

From doug at ewellic.org Wed Apr 14 11:41:28 2021
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 14 Apr 2021 10:41:28 -0600
Subject: Need reference to good ABNF for \uXXXX syntax
Message-ID: <001001d7314d$04c4d410$0e4e7c30$@ewellic.org>

Is anyone aware of an existing RFC or other specification that includes complete, correct, and clear ABNF for Unicode escape sequences using the UTF-16 encoding scheme?

Examples:
\u0041
\u3042
\uD801\uDC02 (NOT: \U00010402)

This type of sequence is described in Section 6.3 of RFC 5137, but that RFC does not recommend this syntax and does not include ABNF for it.

"Correct" implies, for instance, that the ABNF excludes unpaired surrogates.
To be clear, I'm NOT looking for someone on this list to contribute their own code, but rather a pointer to code that is already published, and easy for another document, such as an I-D, to reference.

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

From markus.icu at gmail.com Wed Apr 14 12:43:51 2021
From: markus.icu at gmail.com (Markus Scherer)
Date: Wed, 14 Apr 2021 07:43:51 -1000
Subject: Need reference to good ABNF for \uXXXX syntax
In-Reply-To: <001001d7314d$04c4d410$0e4e7c30$@ewellic.org>
References: <001001d7314d$04c4d410$0e4e7c30$@ewellic.org>
Message-ID:

Hi Doug,

On Wed, Apr 14, 2021 at 6:45 AM Doug Ewell via Unicode wrote:

> Is anyone aware of an existing RFC or other specification that includes
> complete, correct, and clear ABNF for Unicode escape sequences using the
> UTF-16 encoding scheme?
> ...
> "Correct" implies, for instance, that the ABNF excludes unpaired
> surrogates.
>

I was looking for something, but all I can find is either loose about surrogates (e.g., Java), or deals in code points rather than UTF-16 code units.

Can you say why you want/need strict 16-bit escapes for well-formed UTF-16 code units, rather than what others are doing?

markus

From wjgo_10009 at btinternet.com Wed Apr 14 13:55:18 2021
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Wed, 14 Apr 2021 19:55:18 +0100 (BST)
Subject: Colours
Message-ID: <24f9813c.115dc.178d1bcda63.Webtop.107@btinternet.com>

A very interesting document was added yesterday to the Current Document Register.

Examining Emoji Color Spaces: A Strategy for Improving the Coverage of Heart Emoji

https://www.unicode.org/L2/L2021/21075-heart-emoji-coverage.pdf

I have been studying this and have started a thread in the Affinity forum, which some readers might perhaps find of interest.
https://forum.affinity.serif.com/index.php?/topic/140122-colours/

William Overington

Wednesday 14 April 2021

From duerst at it.aoyama.ac.jp Wed Apr 14 18:50:43 2021
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Thu, 15 Apr 2021 08:50:43 +0900
Subject: Need reference to good ABNF for \uXXXX syntax
In-Reply-To: <001001d7314d$04c4d410$0e4e7c30$@ewellic.org>
References: <001001d7314d$04c4d410$0e4e7c30$@ewellic.org>
Message-ID:

Hello Doug,

On 2021-04-15 01:41, Doug Ewell via Unicode wrote:
> Is anyone aware of an existing RFC or other specification that includes complete, correct, and clear ABNF for Unicode escape sequences using the UTF-16 encoding scheme?
>
> Examples:
> \u0041
> \u3042
> \uD801\uDC02 (NOT: \U00010402)
>
> This type of sequence is described in Section 6.3 of RFC 5137, but that RFC does not recommend this syntax and does not include ABNF for it.
>
> "Correct" implies, for instance, that the ABNF excludes unpaired surrogates.
>
> To be clear, I'm NOT looking for someone on this list to contribute their own code, but rather a pointer to code that is already published, and easy for another document, such as an I-D, to reference.

So I guess you are looking for something like the regular expression on https://www.w3.org/International/questions/qa-forms-utf-8, but for the above syntax (rather than byte sequences in UTF-8) and in ABNF.

The closest I was able to come up with from memory may be https://tools.ietf.org/html/rfc5137, but it's not exactly what you want.

I'd guess it might be quicker for you to put something together on your own (and then maybe run it by this list).

Regards, Martin.
> --
> Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org
>

From doug at ewellic.org Wed Apr 14 20:52:11 2021
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 14 Apr 2021 19:52:11 -0600
Subject: Need reference to good ABNF for \uXXXX syntax
In-Reply-To:
References: <001001d7314d$04c4d410$0e4e7c30$@ewellic.org>
Message-ID: <002701d73199$f3e94f70$dbbbee50$@ewellic.org>

Markus Scherer wrote:

> I was looking for something, but all I can find is either loose about
> surrogates (e.g.,
> https://docs.oracle.com/javase/specs/jls/se7/html/jls-3.html), or
> deals in code points rather than UTF-16 code units.

Yes, the text of the Java spec knows about concatenating a high surrogate and a low surrogate, but doesn't know about excluding unpaired surrogates. So the syntax on that page is really just pre-1993 UCS-2.

> Can you say why you want/need strict 16-bit escapes for well-formed
> UTF-16 code units, rather than what others are doing?

It's for an update to RFC 8610, which defines CDDL, a metalanguage for expressing CBOR data structures. The syntax is already defined and out in the field, so it's too late to change it, but the ABNF describing it was incorrect and someone filed an erratum.

The discussion was on how to fix the ABNF, and I thought it would be better to find and validate a rule already published than to create an all-new, probably slightly different, and possibly buggy one.

In the end, some time after I wrote my message, the decision was made to create a new rule (see below). Fortunately it has a lot of eyes on it, and seems to be correct.

Martin J. Dürst wrote:

> So I guess you are looking for something like the regular expression
> on https://www.w3.org/International/questions/qa-forms-utf-8, but for
> the above syntax (rather than byte sequences in UTF-8) and in ABNF.

Yes.

> The closest I was able to come up from memory may be
> https://tools.ietf.org/html/rfc5137, but it's not exactly what you
> want.
No, that just repeats the Java spec's UCS-2 definition, in real ABNF instead of whatever the Java spec is using. I did mention RFC 5137 in my post; when I said it didn't include ABNF for the syntax we're talking about, I meant surrogate-aware ABNF.

> I'd guess it might be quicker for you to put something together on
> your own (and then maybe run it by this list).

What Carsten Bormann came up with was this:

   hexchar = non-surrogate / (high-surrogate "\" %x75 low-surrogate)

   non-surrogate = ((DIGIT / "A" / "B" / "C" / "E" / "F") 3HEXDIG) /
                   ("D" %x30-37 2HEXDIG )

   high-surrogate = "D" ("8" / "9" / "A" / "B") 2HEXDIG

   low-surrogate = "D" ("C" / "D" / "E" / "F") 2HEXDIG

(My contribution was to define non-surrogate, high-surrogate, and low-surrogate separately instead of making this one behemoth rule.)

J Decker wrote:

> There's also long encode in JS using \u{NNNNN} where the N digits
> aren't required, because there's a framing of {}.... this allows one
> to specify a character without surrogate encoding.

Thanks, but the goal was not to find a better encoding for CDDL, but to find good ABNF for the encoding that CDDL already uses.

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

From markus.icu at gmail.com Thu Apr 15 00:09:04 2021
From: markus.icu at gmail.com (Markus Scherer)
Date: Wed, 14 Apr 2021 19:09:04 -1000
Subject: Need reference to good ABNF for \uXXXX syntax
In-Reply-To: <002701d73199$f3e94f70$dbbbee50$@ewellic.org>
References: <001001d7314d$04c4d410$0e4e7c30$@ewellic.org> <002701d73199$f3e94f70$dbbbee50$@ewellic.org>
Message-ID:

On Wed, Apr 14, 2021 at 3:55 PM Doug Ewell via Unicode wrote:

> (My contribution was to define non-surrogate, high-surrogate, and
> low-surrogate separately instead of making this one behemoth rule.)
>

lgtm

I personally like to use "lead surrogate" and "trail surrogate" because I find "high" vs. "low" confusing.

markus
From duerst at it.aoyama.ac.jp Fri Apr 16 00:33:34 2021
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Fri, 16 Apr 2021 14:33:34 +0900
Subject: Need reference to good ABNF for \uXXXX syntax
In-Reply-To: <002701d73199$f3e94f70$dbbbee50$@ewellic.org>
References: <001001d7314d$04c4d410$0e4e7c30$@ewellic.org> <002701d73199$f3e94f70$dbbbee50$@ewellic.org>
Message-ID: <7b5706b0-b327-0911-0c2b-323cd80536c2@it.aoyama.ac.jp>

Hello Doug,

(Carsten cc'ed as a shortcut.)

On 2021-04-15 10:52, Doug Ewell via Unicode wrote:

> Martin J. Dürst wrote:
>
>> So I guess you are looking for something like the regular expression
>> on https://www.w3.org/International/questions/qa-forms-utf-8, but for
>> the above syntax (rather than byte sequences in UTF-8) and in ABNF.
>
> Yes.
>
>> The closest I was able to come up from memory may be
>> https://tools.ietf.org/html/rfc5137, but it's not exactly what you
>> want.
>
> No, that just repeats the Java spec's UCS-2 definition, in real ABNF instead of whatever the Java spec is using. I did mention RFC 5137 in my post;

Sorry, I shouldn't have missed that.

> when I said it didn't include ABNF for the syntax we're talking about, I meant surrogate-aware.
>
>> I'd guess it might be quicker for you to put something together on
>> your own (and then maybe run it by this list).
>
> What Carsten Bormann came up with was this:
>
>    hexchar = non-surrogate / (high-surrogate "\" %x75 low-surrogate)
>
>    non-surrogate = ((DIGIT / "A" / "B" / "C" / "E" / "F") 3HEXDIG) /
>                    ("D" %x30-37 2HEXDIG )
>
>    high-surrogate = "D" ("8" / "9" / "A" / "B") 2HEXDIG
>
>    low-surrogate = "D" ("C" / "D" / "E" / "F") 2HEXDIG
>
> (My contribution was to define non-surrogate, high-surrogate, and low-surrogate separately instead of making this one behemoth rule.)

What bothers me in this grammar is that the first "\u" isn't anywhere in sight, but the second one is there.
It would be much clearer if either the first "\u" is at the start of hexchar, i.e.

   hexchar = "\" %x75 (non-surrogate / (high-surrogate "\" %x75 low-surrogate))

or the various "\u" parts are integrated with the various parts, as follows:

   hexchar = non-surrogate / (high-surrogate low-surrogate)

   non-surrogate = "\" %x75 (((DIGIT / "A" / "B" / "C" / "E" / "F") 3HEXDIG) /
                             ("D" %x30-37 2HEXDIG))

   high-surrogate = "\" %x75 "D" ("8" / "9" / "A" / "B") 2HEXDIG

   low-surrogate = "\" %x75 "D" ("C" / "D" / "E" / "F") 2HEXDIG

(In non-surrogate, both alternatives are grouped after the "\" %x75 prefix, because concatenation binds more tightly than alternation in ABNF.)

The way it is written, it looks like the convenience of ABNF details (such as maybe line length) is dominating the expression of a clear structure.

Regards, Martin.

From doug at ewellic.org Fri Apr 16 10:25:38 2021
From: doug at ewellic.org (Doug Ewell)
Date: Fri, 16 Apr 2021 09:25:38 -0600
Subject: Need reference to good ABNF for \uXXXX syntax
In-Reply-To: <7b5706b0-b327-0911-0c2b-323cd80536c2@it.aoyama.ac.jp>
References: <001001d7314d$04c4d410$0e4e7c30$@ewellic.org> <002701d73199$f3e94f70$dbbbee50$@ewellic.org> <7b5706b0-b327-0911-0c2b-323cd80536c2@it.aoyama.ac.jp>
Message-ID: <000001d732d4$c1e1b850$45a528f0$@ewellic.org>

Martin J. Dürst wrote:

> What bothers me in this grammar is that the first "\u" isn't anywhere
> in sight, but the second one is there. It would be much clearer if
> either the first "\u" is at the start of hexchar, i.e.

Sorry, I neglected to include this line, which precedes everything I did quote:

   SESC = "\" ( %x22 / %x2F / %x5C / %x62 / %x66 / %x6E / %x72 / %x74 /
                (%x75 hexchar) )

SESC incorporates all the other backslash-escaped characters. It, not hexchar, is the real entity referenced by everything else.
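As a cross-check of the rules quoted in this thread, here is a minimal Python sketch that transcribes hexchar, non-surrogate, high-surrogate and low-surrogate into a regular expression, with the leading "\u" supplied the way SESC supplies it. The names `HEXCHAR` and `is_well_formed_u_escapes` are made up for this illustration and come from neither the CDDL specification nor any library. The sketch accepts lowercase hex digits because quoted strings in ABNF are case-insensitive by default, while the `u` (written as %x75 in the grammar) stays lowercase only.

```python
import re

# One hex digit; ABNF quoted strings and HEXDIG match case-insensitively.
HEX = r"[0-9A-Fa-f]"

# non-surrogate: 0000-D7FF and E000-FFFF.
NON_SURROGATE = rf"(?:[0-9A-CE-Fa-ce-f]{HEX}{{3}}|[Dd][0-7]{HEX}{{2}})"
# high-surrogate: D800-DBFF; low-surrogate: DC00-DFFF.
HIGH_SURROGATE = rf"[Dd][89ABab]{HEX}{{2}}"
LOW_SURROGATE = rf"[Dd][C-Fc-f]{HEX}{{2}}"

# hexchar = non-surrogate / (high-surrogate "\" %x75 low-surrogate)
HEXCHAR = rf"(?:{NON_SURROGATE}|{HIGH_SURROGATE}\\u{LOW_SURROGATE})"

# The leading "\" %x75 comes from the enclosing rule (SESC in the thread).
_U_ESCAPES = re.compile(rf"(?:\\u{HEXCHAR})+")


def is_well_formed_u_escapes(text: str) -> bool:
    """Return True if text consists only of \\uXXXX escapes with paired surrogates."""
    return _U_ESCAPES.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in (r"\u0041", r"\u3042", r"\uD801\uDC02", r"\uD801", r"\uDC02"):
        print(sample, is_well_formed_u_escapes(sample))
```

The first three samples are the examples from the original question and are accepted; the lone high surrogate and the lone low surrogate are rejected, which is the property "correct" was meant to capture.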
--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

From beckiergb at gmail.com Fri Apr 16 14:00:40 2021
From: beckiergb at gmail.com (Rebecca Bettencourt)
Date: Fri, 16 Apr 2021 12:00:40 -0700
Subject: Need reference to good ABNF for \uXXXX syntax
In-Reply-To: <000001d732d4$c1e1b850$45a528f0$@ewellic.org>
References: <001001d7314d$04c4d410$0e4e7c30$@ewellic.org> <002701d73199$f3e94f70$dbbbee50$@ewellic.org> <7b5706b0-b327-0911-0c2b-323cd80536c2@it.aoyama.ac.jp> <000001d732d4$c1e1b850$45a528f0$@ewellic.org>
Message-ID:

Is %x2F supposed to be %x27?

-- Rebecca Bettencourt

On Fri, Apr 16, 2021 at 8:29 AM Doug Ewell via Unicode wrote:

> Martin J. Dürst wrote:
>
> > What bothers me in this grammar is that the first "\u" isn't anywhere
> > in sight, but the second one is there. It would be much clearer if
> > either the first "\u" is at the start of hexchar, i.e.
>
> Sorry, I neglected to include this line, which precedes everything I did
> quote:
>
>    SESC = "\" ( %x22 / %x2F / %x5C / %x62 / %x66 / %x6E / %x72 / %x74 /
>                 (%x75 hexchar) )
>
> SESC incorporates all the other backslash-escaped characters. It, not
> hexchar, is the real entity referenced by everything else.
>
> --
> Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org
>

From kent.b.karlsson at bahnhof.se Fri Apr 16 15:09:27 2021
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Fri, 16 Apr 2021 22:09:27 +0200
Subject: Need reference to good ABNF for \uXXXX syntax
In-Reply-To: <000001d732d4$c1e1b850$45a528f0$@ewellic.org>
References: <001001d7314d$04c4d410$0e4e7c30$@ewellic.org> <002701d73199$f3e94f70$dbbbee50$@ewellic.org> <7b5706b0-b327-0911-0c2b-323cd80536c2@it.aoyama.ac.jp> <000001d732d4$c1e1b850$45a528f0$@ewellic.org>
Message-ID: <1151F8C2-78C2-4B7F-887B-015D3B4C5E9D@bahnhof.se>

> On 16 Apr 2021, at 17:25, Doug Ewell via Unicode wrote:
>
> Martin J. Dürst wrote:
>
>> What bothers me in this grammar is that the first "\u" isn't anywhere
>> in sight, but the second one is there. It would be much clearer if
>> either the first "\u" is at the start of hexchar, i.e.
>
> Sorry, I neglected to include this line, which precedes everything I did quote:
>
>    SESC = "\" ( %x22 / %x2F / %x5C / %x62 / %x66 / %x6E / %x72 / %x74 /
>                 (%x75 hexchar) )

1) Why are some "very plain letters in ASCII" given as hex escapes here? Especially since the not-so-plain "\" (it is used as an escape, which is the point here) has not warranted a hex escape. (The grammar even uses it to escape ", which is a bit ironic.)

2) Apart from the second line there, these have nothing to do with "\u" escapes. In addition, the set of these other escapes varies (a bit) by programming language (or other context), and they technically aren't needed when \u escapes are allowed (though they are still practical).

/Kent K

> SESC incorporates all the other backslash-escaped characters. It, not hexchar, is the real entity referenced by everything else.
>
> --
> Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org
>

From doug at ewellic.org Fri Apr 16 15:38:11 2021
From: doug at ewellic.org (Doug Ewell)
Date: Fri, 16 Apr 2021 14:38:11 -0600
Subject: Need reference to good ABNF for \uXXXX syntax
In-Reply-To: <1151F8C2-78C2-4B7F-887B-015D3B4C5E9D@bahnhof.se>
References: <001001d7314d$04c4d410$0e4e7c30$@ewellic.org> <002701d73199$f3e94f70$dbbbee50$@ewellic.org> <7b5706b0-b327-0911-0c2b-323cd80536c2@it.aoyama.ac.jp> <000001d732d4$c1e1b850$45a528f0$@ewellic.org> <1151F8C2-78C2-4B7F-887B-015D3B4C5E9D@bahnhof.se>
Message-ID: <000001d73300$6b1788c0$41469a40$@ewellic.org>

Again, the object of this exercise was not to redefine the CDDL syntax, but to find good, debugged ABNF to describe the existing syntax.

It does, however, seem reasonable that backslash (%5C) should be included in the list. Also, as Rebecca pointed out, solidus (%2F) should apparently be changed to single-quote (%27).
These are helpful corrections, but orthogonal to the question of how best to represent the \u syntax in ABNF.

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

-----Original Message-----
From: Kent Karlsson

> On 16 Apr 2021, at 17:25, Doug Ewell via Unicode wrote:
>
> Martin J. Dürst wrote:
>
>> What bothers me in this grammar is that the first "\u" isn't anywhere
>> in sight, but the second one is there. It would be much clearer if
>> either the first "\u" is at the start of hexchar, i.e.
>
> Sorry, I neglected to include this line, which precedes everything I did quote:
>
>    SESC = "\" ( %x22 / %x2F / %x5C / %x62 / %x66 / %x6E / %x72 / %x74 /
>                 (%x75 hexchar) )

1) Why are some "very plain letters in ASCII" given as hex escapes here? Especially since the not-so-plain "\" (it is used as an escape, which is the point here) has not warranted a hex escape. (The grammar even uses it to escape ", which is a bit ironic.)

2) Apart from the second line there, these have nothing to do with "\u" escapes. In addition, the set of these other escapes varies (a bit) by programming language (or other context), and they technically aren't needed when \u escapes are allowed (though they are still practical).

/Kent K

From doug at ewellic.org Fri Apr 16 15:49:46 2021
From: doug at ewellic.org (Doug Ewell)
Date: Fri, 16 Apr 2021 14:49:46 -0600
Subject: Need reference to good ABNF for \uXXXX syntax
In-Reply-To:
References: <001001d7314d$04c4d410$0e4e7c30$@ewellic.org> <002701d73199$f3e94f70$dbbbee50$@ewellic.org> <7b5706b0-b327-0911-0c2b-323cd80536c2@it.aoyama.ac.jp> <000001d732d4$c1e1b850$45a528f0$@ewellic.org>
Message-ID: <000201d73302$0937c500$1ba74f00$@ewellic.org>

Carsten Bormann wrote:

>> Is %x2F supposed to be %x27?
>
> No, it's really %x2F.

OK, I take back what I wrote a few minutes ago.
--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

From kent.b.karlsson at bahnhof.se Sat Apr 17 08:13:21 2021
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Sat, 17 Apr 2021 15:13:21 +0200
Subject: Need reference to good ABNF for \uXXXX syntax
In-Reply-To: <8BAFBF3F-79E8-4E22-88C4-58FBF4041C50@tzi.org>
References: <001001d7314d$04c4d410$0e4e7c30$@ewellic.org> <002701d73199$f3e94f70$dbbbee50$@ewellic.org> <7b5706b0-b327-0911-0c2b-323cd80536c2@it.aoyama.ac.jp> <000001d732d4$c1e1b850$45a528f0$@ewellic.org> <1151F8C2-78C2-4B7F-887B-015D3B4C5E9D@bahnhof.se> <8BAFBF3F-79E8-4E22-88C4-58FBF4041C50@tzi.org>
Message-ID: <0334017C-1263-4FFE-BE01-D67810BAD8A3@bahnhof.se>

(Going a bit further off the original topic of this thread...)

> On 16 Apr 2021, at 22:54, Carsten Bormann wrote:
>
>> On 16. Apr 2021, at 22:09, Kent Karlsson wrote:
>>
>>>    SESC = "\" ( %x22 / %x2F / %x5C / %x62 / %x66 / %x6E / %x72 / %x74 /
>>>                 (%x75 hexchar) )
>>
>> 1) Why are some "very plain letters in ASCII" given as hex escapes here? Esp. since the not so plain (it is used as an escape, which is the point here) "\" has not warranted a hex escape. (The grammar even uses it to escape ", which is a bit ironic).
>
> Because RFC 8259 does.
>
> This is ABNF, so there are some peculiarities to be taken care of.
> Long form: %x2F and %x5C really should be "/" and "\", as I wrote before.
> %x22 is a convenient form to put a double quote into ABNF (there is no escaping in ABNF, which was invented around 1977, for RFC 733).
>
> %x62 / %x66 / %x6E / %x72 / %x74 are of course "b"/"f"/"n"/"r"/"t", which prefixed by "\" are popular white space escapes

\b is usually used for backspace (going back decades in tradition...). But backspace is NOT a "whitespace" character at all.
Whether it was used to create bold (on typewriter-style terminals), to overtype in order to create combined characters (long since deprecated), or as a command to erase the character preceding the current position, it has never been a whitespace character. However, vertical tab (sometimes representable as a \v escape, as in C/C++, JavaScript, GoLang, PHP), nowadays used more as an ASCII representation of LINE SEPARATOR than for vertical tabulation, is usually regarded as a whitespace character.

/Kent K

> so you don't have to use \uXXXX for them.
> Unfortunately, writing "b"/"f"/"n"/"r"/"t" in ABNF would invoke the default case-insensitivity of ABNF (think 1977 again).
> This could be written %s"b"/%s"f"/%s"n"/%s"r"/%s"t" with the ABNF extension documented in RFC 7405, but RFC 7159 (that became RFC 8259 later) was written before RFC 7405 (obviously). Also, not using the extension slightly widens the set of tools that can be used with this ABNF.
>
> I apologise for polluting this list with arcane details of JavaScript and ABNF, but those are the reasons this grammar looks like it does.
>
> Grüße, Carsten
>

From doug at ewellic.org Sat Apr 17 10:50:42 2021
From: doug at ewellic.org (Doug Ewell)
Date: Sat, 17 Apr 2021 09:50:42 -0600
Subject: Need reference to good ABNF for \uXXXX syntax
In-Reply-To: <0334017C-1263-4FFE-BE01-D67810BAD8A3@bahnhof.se>
References: <001001d7314d$04c4d410$0e4e7c30$@ewellic.org> <002701d73199$f3e94f70$dbbbee50$@ewellic.org> <7b5706b0-b327-0911-0c2b-323cd80536c2@it.aoyama.ac.jp> <000001d732d4$c1e1b850$45a528f0$@ewellic.org> <1151F8C2-78C2-4B7F-887B-015D3B4C5E9D@bahnhof.se> <8BAFBF3F-79E8-4E22-88C4-58FBF4041C50@tzi.org> <0334017C-1263-4FFE-BE01-D67810BAD8A3@bahnhof.se>
Message-ID: <001301d733a1$6cc99360$465cba20$@ewellic.org>

I can see it would have been best if I had not posted the "SESC = ..." line at all, but simply replied with something like "hexchar is referenced by a previous rule that includes the '\u' prefix."
--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

From lorna_evans at sil.org  Mon Apr 26 16:50:40 2021
From: lorna_evans at sil.org (Lorna Evans)
Date: Mon, 26 Apr 2021 16:50:40 -0500
Subject: Normalizing Syriac
Message-ID: <1dfd9dcd-895b-b5ae-0d06-e08401bf5ab5@sil.org>

I've got a situation that I'm not sure how to handle...or even if Unicode or the rendering engines need an update.

In a language using Syriac there is a /rish seyame/ which can be followed by U+0739 or U+0738

/rish/ = 072A
/seyame/ = 0308

In TUS, chapter 9, it says:

> In Modern Syriac usage, when a word contains a /rish/ and a /seyame/, the dot of the /rish/ and the /seyame/ are replaced by a /rish/ with two dots above it.

Then, there's a table which indicates this ligature is obligatory:

> Table 9-17. Syriac Ligatures
>
> Ligature Classes. As in other scripts, ligatures in Syriac vary depending on the font style.
> Table 9-17 identifies the principal valid ligatures for each font style. When applicable, these ligatures are obligatory, unless denoted with an asterisk (*).
>
> rish seyame   Right-joining   Right-joining   Right-joining   BFBS
> (no asterisk, so it is obligatory)

Finally, in "Developing OpenType Fonts for Syriac Script"
https://docs.microsoft.com/en-us/typography/script-development/syriac

In the "Glossary section" it says:

> *Ligature* - A combination of glyphs that join to form a single glyph.
> For example, the 'rish seyame' (U072a + U0308) combinations of glyphs are mandatory ligatures for Syriac. Other ligatures are optional.

So, it seems clear that 072a+0308 is a mandatory ligature.

The problem I'm seeing is that when this ligature is followed by U+0739 or U+0738 AND an application does normalization, it changes the sequence to U+072A U+0739 U+0308 and that breaks the ligature.

You can see why they are reordering it when you see that the canonical combining class of U+0308 is 230 and that of U+0738 and U+0739 is 220.
0308;COMBINING DIAERESIS;Mn;230;NSM;;;;;N;NON-SPACING DIAERESIS;;;;
0738;SYRIAC DOTTED ZLAMA HORIZONTAL;Mn;220;NSM;;;;;N;;;;;
0739;SYRIAC DOTTED ZLAMA ANGULAR;Mn;220;NSM;;;;;N;;;;;

All of the Syriac fonts that I see only handle this sequence *U+072A U+0308 U+0739* and not the reordered *U+072A U+0739 U+0308*

Are the fonts wrong, should they be able to handle U+072A U+0739 U+0308?

Or, is there a special normalization rule for this?

How should /rish seyame/ followed by a below mark like U+0738 or U+0739 be handled?

Lorna

From richard.wordingham at ntlworld.com  Mon Apr 26 17:48:40 2021
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 26 Apr 2021 23:48:40 +0100
Subject: Normalizing Syriac
In-Reply-To: <1dfd9dcd-895b-b5ae-0d06-e08401bf5ab5@sil.org>
References: <1dfd9dcd-895b-b5ae-0d06-e08401bf5ab5@sil.org>
Message-ID: <20210426234840.76732e6c@JRWUBU2>

On Mon, 26 Apr 2021 16:50:40 -0500
Lorna Evans via Unicode wrote:

> You can see why they are reordering it when you see 0308 is 230 and
> U+0738 or U+0739 are 220.
>
> 0308;COMBINING DIAERESIS;Mn;230;NSM;;;;;N;NON-SPACING DIAERESIS;;;;
> 0738;SYRIAC DOTTED ZLAMA HORIZONTAL;Mn;220;NSM;;;;;N;;;;;
> 0739;SYRIAC DOTTED ZLAMA ANGULAR;Mn;220;NSM;;;;;N;;;;;
>
> All of the Syriac fonts that I see only handle this sequence *U+072A
> U+0308 U+0739* and not the reordered *U+072A U+0739 U+0308*
>
> Are the fonts wrong, should they be able to handle U+072A U+0739
> U+0308?
>
> Or, is there a special normalization rule for this?
>
> How should /rish seyame/ followed by a below mark like U+0738 or
> U+0739 be handled?

It depends on your technology. In an OpenType font, I would combine RISH with COMBINING DIAERESIS using a substitution lookup that ignores marks below. Am I missing something?
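
The reordering described in the original question can be reproduced with Python's standard unicodedata module. The following is a small sketch, not taken from any message in the thread; the variable names are illustrative, and the output depends on the Unicode data shipped with the Python build, although these combining classes are covered by the stability policy.

```python
import unicodedata

# rish, combining diaeresis (seyame), dotted zlama angular, in typed order
typed = "\u072A\u0308\u0739"

# Canonical combining classes, matching the UnicodeData lines quoted above
for ch in typed:
    print(f"U+{ord(ch):04X} ccc={unicodedata.combining(ch)}")

# Canonical ordering sorts adjacent marks by ascending combining class,
# so the below mark (220) moves in front of the above mark (230).
normalized = unicodedata.normalize("NFC", typed)
print(" ".join(f"U+{ord(ch):04X}" for ch in normalized))
```

On a current Python this prints the sequence U+072A U+0739 U+0308 on the last line, which is the order that the fonts in question reportedly do not handle.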
In a combination of base, mark above and mark below, the order of the marks shouldn't matter if they don't interact - one just sets up the mark 'attachment' classes so that the marks are in different classes. In later versions of OpenType, one can even ignore a set of marks peculiar to that lookup.

Of course, the OpenType (syntax) specification doesn't state what the subsequent sequence of glyphs is after a ligature lookup if an intervening mark has been skipped. John Hudson has publicly complained that the semantics of OpenType ought to be defined. Perhaps some Syriac shaper exploits this gap to go spectacularly wrong - one would hope it doesn't.

It has struck me as odd that there is very little hint around of what sequences of marks fonts have to handle. Back when HarfBuzz was beginning to handle Tai Tham, Behdad kindly did a normalisation on the fly so that tone marks (ccc=230) would come before COENG (ccc=9) so that COENG would remain adjacent to its following consonant. There is a similar issue with Hebrew. (Like a good boy, I'd elaborated my fonts to handle normalised sequences.)

It is well known that the set of character sequences supported by Uniscribe is not closed under canonical equivalence - apparently this is allowed by the conformance clauses of TUS.

Richard.

From harjitmoe at outlook.com  Mon Apr 26 17:58:23 2021
From: harjitmoe at outlook.com (Harriet Riddle)
Date: Mon, 26 Apr 2021 22:58:23 +0000
Subject: Normalizing Syriac
In-Reply-To: <1dfd9dcd-895b-b5ae-0d06-e08401bf5ab5@sil.org>
References: <1dfd9dcd-895b-b5ae-0d06-e08401bf5ab5@sil.org>
Message-ID:

What I gather for background information (which you may well already be aware of, but just in case) is that:

- Normalisation rules are set in stone per stability policy (software has to be able to rely on any input that normalises to a certain output continuing to normalise like that, so it can use a normalised form as e.g.
a database key, input for a password hash, etc., even if a better behaviour theoretically exists).
- A cluster of a base character and combining characters can be interrupted with one or more of the confusingly named Combining Grapheme Joiner (U+034F), which is typically used to split what is one grapheme cluster for display purposes into multiple grapheme clusters for normalisation and/or collation purposes. This can be used to inhibit diacritic reorderings that pose an issue in practice.

-Har.

From doug at ewellic.org  Mon Apr 26 23:21:08 2021
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 26 Apr 2021 22:21:08 -0600
Subject: Normalizing Syriac
In-Reply-To: <20210426234840.76732e6c@JRWUBU2>
References: <1dfd9dcd-895b-b5ae-0d06-e08401bf5ab5@sil.org> <20210426234840.76732e6c@JRWUBU2>
Message-ID: <000001d73b1c$c0ea90c0$42bfb240$@ewellic.org>

Richard Wordingham wrote:

> In an OpenType font, I would combine RISH with COMBINING DIAERESIS
> using a substitution lookup that ignores marks below.

I thought the number one goal of Unicode was to make text encoding independent of font selection.
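
Two points raised in this thread can be checked with a short Python sketch: that the typed and reordered sequences are canonically equivalent, and that the Combining Grapheme Joiner suggested earlier keeps the marks in their typed order under normalization. The helper nfc and the variable names are illustrative only, and the sketch says nothing about whether a given font or shaper then forms the ligature.

```python
import unicodedata

RISH, SEYAME, ZLAMA, CGJ = "\u072A", "\u0308", "\u0739", "\u034F"


def nfc(text: str) -> str:
    """Normalize to NFC using the Unicode data bundled with Python."""
    return unicodedata.normalize("NFC", text)


typed = RISH + SEYAME + ZLAMA
reordered = RISH + ZLAMA + SEYAME
with_cgj = RISH + SEYAME + CGJ + ZLAMA

# Both orders normalize to the same string, so they are canonically equivalent.
same_after_nfc = nfc(typed) == nfc(reordered)

# CGJ has combining class 0, so the marks on either side of it cannot be swapped.
cgj_preserved = nfc(with_cgj) == with_cgj

print(same_after_nfc, cgj_preserved)
```

Note that the CGJ variant is a different character sequence from the other two, so it trades the normalization problem for an extra character that fonts and search or collation code would also have to cope with.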
--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

From richard.wordingham at ntlworld.com  Tue Apr 27 13:47:02 2021
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 27 Apr 2021 19:47:02 +0100
Subject: Normalizing Syriac
In-Reply-To: <000001d73b1c$c0ea90c0$42bfb240$@ewellic.org>
References: <1dfd9dcd-895b-b5ae-0d06-e08401bf5ab5@sil.org> <20210426234840.76732e6c@JRWUBU2> <000001d73b1c$c0ea90c0$42bfb240$@ewellic.org>
Message-ID: <20210427194702.5470e918@JRWUBU2>

On Mon, 26 Apr 2021 22:21:08 -0600
Doug Ewell via Unicode wrote:

> Richard Wordingham wrote:
>
> > In an OpenType font, I would combine RISH with COMBINING DIAERESIS
> > using a substitution lookup that ignores marks below.
>
> I thought the number one goal of Unicode was to make text encoding
> independent of font selection.

Which is why I would do it that way. Those who wrote the fonts seem to have assumed that these two characters would wind up adjacent.

It isn't always trivial to ensure that canonically equivalent sequences render the same. The rendering engine can make a big difference to how much work has to be done by the tables in the font.

Richard.

From asmusf at ix.netcom.com  Tue Apr 27 17:18:07 2021
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Tue, 27 Apr 2021 15:18:07 -0700
Subject: Normalizing Syriac
In-Reply-To: <20210427194702.5470e918@JRWUBU2>
References: <1dfd9dcd-895b-b5ae-0d06-e08401bf5ab5@sil.org> <20210426234840.76732e6c@JRWUBU2> <000001d73b1c$c0ea90c0$42bfb240$@ewellic.org> <20210427194702.5470e918@JRWUBU2>
Message-ID: <72d59eaf-7ba1-5f3f-afe5-629f444a6da0@ix.netcom.com>

An HTML attachment was scrubbed...
URL:

From richard.wordingham at ntlworld.com  Wed Apr 28 03:12:37 2021
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 28 Apr 2021 09:12:37 +0100
Subject: Normalizing Syriac
In-Reply-To: <72d59eaf-7ba1-5f3f-afe5-629f444a6da0@ix.netcom.com>
References: <1dfd9dcd-895b-b5ae-0d06-e08401bf5ab5@sil.org> <20210426234840.76732e6c@JRWUBU2> <000001d73b1c$c0ea90c0$42bfb240$@ewellic.org> <20210427194702.5470e918@JRWUBU2> <72d59eaf-7ba1-5f3f-afe5-629f444a6da0@ix.netcom.com>
Message-ID: <20210428091237.414f91cf@JRWUBU2>

On Tue, 27 Apr 2021 15:18:07 -0700
Asmus Freytag via Unicode wrote:

> Doug, what Richard is saying is that Normalization being predictable
> and fixed, it is up to each font (all of them) to correctly render
> all forms of normalized text (and not to make assumptions on some
> unnormalized ordering).

I don't know if he has repented, but someone proposed that fonts should not be allowed to remove dotted circles so as, for instance, to render normalised Tibetan contractions with vowels above and below in the same akshara.

Richard.

From rick at unicode.org  Wed Apr 28 16:25:02 2021
From: rick at unicode.org (Rick McGowan)
Date: Wed, 28 Apr 2021 14:25:02 -0700
Subject: Unicode.org mail system maintenance
Message-ID: <6089D2AE.30209@unicode.org>

Hello everyone,

On May 3 the Unicode Consortium will be doing maintenance and configuration updates to our e-mail service infrastructure. This could entail some mail delays, mail list outage, and/or temporary delivery issues. The maintenance is expected to begin mid-morning US Pacific Time and be complete by afternoon of the same day. I will be back in touch with further news after that.

Regards,