From prospero at cyber-wizard.com Wed Dec 6 17:44:17 2023
From: prospero at cyber-wizard.com (prospero)
Date: Thu, 7 Dec 2023 00:44:17 +0100
Subject: Question regarding TR-29
Message-ID:

unicode.org/reports/tr29

The WB4 rule for word breaks:

> Ignore Format and Extend characters, except after sot, CR, LF, and
> Newline. (See Section 6.2, Replacing Ignore Rules
> [https://unicode.org/reports/tr29/#Grapheme_Cluster_and_Format_Rules].)
> This also has the effect of: Any × (Format | Extend | ZWJ)

seems incomplete and ambiguous. First, the "except after" part needs to
apply to WSegSpace also, otherwise tests fail. And the handling of WB3c
seems contradicted by the tests, e.g., the one on line 1158:

÷ 200D × 0308 ÷ 231A ÷ # ÷ [0.2] ZERO WIDTH JOINER (ZWJ_FE) × [4.0] COMBINING DIAERESIS (Extend_FE) ÷ [999.0] WATCH (ExtPict) ÷ [0.3]

seems to contradict it, since ignoring the 0308 (Extend_FE) should yield a
ZWJ_FE + ExtPict, which should not break, but the test requires a break. If
the tests are dispositive, could TR-29 be better clarified to reflect them?

From pgcon6 at msn.com Thu Dec 7 12:44:43 2023
From: pgcon6 at msn.com (Peter Constable)
Date: Thu, 7 Dec 2023 18:44:43 +0000
Subject: UTC public review issues to close January 2
In-Reply-To:
References:
Message-ID:

After the last Unicode Technical Committee meeting, there were some public
review issues posted. PRIs are a way that UTC uses to solicit input and
feedback on specific proposals or work in progress.

The input period for these PRIs ends January 2, 2024. (Time is needed
before the next UTC meeting to process the feedback.) That means we're
almost halfway through the public review period.

Here's a summary of the five open PRIs:

PRI #483: Proposed Update UAX #38, Unicode Han Database (Unihan)
UAX #38 describes the many properties Unicode provides for CJK ideographs.
This is a draft update of this spec for Unicode 16.0.

PRI #484: Proposed Update UAX #50, Unicode Vertical Text Layout
UAX #50 describes how characters should be adjusted between horizontal and
vertical layout. This is a draft update of this spec for Unicode 16.0.

PRI #485: Draft UTR #56, Unicode Cuneiform Sign Lists
This is a draft for a new technical report that will provide additional
data that will aid in the use of the Unicode encoding for Sumero-Akkadian
Cuneiform script.

PRI #486: Stabilization of UAX #42, Unicode Character Database in XML (UCDXML)
UAX #42 provides the data for the Unicode Character Database in XML format.
(UCD is character property data for use in processing algorithms that is
provided with each version of Unicode.) This PRI is for feedback on a
planned UTC action to freeze UAX #42 as of Unicode 15.1.

PRI #487: Proposed Update UAX #53, Unicode Arabic Mark Rendering
This specification was previously published as a Unicode Technical Report.
This is a draft for changing the status of the spec, to make it formally
part of The Unicode Standard as a Unicode Standard Annex (UAX) starting in
Unicode 16.0.

UTC invites you to please take a look and provide feedback on these issues.

Peter Constable
UTC Chair

From manishsmail at gmail.com Thu Dec 7 17:56:58 2023
From: manishsmail at gmail.com (Manish Goregaokar)
Date: Thu, 7 Dec 2023 15:56:58 -0800
Subject: Question regarding TR-29
In-Reply-To:
References:
Message-ID:

Hi!

I think a crucial thing to note about interpreting these rules is that they
must be applied in order: WB4 can only be applied after all of the WB3s,
and so on. In general the logical model is that each rule is applied to the
entire input string before moving on to the next rule. In practice,
implementations tend to come up with a way of doing this in one or a
handful of loops by retaining some careful state.
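
A minimal sketch of the ordering described above, not taken from UAX #29 or
any library: each boundary is tested against WB3c first and only then
against WB4, always looking at the raw neighbouring characters. The `PROPS`
table and the `boundaries` function are hypothetical names, the table covers
only the three code points from the test line quoted earlier in the thread,
and every other rule (including the CR/LF/Newline exception in WB4) is left
out.

```python
# Hypothetical, hand-written property table covering only the code points
# discussed in this thread. A real implementation would load the Unicode
# Character Database property files instead.
PROPS = {
    0x200D: "ZWJ",      # ZERO WIDTH JOINER
    0x0308: "Extend",   # COMBINING DIAERESIS
    0x231A: "ExtPict",  # WATCH (Extended_Pictographic)
}

BREAK = "\u00f7"     # the division sign used in the test file for "break"
NO_BREAK = "\u00d7"  # the multiplication sign used for "no break"


def boundaries(cps):
    """Return one mark per boundary position (len(cps) + 1 of them).

    Only WB1/WB2, WB3c, WB4 and WB999 are modelled, in that order.
    """
    if not cps:
        return []
    marks = []
    for i in range(len(cps) + 1):
        if i == 0 or i == len(cps):
            marks.append(BREAK)  # WB1/WB2: break at start and end of text
            continue
        left = PROPS.get(cps[i - 1], "Other")
        right = PROPS.get(cps[i], "Other")
        if left == "ZWJ" and right == "ExtPict":
            # WB3c comes before WB4, so it sees the raw neighbours:
            # nothing has been "ignored" yet.
            marks.append(NO_BREAK)
        elif right in ("Extend", "Format", "ZWJ"):
            # WB4 expressed as: Any x (Format | Extend | ZWJ)
            marks.append(NO_BREAK)
        else:
            marks.append(BREAK)  # WB999: otherwise, break everywhere
    return marks
```

Under these assumptions, `boundaries([0x200D, 0x0308, 0x231A])` marks a
break before U+231A, matching the test line, while
`boundaries([0x200D, 0x231A])` does not.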
The sequences `WSegSpace Format* WSegSpace` or `ZWJ Extend Ext_Pict` won't
have do-not-breaks generated by WB3d/WB3c because those rules apply before
the "ignore Extend/Format" rule (WB4). Since no rules after WB4 mention
Extended_Pictographic or WSegSpace, WB4 does not need to try to include
them in the "except" clause.

Hope this helps

Thanks,
-Manish

On Wed, Dec 6, 2023, 4:17 PM prospero via Unicode wrote:

> unicode.org/reports/tr29
>
> The WB4 rule for word breaks:
>
> > Ignore Format and Extend characters, except after sot, CR, LF, and
> > Newline. (See Section 6.2, Replacing Ignore Rules
> > [https://unicode.org/reports/tr29/#Grapheme_Cluster_and_Format_Rules].)
> > This also has the effect of: Any × (Format | Extend | ZWJ)
>
> seems incomplete and ambiguous. First, the "except after" part needs to
> apply to WSegSpace also, otherwise tests fail. And the handling of WB3c
> seems contradicted by the tests, e.g., the one on line 1158:
>
> ÷ 200D × 0308 ÷ 231A ÷ # ÷ [0.2] ZERO WIDTH JOINER (ZWJ_FE) × [4.0]
> COMBINING DIAERESIS (Extend_FE) ÷ [999.0] WATCH (ExtPict) ÷ [0.3]
>
> seems to contradict it, since ignoring the 0308 (Extend_FE) should yield a
> ZWJ_FE + ExtPict, which should not break, but the test requires a break. If
> the tests are dispositive, could TR-29 be better clarified to reflect them?

From prospero at cyber-wizard.com Fri Dec 8 13:55:56 2023
From: prospero at cyber-wizard.com (prospero)
Date: Fri, 8 Dec 2023 20:55:56 +0100
Subject: Question regarding TR-29
In-Reply-To:
References:
Message-ID:

An HTML attachment was scrubbed...

From doug at ewellic.org Mon Dec 18 10:31:23 2023
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 18 Dec 2023 16:31:23 +0000
Subject: UDHR in Unicode
Message-ID:

I noticed that the "UDHR in Unicode" link has been removed from the
Technical Site web page. The actual site, , is still present.
I'm wondering whether this is part of a simple reorganization, or whether
this long-running project is being dismantled - and if so, why.

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

From doug at ewellic.org Mon Dec 18 23:31:40 2023
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 19 Dec 2023 05:31:40 +0000
Subject: UTC public review issues to close January 2
In-Reply-To:
References:
Message-ID:

Peter Constable wrote:

> https://www.unicode.org/review/pri486/
> - UAX #42 provides the data for the Unicode Character Database in XML
> format. (UCD is character property data for use in processing
> algorithms that is provided with each version of Unicode.) This PRI is
> for feedback on a planned UTC action to freeze UAX #42 as of Unicode
> 15.1.

This is a shame. I don't know how widely the XML files were adopted, but I
certainly found them easier to process than the traditional Unicode data
files.

I imagine creating these files was a matter of auto-generation with custom
tools, combined with human fine-tuning and judgment (i.e. where to draw the
line when grouping characters). It would be great if Eric and/or Laurențiu
could donate any tools, but the human effort is probably what could not be
replaced.

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

From daniel.buenzli at erratique.ch Tue Dec 19 08:37:59 2023
From: daniel.buenzli at erratique.ch (Daniel Bünzli)
Date: Tue, 19 Dec 2023 15:37:59 +0100
Subject: UTC public review issues to close January 2
In-Reply-To:
References:
Message-ID:

On 19 December 2023 at 06:34:55, Doug Ewell via Unicode
(unicode at corp.unicode.org) wrote:

> This is a shame. I don't know how widely the XML files were adopted, but
> I certainly found them easier to process than the traditional Unicode
> data files.

For me this is only half the story.
As I wrote in my feedback on the PRI, UAX42 is the only place where you can
easily find out the type of a property and where their evolution from
version to version is carefully chronicled. This is a golden resource if
you maintain APIs that expose or make use of these properties.

Regarding how much it is used, it's unclear, but if you search for the
various compressed and uncompressed file names on code hosting platforms,
it's far from anecdotal.

Best,

Daniel

From pgcon6 at msn.com Tue Dec 19 13:24:34 2023
From: pgcon6 at msn.com (Peter Constable)
Date: Tue, 19 Dec 2023 19:24:34 +0000
Subject: UTC public review issues to close January 2
In-Reply-To:
References:
Message-ID:

Human effort - a committed volunteer - was, indeed, the missing factor that
led to asking whether it was worth continuing to maintain UCDXML.

Peter

-----Original Message-----
From: Doug Ewell
Sent: Monday, December 18, 2023 10:32 PM
To: Peter Constable ; unicode at unicode.org
Subject: RE: UTC public review issues to close January 2

Peter Constable wrote:

> https://www.unicode.org/review/pri486/
> - UAX #42 provides the data for the Unicode
> Character Database in XML format. (UCD is character property data for
> use in processing algorithms that is provided with each version of
> Unicode.) This PRI is for feedback on a planned UTC action to freeze
> UAX #42 as of Unicode 15.1.

This is a shame. I don't know how widely the XML files were adopted, but I
certainly found them easier to process than the traditional Unicode data
files.
I imagine creating these files was a matter of auto-generation with custom
tools, combined with human fine-tuning and judgment (i.e. where to draw the
line when grouping characters). It would be great if Eric and/or Laurențiu
could donate any tools, but the human effort is probably what could not be
replaced.

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

From pgcon6 at msn.com Tue Dec 19 13:29:12 2023
From: pgcon6 at msn.com (Peter Constable)
Date: Tue, 19 Dec 2023 19:29:12 +0000
Subject: UDHR in Unicode
In-Reply-To:
References:
Message-ID:

That project is being closed down. I'm not certain of all the exact reasons.

Peter

-----Original Message-----
From: Unicode On Behalf Of Doug Ewell via Unicode
Sent: Monday, December 18, 2023 9:31 AM
To: unicode at corp.unicode.org
Subject: UDHR in Unicode

I noticed that the "UDHR in Unicode" link has been removed from the
Technical Site web page. The actual site, , is still present.

I'm wondering whether this is part of a simple reorganization, or whether
this long-running project is being dismantled - and if so, why.

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

From ashpilkin at gmail.com Tue Dec 19 15:10:01 2023
From: ashpilkin at gmail.com (Alexander Shpilkin)
Date: Tue, 19 Dec 2023 23:10:01 +0200
Subject: UDHR in Unicode
In-Reply-To:
References:
Message-ID:

So, um, does anybody have an up-to-date copy of the Git repository? Because
apparently outright deleting the data when a project is shut down is
something the Unicode Consortium considers a good and proper thing to do.

-Alex

On Tue, 19 Dec 2023, 21:33 Peter Constable via Unicode,
<unicode at corp.unicode.org> wrote:

> That project is being closed down. I'm not certain of all the exact
> reasons.
>
> Peter
>
> -----Original Message-----
> From: Unicode On Behalf Of Doug Ewell via Unicode
> Sent: Monday, December 18, 2023 9:31 AM
> To: unicode at corp.unicode.org
> Subject: UDHR in Unicode
>
> I noticed that the "UDHR in Unicode" link has been removed from the
> Technical Site web page. The actual site, , is still present.
>
> I'm wondering whether this is part of a simple reorganization, or whether
> this long-running project is being dismantled - and if so, why.
>
> --
> Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

From doug at ewellic.org Tue Dec 19 15:16:34 2023
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 19 Dec 2023 21:16:34 +0000
Subject: UDHR in Unicode
In-Reply-To:
References:
Message-ID:

Alexander Shpilkin wrote:

> So, um, does anybody have an up-to-date copy of the Git repository?
> Because apparently outright deleting the data when a project is shut
> down is something the Unicode Consortium considers a good and proper
> thing to do.

As of a couple of minutes ago, the site at https://unicode.org/udhr/ is
still up and the aggregate files and bulk downloads, at least, still appear
to be available.

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

From moyogo at gmail.com Tue Dec 19 16:12:09 2023
From: moyogo at gmail.com (Denis Jacquerye)
Date: Tue, 19 Dec 2023 23:12:09 +0100
Subject: UDHR in Unicode
In-Reply-To:
References:
Message-ID:

On Tue, 19 Dec 2023 at 22:20, Doug Ewell via Unicode
<unicode at corp.unicode.org> wrote:

> Alexander Shpilkin wrote:
>
> > So, um, does anybody have an up-to-date copy of the Git repository?
> > Because apparently outright deleting the data when a project is shut
> > down is something the Unicode Consortium considers a good and proper
> > thing to do.
>
> As of a couple of minutes ago, the site at https://unicode.org/udhr/ is
> still up and the aggregate files and bulk downloads, at least, still appear
> to be available.
>

There are some forks as well, for example https://github.com/moyogo/udhr or
https://github.com/sffc/udhr

From ashpilkin at gmail.com Tue Dec 19 16:40:44 2023
From: ashpilkin at gmail.com (Alexander Shpilkin)
Date: Wed, 20 Dec 2023 00:40:44 +0200
Subject: UDHR in Unicode
In-Reply-To:
References:
Message-ID:

On Wed, 20 Dec 2023, 00:12 Denis Jacquerye, wrote:

> On Tue, 19 Dec 2023 at 22:20, Doug Ewell via Unicode
> <unicode at corp.unicode.org> wrote:
>
>> Alexander Shpilkin wrote:
>>
>> > So, um, does anybody have an up-to-date copy of the Git repository?
>> > Because apparently outright deleting the data when a project is shut
>> > down is something the Unicode Consortium considers a good and proper
>> > thing to do.
>>
>> As of a couple of minutes ago, the site at https://unicode.org/udhr/ is
>> still up and the aggregate files and bulk downloads, at least, still appear
>> to be available.
>>
>
> There are some forks as well, for example https://github.com/moyogo/udhr
> or https://github.com/sffc/udhr
>

Yes, and apparently you updated yours (the first one) to the last released
version as I was scouring various archives and search engine caches for
commit SHAs; thank you for that!

-Alex

From jameskass at code2001.com Mon Dec 25 17:10:24 2023
From: jameskass at code2001.com (James Kass)
Date: Mon, 25 Dec 2023 23:10:24 +0000
Subject: UDHR in Unicode
In-Reply-To:
References:
Message-ID: <8c9926c0-265a-4114-b930-de22ed21902b@code2001.com>

On 2023-12-19 7:29 PM, Peter Constable via Unicode wrote:

> That project is being closed down. I'm not certain of all the exact reasons.
>
> Peter

That's a shame.
Was any effort made to ask the UN if they had any interest in hosting the
project?

From wjgo_10009 at btinternet.com Tue Dec 26 01:53:23 2023
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Tue, 26 Dec 2023 07:53:23 +0000 (GMT)
Subject: The Rescue Project (from Re: UDHR in Unicode)
In-Reply-To: <8c9926c0-265a-4114-b930-de22ed21902b@code2001.com>
References: <8c9926c0-265a-4114-b930-de22ed21902b@code2001.com>
Message-ID: <38563ac1.814.18ca51d3641.Webtop.95@btinternet.com>

James Kass wrote:

> That's a shame.

Yes.

I wonder whether, even though the material appears to be on its way to the
scrapyard, it can be rescued and restored by enthusiasts, just as many
steam locomotives in England were rescued and restored after being sent to
what is known informally as Barry Scrapyard.

And, just as there are some new-build steam locomotive projects in England,
can the number of languages for which there is a translation be increased,
please?

William Overington

Tuesday 26 December 2023

------ Original Message ------
From: "James Kass via Unicode"
To: unicode at corp.unicode.org
Sent: Monday, 2023 Dec 25 At 23:10
Subject: Re: UDHR in Unicode

On 2023-12-19 7:29 PM, Peter Constable via Unicode wrote:

That project is being closed down. I'm not certain of all the exact reasons.

Peter

That's a shame. Was any effort made to ask the UN if they had any interest
in hosting the project?

From wjgo_10009 at btinternet.com Tue Dec 26 02:27:36 2023
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Tue, 26 Dec 2023 08:27:36 +0000 (GMT)
Subject: Bing Chat AI Artificial Intelligence and Unicode
Message-ID: <7c2d5b54.81e.18ca53c8a51.Webtop.95@btinternet.com>

Recently I have been experimenting with Bing Chat AI, just as an end user,
from within the Edge browser running on my home computer.
After some experiments produced amazing results, I decided to try requesting
content in a language other than English, namely Portuguese, and it worked
well.

Later I tried an experiment that produced results not only in several
languages but in several scripts too.

Text in languages other than English (apart from a few words in Welsh)
appears from the second post on page 4 of the thread onward.

Here is a link.

https://punster.me/serif/viewtopic.php?id=516&p=4

William Overington

Tuesday 26 December 2023