From unicode at unicode.org Thu Oct 3 10:12:51 2019
From: unicode at unicode.org (Johannes Bergerhausen via Unicode)
Date: Thu, 3 Oct 2019 17:12:51 +0200
Subject: worldswritingsystems.org
Message-ID: <8295C9C0-05BB-44F6-88D9-21B1B34AD902@bergerhausen.com>

Dear list,

FYI: we've updated http://worldswritingsystems.org to Unicode 12.1 and fixed a few little bugs and errors.

All the best,
Johannes

– Helmig Bergerhausen
Gladbacher Straße 40, D-50672 Köln, Germany
www.helmigbergerhausen.de

– Prof. Bergerhausen
Hochschule Mainz, School of Design
Holzstraße 36, D-55116 Mainz, Germany

www.worldswritingsystems.org
www.decodeunicode.org
www.designlabor-gutenberg.de
www.hs-mainz.de/gestaltung

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org Thu Oct 3 11:53:37 2019
From: unicode at unicode.org (Markus Scherer via Unicode)
Date: Thu, 3 Oct 2019 09:53:37 -0700
Subject: Manipuri/Meitei customary writing system
Message-ID: 

Dear Unicoders,

Is Manipuri/Meitei customarily written in Bangla/Bengali script or in Meitei script?

I am looking at https://en.wikipedia.org/wiki/Meitei_language#Writing_systems which seems to describe writing practice in transition, and I can't quite tell where it stands.

Is the use of the Meitei script aspirational or customary? Which script is being used for major newspapers, popular books, and video captions?

Thanks,
markus

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org Fri Oct 4 01:35:09 2019
From: unicode at unicode.org (=?utf-8?B?TWFydGluIEouIETDvHJzdA==?= via Unicode)
Date: Fri, 4 Oct 2019 06:35:09 +0000
Subject: Manipuri/Meitei customary writing system
In-Reply-To: 
References: 
Message-ID: <86f9a05f-8b00-1992-5afd-d07ec834df47@it.aoyama.ac.jp>

Hello Markus,

On 2019/10/04 01:53, Markus Scherer via Unicode wrote:
> Dear Unicoders,
>
> Is Manipuri/Meitei customarily written in Bangla/Bengali script or
> in Meitei script?
> > I am looking at > https://en.wikipedia.org/wiki/Meitei_language#Writing_systems which seems > to describe writing practice in transition, and I can't quite tell where it > stands. > > Is the use of the Meitei script aspirational or customary? > Which script is being used for major newspapers, popular books, and video > captions? This may give you some more information: https://www.atypi.org/conferences/tokyo-2019/programme/activity?a=906https://www.youtube.com/watch?v=S8XxVZkfUkk It's a recent talk at ATypI in Tokyo (sponsored by Google, among others). Regards, Martin. From unicode at unicode.org Fri Oct 4 02:12:59 2019 From: unicode at unicode.org (=?utf-8?B?TWFydGluIEouIETDvHJzdA==?= via Unicode) Date: Fri, 4 Oct 2019 07:12:59 +0000 Subject: Manipuri/Meitei customary writing system In-Reply-To: <86f9a05f-8b00-1992-5afd-d07ec834df47@it.aoyama.ac.jp> References: <86f9a05f-8b00-1992-5afd-d07ec834df47@it.aoyama.ac.jp> Message-ID: <37a2689d-50f0-84f9-40a4-92e97e670325@it.aoyama.ac.jp> On 2019/10/04 15:35, Martin J. D?rst via Unicode wrote: > Hello Markus, > > On 2019/10/04 01:53, Markus Scherer via Unicode wrote: >> Dear Unicoders, >> >> Is Manipuri/Meitei customarily written in Bangla/Bengali script or >> in Meitei script? >> >> I am looking at >> https://en.wikipedia.org/wiki/Meitei_language#Writing_systems which seems >> to describe writing practice in transition, and I can't quite tell where it >> stands. >> >> Is the use of the Meitei script aspirational or customary? >> Which script is being used for major newspapers, popular books, and video >> captions? > > This may give you some more information: > https://www.atypi.org/conferences/tokyo-2019/programme/activity?a=906 Sorry, this should have been two separate URIs (about the same talk). > https://www.youtube.com/watch?v=S8XxVZkfUkk > > It's a recent talk at ATypI in Tokyo (sponsored by Google, among others). > > Regards, Martin. 
> From unicode at unicode.org Fri Oct 4 16:02:57 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 4 Oct 2019 22:02:57 +0100 Subject: Manipuri/Meitei customary writing system In-Reply-To: <37a2689d-50f0-84f9-40a4-92e97e670325@it.aoyama.ac.jp> References: <86f9a05f-8b00-1992-5afd-d07ec834df47@it.aoyama.ac.jp> <37a2689d-50f0-84f9-40a4-92e97e670325@it.aoyama.ac.jp> Message-ID: <20191004220257.2ea735df@JRWUBU2> On Fri, 4 Oct 2019 07:12:59 +0000 Martin J. D?rst via Unicode wrote: > On 2019/10/04 15:35, Martin J. D?rst via Unicode wrote: > > Hello Markus, > > > > On 2019/10/04 01:53, Markus Scherer via Unicode wrote: > >> Dear Unicoders, > >> > >> Is Manipuri/Meitei customarily written in Bangla/Bengali script or > >> in Meitei script? > >> > >> I am looking at > >> https://en.wikipedia.org/wiki/Meitei_language#Writing_systems > >> which seems to describe writing practice in transition, and I > >> can't quite tell where it stands. > >> > >> Is the use of the Meitei script aspirational or customary? > >> Which script is being used for major newspapers, popular books, > >> and video captions? > > > > This may give you some more information: > > https://www.atypi.org/conferences/tokyo-2019/programme/activity?a=906 > > Sorry, this should have been two separate URIs (about the same talk). > > > https://www.youtube.com/watch?v=S8XxVZkfUkk > > > > It's a recent talk at ATypI in Tokyo (sponsored by Google, among > > others). So newspaper sales tell us that the Bengali script is still the *usual* script for the language. Is that a different question to what the 'customary' script is? Richard. 
From unicode at unicode.org Fri Oct 4 18:11:50 2019 From: unicode at unicode.org (Markus Scherer via Unicode) Date: Fri, 4 Oct 2019 16:11:50 -0700 Subject: Manipuri/Meitei customary writing system In-Reply-To: <20191004220257.2ea735df@JRWUBU2> References: <86f9a05f-8b00-1992-5afd-d07ec834df47@it.aoyama.ac.jp> <37a2689d-50f0-84f9-40a4-92e97e670325@it.aoyama.ac.jp> <20191004220257.2ea735df@JRWUBU2> Message-ID: On Fri, Oct 4, 2019 at 2:05 PM Richard Wordingham via Unicode < unicode at unicode.org> wrote: > > >> Is the use of the Meitei script aspirational or customary? > > >> Which script is being used for major newspapers, popular books, > > >> and video captions? > > > > > > This may give you some more information: > > > https://www.atypi.org/conferences/tokyo-2019/programme/activity?a=906 > > > > > Sorry, this should have been two separate URIs (about the same talk). > > > > > https://www.youtube.com/watch?v=S8XxVZkfUkk > > > > > > It's a recent talk at ATypI in Tokyo (sponsored by Google, among > > > others). > > So newspaper sales tell us that the Bengali script is still the *usual* > script for the language. Yes. FYI in the video, the relevant part is at 14:04-14:34. My transcription: "Due to the lack of readership of Meetei Mayek, local newspapers continue to use Bengali script. On 21st September 2008, Hueiyen Lanpao, a newspaper company, published the first Meetei Mayek newspaper set entirely using Meetei Mayek script. Although there have been small columns for Meetei Mayek in other newspapers, Hueiyen Lanpao is still the only local newspaper in all of Manipur to be printed using Meetei Mayek script till date." Earlier the presenter says that Bengali is starting to disappear from public signage. Is that a different question to what the 'customary' script is? > To me, things like newspapers are among the most indicative of customary use. 
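For anyone implementing this, one mechanical piece is easy regardless of which script turns out to be customary: Bengali and Meetei Mayek occupy disjoint Unicode blocks, so existing text can be tallied by block to see which script a corpus actually uses. A minimal sketch (block ranges are from the Unicode block charts; the helper name is ours, not from any library):

```python
# Rough per-script tally for Manipuri text, assuming we only need to
# distinguish Bengali (Beng) from Meetei Mayek (Mtei).
# Block ranges from the Unicode block charts:
#   Bengali                 U+0980..U+09FF
#   Meetei Mayek Extensions U+AAE0..U+AAFF
#   Meetei Mayek            U+ABC0..U+ABFF
from collections import Counter

def script_tally(text: str) -> Counter:
    """Count characters per script of interest; everything else is 'Other'."""
    tally = Counter()
    for ch in text:
        cp = ord(ch)
        if 0x0980 <= cp <= 0x09FF:
            tally["Beng"] += 1
        elif 0xAAE0 <= cp <= 0xAAFF or 0xABC0 <= cp <= 0xABFF:
            tally["Mtei"] += 1
        else:
            tally["Other"] += 1
    return tally

# A few Meetei Mayek letters/signs vs. Bengali letters for "Manipuri":
print(script_tally("\uABC3\uABE4\uABC7\uABE9"))                    # Mtei-dominant
print(script_tally("\u09AE\u09A3\u09BF\u09AA\u09C1\u09B0\u09C0"))  # Beng-dominant
```

Running a tally like this over newspaper text or captions is one crude way to measure the transition the thread is discussing.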
From what I understand, someone who wants to support this language should prepare to support both Beng and Mtei, with emphasis on Beng now and Mtei later.

markus
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org Sun Oct 6 13:57:36 2019
From: unicode at unicode.org (=?utf-8?B?5qKB5rW3IExpYW5nIEhhaQ==?= via Unicode)
Date: Sun, 6 Oct 2019 13:57:36 -0500
Subject: =?utf-8?Q?Alternative_encodings_for_Malayalam_=E2=80=9Cnta?= =?utf-8?Q?=E2=80=9D?=
Message-ID: <49FE1395-9719-45A4-B3F6-0DACF22D313D@gmail.com>

Folks,

(Microsoft Peter and Andrew, search for "Windows" in the document.)

(Asmus, in the document there's a section 5, ICANN RZ-LGR situation; let me know if there's some news.)

This is a pretty straightforward document about the notoriously problematic encoding of Malayalam <chillu n, bottom-side sign of rra>. I always wanted to properly document this, so finally here it is:

L2/19-345
Alternative encodings for Malayalam "nta"
Liang Hai
2019-10-06

Unfortunately, as has already become the de facto standard encoding, now we have to recognize it in the Core Spec. It's a bit like another Tamil srī situation.

An excerpt of the proposal:

Document the following widely used encoding in the Core Specification as an alternative representation for Malayalam [glyph] (<chillu n, bottom-side sign of rra>) that is a special case and does not suggest any productive rule in the encoding model:

Best,
梁海 Liang Hai
https://lianghai.github.io
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org Sun Oct 6 16:08:13 2019
From: unicode at unicode.org (Cibu via Unicode)
Date: Sun, 6 Oct 2019 22:08:13 +0100
Subject: =?UTF-8?Q?Re=3A_Alternative_encodings_for_Malayalam_=E2=80=9Cnta=E2=80=9D?=
In-Reply-To: <49FE1395-9719-45A4-B3F6-0DACF22D313D@gmail.com>
References: <49FE1395-9719-45A4-B3F6-0DACF22D313D@gmail.com>
Message-ID: 

Thanks for addressing this.
Here is my response: https://docs.google.com/document/d/1K6L82VRmCGc9Fb4AOitNk4MT7Nu4V8aKUJo_1mW5X1o/ In summary, my take is: The sequence for ??? (<>) should not be legitimized as an alternate encoding; but should be recognized as a prevailing non-standard legacy encoding. On Sun, Oct 6, 2019 at 7:57 PM ?? Liang Hai wrote: > Folks, > > (Microsoft Peter and Andrew, search for ?Windows? in the document.) > > (Asmus, in the document there?s a section 5, *ICANN RZ-LGR situation*?let > me know if there?s some news.) > > This is a pretty straightforward document about the notoriously > problematic encoding of Malayalam <*chillu n*, bottom-side sign of *rra*>. > I always wanted to properly document this, so finally here it is: > > L2/19-345 > *Alternative encodings for Malayalam "nta"* > Liang Hai > 2019-10-06 > > > Unfortunately, as has already become the de facto > standard encoding, now we have to recognize it in the Core Spec. It?s a bit > like another Tamil *sr?* situation. > > An excerpt of the proposal: > > Document the following widely used encoding in the Core Specification as > an alternative representation for Malayalam [glyph] ( sign of rra>) that is a special case and does not suggest any productive > rule in the encoding model: > > VIRAMA, U+0D31 ? MALAYALAM LETTER RRA> > > > Best, > ?? Liang Hai > https://lianghai.github.io > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Oct 6 17:03:01 2019 From: unicode at unicode.org (Asmus Freytag (c) via Unicode) Date: Sun, 6 Oct 2019 15:03:01 -0700 Subject: =?UTF-8?Q?Re=3a_Alternative_encodings_for_Malayalam_=e2=80=9cnta?= =?UTF-8?B?4oCd?= In-Reply-To: References: <49FE1395-9719-45A4-B3F6-0DACF22D313D@gmail.com> Message-ID: Have you submitted that response as a UTC document? A./ On 10/6/2019 2:08 PM, Cibu wrote: > Thanks for addressing this. 
Here is my response: > https://docs.google.com/document/d/1K6L82VRmCGc9Fb4AOitNk4MT7Nu4V8aKUJo_1mW5X1o/ > > In summary, my take is: > > The sequence for ??? (<>) > should not be legitimized as an alternate encoding; but should be > recognized as a prevailing non-standard legacy encoding. > > > On Sun, Oct 6, 2019 at 7:57 PM ?? Liang Hai > wrote: > > Folks, > > (Microsoft Peter and Andrew, search for ?Windows? in the document.) > > (Asmus, in the document there?s a section 5, /ICANN RZ-LGR > situation/?let me know if there?s some news.) > > This is a pretty straightforward document about the notoriously > problematic encoding of Malayalam /rra/>. I always wanted to properly document this, so finally here > it is: > > L2/19-345 > > *Alternative encodings?for Malayalam "nta"* > Liang?Hai > 2019-10-06 > > > Unfortunately, as has already become the de > facto standard encoding, now we have to recognize it in the Core > Spec. It?s a bit like another Tamil /sr?/ situation. > > An excerpt of the proposal: > > Document the following widely used encoding in > the?Core?Specification?as an alternative?representation for > Malayalam [glyph]?()?that > is a special?case and?does not suggest any productive rule in > the encoding model: > > VIRAMA,?U+0D31???MALAYALAM?LETTER RRA> > > > Best, > ?? Liang Hai > https://lianghai.github.io > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Oct 6 17:06:11 2019 From: unicode at unicode.org (Asmus Freytag (c) via Unicode) Date: Sun, 6 Oct 2019 15:06:11 -0700 Subject: =?UTF-8?Q?Re=3a_Alternative_encodings_for_Malayalam_=e2=80=9cnta?= =?UTF-8?B?4oCd?= In-Reply-To: <49FE1395-9719-45A4-B3F6-0DACF22D313D@gmail.com> References: <49FE1395-9719-45A4-B3F6-0DACF22D313D@gmail.com> Message-ID: On 10/6/2019 11:57 AM, ?? Liang Hai wrote: > Folks, > > (Microsoft Peter and Andrew, search for ?Windows? in the document.) 
> > (Asmus, in the document there?s a section 5, /ICANN RZ-LGR > situation/?let me know if there?s some news.) The issue, as it affects domain names, has been brought to the authors of the Malayalam Root Zone LGR proposal, the Neo-Brahmi Generation Panel; however, there is no new status to report at this time. I would appreciate if you could keep me updated on any details of the UTC decision (particularly those that do not make the rather terse UTC minutes). A./ > > This is a pretty straightforward document about the notoriously > problematic encoding of Malayalam /rra/>. I always wanted to properly document this, so finally here it is: > > L2/19-345 > > *Alternative encodings?for Malayalam "nta"* > Liang?Hai > 2019-10-06 > > > Unfortunately, as has already become the de facto > standard encoding, now we have to recognize it in the Core Spec. It?s > a bit like another Tamil /sr?/ situation. > > An excerpt of the proposal: > > Document the following widely used encoding in > the?Core?Specification?as an alternative?representation for > Malayalam [glyph]?()?that is a > special?case and?does not suggest any productive rule in the > encoding model: > > VIRAMA,?U+0D31???MALAYALAM?LETTER RRA> > > > Best, > ?? Liang Hai > https://lianghai.github.io > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Oct 6 17:10:30 2019 From: unicode at unicode.org (Cibu via Unicode) Date: Sun, 6 Oct 2019 23:10:30 +0100 Subject: =?UTF-8?Q?Re=3A_Alternative_encodings_for_Malayalam_=E2=80=9Cnta=E2=80=9D?= In-Reply-To: References: <49FE1395-9719-45A4-B3F6-0DACF22D313D@gmail.com> Message-ID: Yes; it is now available as L2/19-348 . On Sun, Oct 6, 2019 at 11:03 PM Asmus Freytag (c) wrote: > Have you submitted that response as a UTC document? > A./ > > On 10/6/2019 2:08 PM, Cibu wrote: > > Thanks for addressing this. 
Here is my response: > https://docs.google.com/document/d/1K6L82VRmCGc9Fb4AOitNk4MT7Nu4V8aKUJo_1mW5X1o/ > > In summary, my take is: > > The sequence for ??? (<>) > should not be legitimized as an alternate encoding; but should be > recognized as a prevailing non-standard legacy encoding. > > > On Sun, Oct 6, 2019 at 7:57 PM ?? Liang Hai wrote: > >> Folks, >> >> (Microsoft Peter and Andrew, search for ?Windows? in the document.) >> >> (Asmus, in the document there?s a section 5, *ICANN RZ-LGR situation*?let >> me know if there?s some news.) >> >> This is a pretty straightforward document about the notoriously >> problematic encoding of Malayalam <*chillu n*, bottom-side sign of *rra*>. >> I always wanted to properly document this, so finally here it is: >> >> L2/19-345 >> *Alternative encodings for Malayalam "nta"* >> Liang Hai >> 2019-10-06 >> >> >> Unfortunately, as has already become the de facto >> standard encoding, now we have to recognize it in the Core Spec. It?s a bit >> like another Tamil *sr?* situation. >> >> An excerpt of the proposal: >> >> Document the following widely used encoding in the Core Specification as >> an alternative representation for Malayalam [glyph] (> sign of rra>) that is a special case and does not suggest any productive >> rule in the encoding model: >> >> > VIRAMA, U+0D31 ? MALAYALAM LETTER RRA> >> >> >> Best, >> ?? Liang Hai >> https://lianghai.github.io >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Oct 6 18:05:16 2019 From: unicode at unicode.org (Tex via Unicode) Date: Sun, 6 Oct 2019 16:05:16 -0700 Subject: comma ellipses Message-ID: <001501d57c9a$8c41ada0$a4c508e0$@xencraft.com> Now that comma ellipses (,,,) are a thing (at least on social media) do we need a character proposal? Asking for a friend,,, J tex -------------- next part -------------- An HTML attachment was scrubbed... 
URL: 

From unicode at unicode.org Sun Oct 6 19:01:04 2019
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Sun, 6 Oct 2019 17:01:04 -0700
Subject: comma ellipses
In-Reply-To: <001501d57c9a$8c41ada0$a4c508e0$@xencraft.com>
References: <001501d57c9a$8c41ada0$a4c508e0$@xencraft.com>
Message-ID: 

An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org Sun Oct 6 22:21:06 2019
From: unicode at unicode.org (Garth Wallace via Unicode)
Date: Sun, 6 Oct 2019 20:21:06 -0700
Subject: comma ellipses
In-Reply-To: 
References: <001501d57c9a$8c41ada0$a4c508e0$@xencraft.com>
Message-ID: 

It's deliberately incorrect for humorous effect. It gets used, but making it "official" would almost defeat the purpose.

On Sun, Oct 6, 2019 at 5:02 PM Asmus Freytag via Unicode <unicode at unicode.org> wrote:
> On 10/6/2019 4:05 PM, Tex via Unicode wrote:
>
> Now that comma ellipses (,,,) are a thing (at least on social media) do we need a character proposal?
>
> Asking for a friend,,, J
>
> tex
>
> I thought the main reason we ended up with the period (dot) one is because it was originally needed for CJK-style fixed grid layout purposes. But I could be wrong.
>
> What's the current status for the 3-dot ellipsis? Does it get used? Do we have autocorrect for it? If so, that would argue that implementers have settled and any derivative usage (comma) should be kept compatible.
>
> A./
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org Mon Oct 7 00:39:07 2019
From: unicode at unicode.org (Tex via Unicode)
Date: Sun, 6 Oct 2019 22:39:07 -0700
Subject: comma ellipses
In-Reply-To: 
References: <001501d57c9a$8c41ada0$a4c508e0$@xencraft.com>
Message-ID: <000601d57cd1$89d897e0$9d89c7a0$@xencraft.com>

Just for additional info on the subject:
https://www.theguardian.com/science/2019/oct/05/linguist-gretchen-mcculloch-interview-because-internet-book

"I've been spending a fair bit of time recently with the comma ellipsis, which is three commas (,,,) instead of dot-dot-dot. I've been looking at it for over a year and I'm still figuring out what's going on there. There seems to be something but possibly several somethings.

One use is by older people who, in some cases where they would use the classic ellipsis, use commas instead. It's not quite clear if that's a typo in some cases, but it seems to be more systematic than that. Maybe they're preferring the comma because it's a little bit easier to see if you're on the older side, and your vision is not what it once was. Or maybe they just see the two as equivalent. It then seems to have jumped the shark into parody form. There's a Facebook group in which younger people pretend to be baby boomers, and one of the features people use there is this comma ellipsis. And then in some circles there also seems to be a use of comma ellipses that is very, very heavily ironic. But what exactly the nature is of that heavy irony is still something that I'm working on figuring out."

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Asmus Freytag via Unicode
Sent: Sunday, October 6, 2019 10:21 PM
To: unicode at unicode.org
Subject: Re: comma ellipses

On 10/6/2019 8:21 PM, Garth Wallace via Unicode wrote:
It's deliberately incorrect for humorous effect. It gets used, but making it "official" would almost defeat the purpose.
Well then it should encode a "typographically incorrect" comma ellipsis :) A./ On Sun, Oct 6, 2019 at 5:02 PM Asmus Freytag via Unicode wrote: On 10/6/2019 4:05 PM, Tex via Unicode wrote: Now that comma ellipses (,,,) are a thing (at least on social media) do we need a character proposal? Asking for a friend,,, J tex I thought the main reason we ended up with the period (dot) one is because it was originally needed for CJK-style fixed grid layout purposes. But It could be wrong. What's the current status for 3-dot ellipsis. Does it get used? Do we have autocorrect for it? If so, that would argue that implementers have settled and any derivative usage (comma) should be kept compatible. A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Oct 7 00:59:17 2019 From: unicode at unicode.org (David Starner via Unicode) Date: Sun, 6 Oct 2019 22:59:17 -0700 Subject: comma ellipses In-Reply-To: <000601d57cd1$89d897e0$9d89c7a0$@xencraft.com> References: <001501d57c9a$8c41ada0$a4c508e0$@xencraft.com> <000601d57cd1$89d897e0$9d89c7a0$@xencraft.com> Message-ID: I still see the encoding of the original ellipsis as a mistake, probably for compatibility with some older standard that included it because the system wasn't smart enough to intelligently handle "..." as ellipsis. -- Kie ekzistas vivo, ekzistas espero. From unicode at unicode.org Mon Oct 7 01:09:50 2019 From: unicode at unicode.org (Asmus Freytag (c) via Unicode) Date: Sun, 6 Oct 2019 23:09:50 -0700 Subject: comma ellipses In-Reply-To: <000601d57cd1$89d897e0$9d89c7a0$@xencraft.com> References: <001501d57c9a$8c41ada0$a4c508e0$@xencraft.com> <000601d57cd1$89d897e0$9d89c7a0$@xencraft.com> Message-ID: <543956a9-6d30-617d-c527-0d7ffd2aa7f2@ix.netcom.com> Now you are introducing research - that kills all the fun . . . 
(oops , , , ) A./ On 10/6/2019 10:39 PM, Tex wrote: > > Just for additional info on the subject: > > https://www.theguardian.com/science/2019/oct/05/linguist-gretchen-mcculloch-interview-because-internet-book > > ??I?ve been spending a fair bit of time recently with the comma > ellipsis, which is three commas (,,,) instead of dot-dot-dot. I?ve > been looking at it for over a year and I?m still figuring out what?s > going on there. There seems to be something but possibly several > somethings. > > One use is by older people who, in some cases where they would use the > classic ellipsis, use commas instead. It?s not quite clear if that?s a > typo in some cases, but it seems to be more systematic than that. > Maybe they?re preferring the comma because it?s a little bit easier to > see if you?re on the older side, and your vision is not what it once > was. Or maybe they just see the two as equivalent. It then seems to > have jumped the shark into parody form. There?s a Facebook group in > which younger people pretend to be to be baby boomers, and one of the > features people use there is this comma ellipsis. And then in some > circles there also seems to be a use of comma ellipses that is very, > very heavily ironic. But what exactly the nature is of that heavy > irony is still something that I?m working on figuring out?.? > > *From:*Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of > *Asmus Freytag via Unicode > *Sent:* Sunday, October 6, 2019 10:21 PM > *To:* unicode at unicode.org > *Subject:* Re: comma ellipses > > On 10/6/2019 8:21 PM, Garth Wallace via Unicode wrote: > > It?s deliberately incorrect for humorous effect. It gets used, but > making it ?official? would almost defeat the purpose. 
> > Well then it should encode a "typographically incorrect" comma ellipsis :) > > A./ > > On Sun, Oct 6, 2019 at 5:02 PM Asmus Freytag via Unicode > > wrote: > > On 10/6/2019 4:05 PM, Tex via Unicode wrote: > > Now that comma ellipses (,,,) are a thing (at least on > social media) do we need a character proposal? > > Asking for a friend,,, J > > tex > > I thought the main reason we ended up with the period (dot) > one is because it was originally needed for CJK-style fixed > grid layout purposes. But It could be wrong. > > What's the current status for 3-dot ellipsis. Does it get > used? Do we have autocorrect for it? If so, that would argue > that implementers have settled and any derivative usage > (comma) should be kept compatible. > > A./ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Oct 7 01:30:16 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Sun, 6 Oct 2019 23:30:16 -0700 Subject: comma ellipses In-Reply-To: References: <001501d57c9a$8c41ada0$a4c508e0$@xencraft.com> <000601d57cd1$89d897e0$9d89c7a0$@xencraft.com> Message-ID: <99abf2a4-6195-4ef7-59c5-d3cbf52ed7cd@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Oct 7 02:49:05 2019 From: unicode at unicode.org (=?UTF-8?B?V8OhbmcgWWlmw6Fu?= via Unicode) Date: Mon, 7 Oct 2019 16:49:05 +0900 Subject: worldswritingsystems.org In-Reply-To: <8295C9C0-05BB-44F6-88D9-21B1B34AD902@bergerhausen.com> References: <8295C9C0-05BB-44F6-88D9-21B1B34AD902@bergerhausen.com> Message-ID: A very comprehensive website, but a couple of details have come to my attention. - Bronze script: despite its traditional name, it's hard to say that it is a single consistent system (just jump to WP via your link). They are various "scripts" at best, only grouped by writing medium which was the only thing sure at the early stage of study. 
- Seal script: while I'm not sure strictly what the dates stand for, 121 CE is when the oldest extant dictionary of it was compiled (because its usage had declined) and not likely when its usage started. The dated use goes back to around 200 BCE, and some unearthed materials apparently predate it by some centuries. The end date is much more puzzling. AFAIK there's no essential gap between the 20th century and today; you can either say it was officially obsolete before 121, or still has ritual use to this day.

2019年10月4日(金) 0:15 Johannes Bergerhausen via Unicode:
>
> Dear list,
>
> FYI: we've updated http://worldswritingsystems.org to Unicode 12.1 and fixed a few little bugs and errors.
>
> All the best,
> Johannes
>
> – Helmig Bergerhausen
>
> Gladbacher Straße 40, D-50672 Köln, Germany
>
> www.helmigbergerhausen.de
>
> – Prof. Bergerhausen
>
> Hochschule Mainz, School of Design
>
> Holzstraße 36, D-55116 Mainz, Germany
>
> www.worldswritingsystems.org
> www.decodeunicode.org
> www.designlabor-gutenberg.de
> www.hs-mainz.de/gestaltung
>

From unicode at unicode.org Mon Oct 7 03:42:56 2019
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Mon, 7 Oct 2019 10:42:56 +0200
Subject: comma ellipses
In-Reply-To: <001501d57c9a$8c41ada0$a4c508e0$@xencraft.com>
References: <001501d57c9a$8c41ada0$a4c508e0$@xencraft.com>
Message-ID: 

Commas may be used instead of dots by users of French keyboards (it's easier to type the comma, when the dot/full stop requires pressing the SHIFT key). I may be wrong, but I've quite frequently seen commas or semicolons instead of dots/full stops under normal orthography.
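One checkable fact behind this subthread: U+2026 HORIZONTAL ELLIPSIS carries a compatibility decomposition to three full stops, so NFKC folds the single character back to "...", while ",,," has no precomposed counterpart at all. A quick standard-library check (a sketch, not part of any proposal):

```python
import unicodedata

# U+2026 HORIZONTAL ELLIPSIS has the compatibility decomposition
# <compat> 002E 002E 002E, so NFKC turns it into three full stops.
assert unicodedata.normalize("NFKC", "\u2026") == "..."

# There is no precomposed comma ellipsis: ",,," is just three commas,
# and normalization leaves it unchanged.
assert unicodedata.normalize("NFKC", ",,,") == ",,,"

print(unicodedata.name("\u2026"))  # HORIZONTAL ELLIPSIS
```

So any software that already NFKC-folds text treats "…" and "..." as equivalent; a hypothetical comma-ellipsis character would either need the same compatibility mapping or create a new equivalence problem.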
But the web and notably social networks can invent their own "rule": pretending that the dot/full stop at the end of a sentence is "aggressive" is probably a deviation from the English-only designation of the dot as a "full stop", reinterpreted as "stop talking about this, my sentence is final, I don't want to give more justification" (when in such cases the user would have done better to use the exclamation mark!).

Anyway, I've never liked the 3-dot ellipsis, which just occurs in Unicode for compatibility with fixed-width fonts on terminals, just to compact 3 cells into one (or in CJK styles to replace the "bubble" dots with their 1/2 cell gap on the right side of each cell, contracting them to three smaller dots in just one CJK cell).

But another reason could be that using commas instead of dots allows distinguishing the ellipsis from an abbreviation dot used just before it. Or making the distinction to explicitly mark the end of the sentence by a regular dot/full stop after the ellipsis, when the ellipsis could be used in the middle of a sentence (there is no clear distinction when what follows the ellipsis is a proper name starting with a capital, or not a word: where is the end of the sentence?) and for which the alternative using a comma ellipsis would explicitly say that the ellipsis does not terminate the sentence, as in "I need to spend $2... $4 to return" (one sentence; the meaning is different from "I need to spend $2,,, $4 to return", where that comma ellipsis would be an abbreviation for "between $2 and $4").

Anyway, people have the right to use commas if they prefer them for the semantics they intend to distinguish. This does not mean that we need to encode this sequence as a separate unbreakable character like it was done for the dot ellipsis. Otherwise, we would have to encode "etc." also as a single character, or we would end up adding many more leader dots (in classic metal type, regular dots/full stops were used, but some type compositors may have liked to mount a single "..."
character to avoid having to keep them glued, or keep them regularly spaced with special spacers when justifying lines mechanically: this saved them a little time when composing rows of metal type). There's no real need for CJK or for monospaced terminals to get a more compact presentation. And for regular text, just using multiple separate commas will still render as intended. And metal types are no longer used.

Personally, I don't like the 3-dot ellipsis character because it plays badly even in monospaced fonts. And there's no demonstrated use where a single 3-comma ellipsis character would have to be distinguished semantically and visually from 3 separate commas. If people want to use ",,," for their informal speech on social networks, or in chat sessions, they can do that today without needing any new character and a new keyboard layout or input method. And nobody will really know if this ",,," was mistyped instead of "..." to avoid pressing SHIFT on a French AZERTY keyboard (not extended by a numeric keypad where the dot/full stop may also be typed easily without SHIFT). Likewise, a French typist could have used ";;;" with semicolons when forgetting to press the SHIFT key.

If we encode ",,," as a single character, then why not "???" or "!!!", or "----", or "**", and many other variants mixing multiple punctuation signs or symbols (like "$$" as an "angry" mark or the abbreviation for "costly", then also "??" or "??"...)? Then also why not "eeeeeee" or "hmmmmmmmm" for noting hesitations? This would become endless, without any limit: Unicode would then start encoding millions of whole words of thousands of languages as single characters, many more than the whole existing set of CJK ideographs (including its extensions spanning nearly two planes). Interoperability would worsen.

Le lun. 7 oct. 2019 à 01:14, Tex via Unicode a écrit :

> Now that comma ellipses (,,,) are a thing (at least on social media) do we need a character proposal?
> > > Asking for a friend,,, J
> > tex
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org Mon Oct 7 15:05:08 2019
From: unicode at unicode.org (=?utf-8?B?5qKB5rW3IExpYW5nIEhhaQ==?= via Unicode)
Date: Mon, 7 Oct 2019 13:05:08 -0700
Subject: =?utf-8?Q?Re=3A_Alternative_encodings_for_Malayalam_=E2=80=9Cnta?= =?utf-8?Q?=E2=80=9D?=
In-Reply-To: 
References: <49FE1395-9719-45A4-B3F6-0DACF22D313D@gmail.com>
Message-ID: <5E95EE1F-F41E-4407-AE44-BFFD1146DEAD@gmail.com>

[Putting the public mailing list back to the recipient list.]

Cibu,

Thanks for your L2/19-348 (Response to L2/19-345). My comments:

> I am curious to know the reference for the phonetic analysis described in section "A chillu-less analysis" in the proposal L2/19-345. How can a phonetic analysis be the basis for an important double encoding decision?

The basis is not the phonetic analysis (the phonetic analysis is only provided in the document as an FYI, so readers understand why many people use it), but the fact of a widespread alternative encoding. Basically we need to properly recognize the failure of ensuring a single, ideal encoding. It's not helpful to keep the Core Spec detached from reality.

> In any case, the sequence implied by this particular analysis is an artifact of the evolution of Unicode for Malayalam; it is not grounded in any prior writing traditions or academic literature.

We're not talking about the legitimacy of the phonetic encoding.

> In Malayalam, dental /n̪/ and alveolar /n/ are not allophones as implied in the proposal.

I actually didn't suggest any allophone relationship, on purpose. If it's helpful, I can change the "~" notation in "[n̪a ~ na]" (and [ra ~ ta]) to "/" or "," in a revision.

> So using for CHILLU N is not phonetically accurate.

This is not a valid argument (see the next paragraph), although accuracy is not relevant anyway (as I said, I was trying to explain why people use , not trying to legitimize it.).
The written form ? is the syllable-coda specific form of the written form ?, and the pronunciation of ? being limited to [n] is a result of Malayalam's phonology ([n̪] not usually appearing in a syllable-coda position, unless preceding another dental sound). The reason for ?? being used in the phonetic encoding is mostly because ? is not considered to be eligible for conjunct forming, and ?? is the natural fallback. Again, I'm not trying to legitimize the encoding, but only explaining my observation of the widespread encoding. > Moreover, if you show the visual ???? (<>) to a native user (who is unaware of Unicode particulars), they will not identify it as (<> /nt?/); instead, they would read it as /n?r?/. Not relevant. I avoided ?????? particularly for this kind of argument. The ?? was only there to mark an inherent vowel suppressed ?. I almost avoided ?? altogether because of its ambiguity, but didn't do it, because that would make the document too obscure. The point is that an inherent-vowel-suppressed ? is used in the phonetic encoding, and ?? just happens to be used there. > This proposal does not address the remaining chillu conjuncts described in "L2/19-086R". The document doesn't propose any productive encoding rule. Why does it need to address other cases? > It also does not address the legacy sequence supported by MS Windows for (<>). I can make it clearer that is just plainly unacceptable as it clashes with our general rule of chillu not forming a conjunct with its following letter automatically (without a conjoiner), in Section 4, Real-world encodings. > I am not sure how this proposal is going to solve the issue of inadequate support for , without explicitly rescinding this sequence. Double encoding for (<>) is not going to solve any issue; if anything, it makes the issue more acute. Double encoding is never a desirable quality for Unicode. So the decision should not be taken lightly or hastily. It needs to be clearly thought through, probably through a PRI.
Double encoding will not be solved. The proposal is about recognizing the reality of failure. With Windows on the loose for so many years, we've already missed the opportunity of ensuring a single encoding for the written form. Now the standard needs to first recognize the widespread encoding that won't go away, so implementers are informed. Then we see which direction we should push Microsoft and Apple to converge. I agree that the Unicode Standard might need to have a clear disposition/preference between the graphic and phonetic encodings, so the two are not considered to be just equal, so we can have a direction for pushing the implementations to converge. > Prior to Unicode 5.2, the encoding of the cluster [glyph] (<> /nt?/) was not clearly defined. … You mean 5.1, right? The encoding has been specified since 5.1. > ? and ? How can implementations support this encoding without breaking the side-by-side form ?? though? Best, 梁海 Liang Hai https://lianghai.github.io >> On Oct 6, 2019, at 15:10, Cibu > wrote: >> >> Yes; it is now available as L2/19-348 . >> >> On Sun, Oct 6, 2019 at 11:03 PM Asmus Freytag (c) > wrote: >> Have you submitted that response as a UTC document? >> A./ >> >> On 10/6/2019 2:08 PM, Cibu wrote: >>> Thanks for addressing this. Here is my response: https://docs.google.com/document/d/1K6L82VRmCGc9Fb4AOitNk4MT7Nu4V8aKUJo_1mW5X1o/ >>> >>> In summary, my take is: >>> >>> The sequence for ??? (<>) should not be legitimized as an alternate encoding; but should be recognized as a prevailing non-standard legacy encoding. >>> >>> >>> On Sun, Oct 6, 2019 at 7:57 PM 梁海 Liang Hai > wrote: >>> Folks, >>> >>> (Microsoft Peter and Andrew, search for "Windows" in the document.) >>> >>> (Asmus, in the document there's a section 5, ICANN RZ-LGR situation; let me know if there's some news.) >>> >>> This is a pretty straightforward document about the notoriously problematic encoding of Malayalam .
I always wanted to properly document this, so finally here it is: >>> >>> L2/19-345 >>> Alternative encodings for Malayalam "nta" >>> Liang Hai >>> 2019-10-06 >>> >>> Unfortunately, as has already become the de facto standard encoding, now we have to recognize it in the Core Spec. It's a bit like another Tamil srī situation. >>> >>> An excerpt of the proposal: >>> >>> Document the following widely used encoding in the Core Specification as an alternative representation for Malayalam [glyph] () that is a special case and does not suggest any productive rule in the encoding model: >>> >>> >>> >>> Best, >>> 梁海 Liang Hai >>> https://lianghai.github.io >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Oct 8 09:25:34 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 8 Oct 2019 15:25:34 +0100 Subject: Pure Regular Expression Engines and Literal Clusters Message-ID: <20191008152534.2068db6c@JRWUBU2> I've been puzzling over how a pure regular expression engine that works via a non-deterministic finite automaton can be bent to accommodate 'literal clusters' as in Requirement RL2.2 'Extended Grapheme Clusters' of UTS#18 'Unicode Regular Expressions' - "To meet this requirement, an implementation shall provide a mechanism for matching against an arbitrary extended grapheme cluster, a literal cluster, and matching extended grapheme cluster boundaries." It works from a regular expression by stitching together the FSMs corresponding to its elements. An example UTS#18 gives for matching a literal cluster can be simplified to, in its notation: [c \q{ch}] This is interpreted as 'match against "ch" if possible, otherwise against "c"'. Thus the strings "ca" and "cha" would both match the expression [c \q{ch}]a while "chh" but not "ch" would match against [c \q{ch}]h Or have I got this wrong?
Thus, while "[c \q{ch}]" may be a regex, it is clearly not any notation for a regular expression in the mathematical sense. It seems to me that this expression requires backtracking, which is totally alien to the design of the regular expression engine. One problem then is that the engine supports both the union and intersection of regular languages. While algebraic manipulation might raise union to the highest level, eliminating intersection is an expensive operation which I have deliberately avoided. While backtracking is feasible if state progression has been restricted to the FSM for a literal cluster, it is far more difficult if multiple FSMs have been running in parallel. As the engine fully respects canonical equivalence (with the result that it can find an accented letter of the Vietnamese alphabet even if it bears a subscript tone mark), concatenated subexpressions can divide the input streams between them. Consequently, the backtracking mechanism gets complicated. May I correctly argue instead that matching against literal clusters would be satisfied by instead supporting, for this example, the regular subexpression "(c|ch)" or the UnicodeSet expression "[c{ch}]"? Richard. From unicode at unicode.org Wed Oct 9 02:04:35 2019 From: unicode at unicode.org (Cibu via Unicode) Date: Wed, 9 Oct 2019 08:04:35 +0100 Subject: =?UTF-8?Q?Re=3A_Alternative_encodings_for_Malayalam_=E2=80=9Cnta=E2=80=9D?= In-Reply-To: <5E95EE1F-F41E-4407-AE44-BFFD1146DEAD@gmail.com> References: <49FE1395-9719-45A4-B3F6-0DACF22D313D@gmail.com> <5E95EE1F-F41E-4407-AE44-BFFD1146DEAD@gmail.com> Message-ID: On Mon, Oct 7, 2019 at 9:05 PM ?? Liang Hai wrote: > > Prior to Unicode 5.2, the encoding of the cluster [glyph] (< N, subscript RRA>> /nt?/) was not clearly defined. ? > > > You mean 5.1, right? The encoding has been specified since 5.1. > I couldn't get the text for 5.1 from https://www.unicode.org/versions/Unicode5.1.0. 
So I had to specify 5.2 for which the text is clear in https://www.unicode.org/versions/Unicode5.2.0/ch09.pdf > > ? and ? > > > How can implementations support this encoding without breaking the > side-by-side form ?? though? > Here is the difference between our approaches. You probably are trying to say that is a valid sequence and hence the requirement of being non-conflicting with the rest. I am not recommending that. I just wanted to document the fact there is significant usage of for stacked ??? and , to a lesser degree. Fonts may or may not resolve the conflict of sequence. However, higher level systems may be able to resolve it by additional context information. We should also continue to specify that is the standard sequence to help the input methods and other normalisation logic. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Oct 9 12:00:48 2019 From: unicode at unicode.org (=?utf-8?B?5qKB5rW3IExpYW5nIEhhaQ==?= via Unicode) Date: Wed, 9 Oct 2019 10:00:48 -0700 Subject: =?utf-8?Q?Re=3A_Alternative_encodings_for_Malayalam_=E2=80=9Cnta?= =?utf-8?Q?=E2=80=9D?= In-Reply-To: References: <49FE1395-9719-45A4-B3F6-0DACF22D313D@gmail.com> <5E95EE1F-F41E-4407-AE44-BFFD1146DEAD@gmail.com> Message-ID: <3E7E6D66-D868-44D6-89C9-432E1AA035E3@gmail.com> > On Oct 9, 2019, at 00:04, Cibu wrote: > > On Mon, Oct 7, 2019 at 9:05 PM 梁海 Liang Hai > wrote: > >> Prior to Unicode 5.2, the encoding of the cluster [glyph] (<> /nt?/) was not clearly defined. … > > You mean 5.1, right? The encoding has been specified since 5.1. > > I couldn't get the text for 5.1 from https://www.unicode.org/versions/Unicode5.1.0 . So I had to specify 5.2 for which the text is clear in https://www.unicode.org/versions/Unicode5.2.0/ch09.pdf Oh the Core Spec's 5.0 -> 5.1 delta is presented on the webpage itself, but not incorporated into the PDF: https://unicode.org/versions/Unicode5.1.0/#Malayalam_Chillu_Characters >> ? and ?
> > How can implementations support this encoding without breaking the side-by-side form ?? though? > > Here is the difference between our approaches. You probably are trying to say that is a valid sequence and hence the requirement of being non-conflicting with the rest. I am not recommending that. I just wanted to document the fact there is significant usage of for stacked ??? and , to a lesser degree. Fonts may or may not resolve the conflict of sequence. However, higher level systems may be able to resolve it by additional context information. We should also continue to specify that is the standard sequence to help the input methods and other normalisation logic. Right, I see. This aligns with the comments I received at the plenary discussion too. Gonna include both unideal encodings in a piece of proposed Core Spec edit, in a revised document. Best, 梁海 Liang Hai https://lianghai.github.io -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Oct 10 11:37:12 2019 From: unicode at unicode.org (Cibu via Unicode) Date: Thu, 10 Oct 2019 17:37:12 +0100 Subject: =?UTF-8?Q?Re=3A_Alternative_encodings_for_Malayalam_=E2=80=9Cnta=E2=80=9D?= In-Reply-To: <3E7E6D66-D868-44D6-89C9-432E1AA035E3@gmail.com> References: <49FE1395-9719-45A4-B3F6-0DACF22D313D@gmail.com> <5E95EE1F-F41E-4407-AE44-BFFD1146DEAD@gmail.com> <3E7E6D66-D868-44D6-89C9-432E1AA035E3@gmail.com> Message-ID: > > Oh the Core Spec's 5.0 -> 5.1 delta is presented on the webpage itself, > but not incorporated into the PDF: > > https://unicode.org/versions/Unicode5.1.0/#Malayalam_Chillu_Characters > > Thanks for pointing this out. ?? I had missed it. > Here is the difference between our approaches. You probably are trying to > say that is a valid sequence and hence the requirement of > being non-conflicting with the rest. I am not recommending that. I just > wanted to document the fact there is significant usage of > for stacked ??? and , to a lesser degree.
Fonts may > or may not resolve the conflict of sequence. > However, higher level systems may be able to resolve it by additional > context information. We should also continue to specify that VIRAMA, RRA> is the standard sequence to help the input methods and other > normalisation logic. > > > Right, I see. This aligns with the comments I received at the plenary > discussion too. Gonna include both unideal encodings in a piece of proposed > Core Spec edit, in a revised document. > So I assume the plan is to include this in the Core Spec edits along with the planned ones corresponding to L2/19-086R (chillu conjuncts) and L2/18-346 (general historical characters). Please keep me posted. Thanks! -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Oct 10 16:54:35 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 10 Oct 2019 22:54:35 +0100 Subject: Pure Regular Expression Engines and Literal Clusters In-Reply-To: <20191008152534.2068db6c@JRWUBU2> References: <20191008152534.2068db6c@JRWUBU2> Message-ID: <20191010225435.567382c6@JRWUBU2> On Tue, 8 Oct 2019 15:25:34 +0100 Richard Wordingham via Unicode wrote: > An example UTS#18 gives for matching a literal cluster can be > simplified to, in its notation: > > [c \q{ch}] > > This is interpreted as 'match against "ch" if possible, otherwise > against "c". Thus the strings "ca" and "cha" would both match the > expression > > [c \q{ch}]a > > while "chh" but not "ch" would match against > > [c \q{ch}]h > > Or have I got this wrong? After comparing this with the Perl behaviour of /(?:ch|c)/ and /(?:ch|c)h/, I've come to the conclusion that I've got the interpretation wrong. The former may match "ch" or "c", and I conclude that the only funny meaning of \q is to indicate a preference for the sequence of two characters - if the engine yields all matches, it has no meaning. This greatly simplifies matters. Richard.
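[Editorially, Richard's revised reading (that \q merely expresses a preference among alternatives) can be checked against any ordinary backtracking engine. A minimal sketch using Python's `re` module as a stand-in for Perl-style behaviour; the `[c \q{ch}]` notation itself exists only in UTS #18 and is rewritten here as an ordered alternation:]

```python
import re

# Order matters in an alternation: the leftmost branch is tried first.
assert re.match(r'ch|c', 'cha').group() == 'ch'  # longest-first prefers "ch"
assert re.match(r'c|ch', 'cha').group() == 'c'   # reversed order: "c" wins

# "[c \q{ch}]a" behaves like "(ch|c)a": both "ca" and "cha" match.
assert re.fullmatch(r'(ch|c)a', 'ca') is not None
assert re.fullmatch(r'(ch|c)a', 'cha') is not None

# A backtracking engine even matches "ch" against "(ch|c)h": the "ch"
# branch leaves no trailing "h", so it backtracks to "c" + "h".
assert re.fullmatch(r'(ch|c)h', 'ch') is not None
assert re.fullmatch(r'(ch|c)h', 'chh') is not None
```

The last two assertions show why "preference" is the right word: the longer branch is tried first, but the engine still falls back to the shorter one when taking "ch" would make the overall match fail.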
From unicode at unicode.org Thu Oct 10 17:23:00 2019 From: unicode at unicode.org (Markus Scherer via Unicode) Date: Thu, 10 Oct 2019 15:23:00 -0700 Subject: Pure Regular Expression Engines and Literal Clusters In-Reply-To: <20191008152534.2068db6c@JRWUBU2> References: <20191008152534.2068db6c@JRWUBU2> Message-ID: On Tue, Oct 8, 2019 at 7:28 AM Richard Wordingham via Unicode < unicode at unicode.org> wrote: > An example UTS#18 gives for matching a literal cluster can be simplified > to, in its notation: > > [c \q{ch}] > > This is interpreted as 'match against "ch" if possible, otherwise > against "c". Thus the strings "ca" and "cha" would both match the > expression > > [c \q{ch}]a > > while "chh" but not "ch" would match against > > [c \q{ch}]h > Right. We just independently discussed this today in the UTC meeting, connected with the "properties of strings" discussion in the proposed update. [c \q{ch}]h should work like (ch|c)h. Note that the order matters in the alternation -- so this works equivalently if longer strings are sorted first. May I correctly argue instead that matching against literal clusters > would be satisfied by instead supporting, for this example, the regular > subexpression "(c|ch)" or the UnicodeSet expression "[c{ch}]"? > ICU UnicodeSet [c{ch}] is equivalent to UTS #18 [c\q{ch}]. ICU's UnicodeSet syntax is simpler, the UTS #18 syntax is more backward-compatible. Best regards, markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Oct 11 01:46:21 2019 From: unicode at unicode.org (=?utf-8?B?TWFydGluIEouIETDvHJzdA==?= via Unicode) Date: Fri, 11 Oct 2019 06:46:21 +0000 Subject: Fwd: The Most Frequent Emoji In-Reply-To: <5D9DF525.5070300@unicode.org> References: <5D9DF525.5070300@unicode.org> Message-ID: I had a look at the page with the frequencies. Many emoji didn't display, but that's my browser's problem. 
What was worse was that the sidebar and the stuff at the bottom were all looking weird. I hope this can be fixed. Regards, Martin. -------- Forwarded Message -------- Subject: The Most Frequent Emoji Date: Wed, 09 Oct 2019 07:56:37 -0700 From: announcements at unicode.org Reply-To: root at unicode.org To: announcements at unicode.org [Emoji Frequency image] How does the Unicode Consortium choose which new emoji to add? One important factor is data about how frequently the current emoji are used. Patterns of usage help to inform decisions about future emoji. The Consortium has been working to assemble this information and make it available to the public. And the two most frequently used emoji in the world are... 😂 and ❤️ The new Unicode Emoji Frequency page shows a list of the Unicode v12.0 emoji ranked in order of how frequently they are used. "The forecasted frequency of use is a key factor in determining whether to encode new emoji, and for that it is important to know the frequency of use of existing emoji," said Mark Davis, President of the Unicode Consortium. "Understanding how frequently emoji are used helps prioritize which categories to focus on and which emoji to add to the Standard." ------------------------------------------------------------------------ /Over 136,000 characters are available for adoption, to help the Unicode Consortium's work on digitally disadvantaged languages./ [badge] http://blog.unicode.org/2019/10/the-most-frequent-emoji.html -------------- next part -------------- An HTML attachment was scrubbed...
URL: From unicode at unicode.org Fri Oct 11 05:39:56 2019 From: unicode at unicode.org (Elizabeth Mattijsen via Unicode) Date: Fri, 11 Oct 2019 12:39:56 +0200 Subject: Pure Regular Expression Engines and Literal Clusters In-Reply-To: References: <20191008152534.2068db6c@JRWUBU2> Message-ID: <33ADA35B-D882-4F39-8693-83B0C5F9796B@dijkmat.nl> > On 11 Oct 2019, at 00:23, Markus Scherer via Unicode wrote: > > On Tue, Oct 8, 2019 at 7:28 AM Richard Wordingham via Unicode wrote: > An example UTS#18 gives for matching a literal cluster can be simplified > to, in its notation: > > [c \q{ch}] > > This is interpreted as 'match against "ch" if possible, otherwise > against "c". Thus the strings "ca" and "cha" would both match the > expression > > [c \q{ch}]a > > while "chh" but not "ch" would match against > > [c \q{ch}]h > > Right. We just independently discussed this today in the UTC meeting, connected with the "properties of strings" discussion in the proposed update. > > [c \q{ch}]h should work like (ch|c)h. Note that the order matters in the alternation -- so this works equivalently if longer strings are sorted first. > > May I correctly argue instead that matching against literal clusters > would be satisfied by instead supporting, for this example, the regular > subexpression "(c|ch)" or the UnicodeSet expression "[c{ch}]"? > > ICU UnicodeSet [c{ch}] is equivalent to UTS #18 [c\q{ch}]. > > ICU's UnicodeSet syntax is simpler, the UTS #18 syntax is more backward-compatible. Not quite following this discussion, but I got triggered by the use of Perl in this discussion. In Perl 6 (which is a different language from Perl 5 altogether), regular expressions have been completely revamped. 
In Perl 6, the use of "|" indicates alternatives using longest token matching (LTM): https://docs.perl6.org/language/regexes#index-entry-regex_|-Longest_alternation:_| In Perl 6, the use of "||" indicates first matching alternative wins: https://docs.perl6.org/language/regexes#index-entry-regex_||-Alternation:_|| Furthermore, Perl 6 uses Normalization Form Grapheme for matching: https://docs.perl6.org/type/Cool#index-entry-Grapheme Hope this has some relevance to this discussion / gives new viewpoints. Elizabeth Mattijsen From unicode at unicode.org Fri Oct 11 06:35:07 2019 From: unicode at unicode.org (Fred Brennan via Unicode) Date: Fri, 11 Oct 2019 19:35:07 +0800 Subject: Will TAGALOG LETTER RA, currently in the pipeline, be in the next version of Unicode? Message-ID: <1712595.prWeGnbi0f@pc> Many users are asking me and I'm not sure of the answer (nor how to find it out). The UTC approved it, so it will be in the next version of Unicode, right? We sure hope so...it is a character needed to write a script in current use. Although only a minority of people care about it, that minority is dedicated! Best, Fred Brennan From unicode at unicode.org Fri Oct 11 11:50:16 2019 From: unicode at unicode.org (Markus Scherer via Unicode) Date: Fri, 11 Oct 2019 09:50:16 -0700 Subject: Will TAGALOG LETTER RA, currently in the pipeline, be in the next version of Unicode? In-Reply-To: <1712595.prWeGnbi0f@pc> References: <1712595.prWeGnbi0f@pc> Message-ID: On Fri, Oct 11, 2019 at 4:37 AM Fred Brennan via Unicode < unicode at unicode.org> wrote: > Many users are asking me and I'm not sure of the answer (nor how to find > it > out). > You can find out by looking at the data files that are being developed for Unicode 13. Look at the latest UnicodeData.txt in https://www.unicode.org/Public/13.0.0/ucd/ I don't see a TAGALOG LETTER RA there. DerivedAge.txt there shows Tagalog characters only from Unicode 3.2. 
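[The same kind of lookup can be run locally. As a sketch, Python's `unicodedata` module is pinned to whichever UCD version the interpreter was built with, so U+170D only resolves on builds whose database already includes the release that publishes TAGALOG LETTER RA:]

```python
import unicodedata

# Which UCD version this Python build ships with:
print('UCD version:', unicodedata.unidata_version)

# U+1700 TAGALOG LETTER A has been encoded since Unicode 3.2,
# so every modern database knows it:
assert unicodedata.name('\u1700') == 'TAGALOG LETTER A'

# U+170D is only known to databases that include the release
# in which TAGALOG LETTER RA was published; older builds raise:
try:
    print('U+170D:', unicodedata.name('\u170d'))
except ValueError:
    print('U+170D is unassigned in UCD', unicodedata.unidata_version)
```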
The next place to check would be the pipeline page: https://www.unicode.org/alloc/Pipeline.html It shows TAGALOG LETTER RA in the section "Characters Accepted or In Ballot for Future Versions". UTC accepted it just in July of this year, but it's not yet in ISO ballot. If all goes well, it could go into Unicode 14, March 2021. Best regards, markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Oct 11 12:17:19 2019 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Fri, 11 Oct 2019 10:17:19 -0700 Subject: Will TAGALOG LETTER RA, currently in the pipeline, be in the next version of Unicode? In-Reply-To: <1712595.prWeGnbi0f@pc> References: <1712595.prWeGnbi0f@pc> Message-ID: Short answer is no. The characters in the pipeline section labeled "Characters Accepted for Version 13.0" are what will be in the beta review for 13.0 (look for that sometime next month), and then eventually in the published Version 13.0 next month: https://www.unicode.org/alloc/Pipeline.html#planned_next_version Characters listed in the "Characters for Future Versions" table: https://www.unicode.org/alloc/Pipeline.html#future are not yet targeted for any particular version. Many of them, including the Tagalog letter RA, will end up published in Unicode 14.0, but the detailed decisions on what makes it into Unicode 14.0 won't happen until sometime next summer. Production of new versions of the Unicode Standard is a ponderous and lengthy operation, involving 4 UTC meetings, uncounted subcommittee meetings, dozens of specifications, hundreds of character properties, thousands of characters, hundreds of fonts, and intricate charts and QA process. It doesn't happen at the drop of a hat, which is why we schedule a full year for each new major release. So, in general, no, you can *never* assume that once the UTC has just approved a new character that it will be in the next version of Unicode. 
--Ken On 10/11/2019 4:35 AM, Fred Brennan via Unicode wrote: > Many users are asking me and I'm not sure of the answer (nor how to find it > out). > > The UTC approved it, so it will be in the next version of Unicode, right? > > We sure hope so...it is a character needed to write a script in current use. > Although only a minority of people care about it, that minority is dedicated! > > Best, > Fred Brennan From unicode at unicode.org Fri Oct 11 12:21:51 2019 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Fri, 11 Oct 2019 10:21:51 -0700 Subject: Will TAGALOG LETTER RA, currently in the pipeline, be in the next version of Unicode? In-Reply-To: References: <1712595.prWeGnbi0f@pc> Message-ID: <29d0aa85-88db-50da-ab40-411c898d407e@sonic.net> Sorry about the typo there. I meant "the published Version 13.0 next March" --Ken On 10/11/2019 10:17 AM, Ken Whistler wrote: > then eventually in the published Version 13.0 next month: From unicode at unicode.org Fri Oct 11 12:35:45 2019 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Fri, 11 Oct 2019 10:35:45 -0700 Subject: Unicode website glitches. (was The Most Frequent Emoji) In-Reply-To: References: <5D9DF525.5070300@unicode.org> Message-ID: There was a caching problem with WordPress, where you have to do a hard reload in some browsers. See if the problem still exists, and if the hard reload fixes it. If anyone else is having trouble with that, let us know. BTW, if you want to comment on the format as opposed to glitches, please change the subject line. Mark On Thu, Oct 10, 2019 at 11:50 PM Martin J. Dürst via Unicode < unicode at unicode.org> wrote: > I had a look at the page with the frequencies. Many emoji didn't > display, but that's my browser's problem. What was worse was that the > sidebar and the stuff at the bottom was all looking weird. I hope this > can be fixed. > > Regards, Martin.
> > -------- Forwarded Message -------- > Subject: The Most Frequent Emoji > Date: Wed, 09 Oct 2019 07:56:37 -0700 > From: announcements at unicode.org > Reply-To: root at unicode.org > To: announcements at unicode.org > > [Emoji Frequency image] How does the Unicode Consortium choose which new > emoji to add? One important factor is data about how frequently the > current emoji are used. Patterns of usage help to inform decisions about > future emoji. The Consortium has been working to assemble this > information and make it available to the public. > > And the two most frequently used emoji in the world are... > 😂 and ❤️ > The new Unicode Emoji Frequency > page shows a list of > the Unicode v12.0 emoji ranked in order of how frequently they are used. > > "The forecasted frequency of use is a key factor in determining whether > to encode new emoji, and for that it is important to know the frequency > of use of existing emoji," said Mark Davis, President of the Unicode > Consortium. "Understanding how frequently emoji are used helps > prioritize which categories to focus on and which emoji to add to the > Standard." > > ------------------------------------------------------------------------ > /Over 136,000 characters are available for adoption > , to help the > Unicode Consortium's work on digitally disadvantaged languages./ > > [badge] > > http://blog.unicode.org/2019/10/the-most-frequent-emoji.html > > > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From unicode at unicode.org Fri Oct 11 13:18:46 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 11 Oct 2019 19:18:46 +0100 Subject: Pure Regular Expression Engines and Literal Clusters In-Reply-To: <33ADA35B-D882-4F39-8693-83B0C5F9796B@dijkmat.nl> References: <20191008152534.2068db6c@JRWUBU2> <33ADA35B-D882-4F39-8693-83B0C5F9796B@dijkmat.nl> Message-ID: <20191011191846.39018209@JRWUBU2> On Fri, 11 Oct 2019 12:39:56 +0200 Elizabeth Mattijsen via Unicode wrote: > Furthermore, Perl 6 uses Normalization Form Grapheme for matching: > https://docs.perl6.org/type/Cool#index-entry-Grapheme I seriously doubt that a Thai considers each combination of consonant (44), non-spacing vowel (7) and tone mark (4) a different character. Moreover, if what you say is correct, perl6 will be useless for finding such combinations in correctly spelled text. The regular expression \p{insc=consonant}\p{insc=vowel_dependent}\p{insc=tone_mark} would find only misspellings because in correct Thai spelling, matching sequences constitute grapheme clusters. I trust perl6 will actually continue to support analyses of strings as sequences of codepoints. Richard. From unicode at unicode.org Fri Oct 11 14:01:58 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 11 Oct 2019 20:01:58 +0100 Subject: Pure Regular Expression Engines and Literal Clusters In-Reply-To: References: <20191008152534.2068db6c@JRWUBU2> Message-ID: <20191011200158.41a948f4@JRWUBU2> On Thu, 10 Oct 2019 15:23:00 -0700 Markus Scherer via Unicode wrote: > [c \q{ch}]h should work like (ch|c)h. Note that the order matters in > the alternation -- so this works equivalently if longer strings are > sorted first. Thanks for answering the question. Does conformance to UTS#18 level 2 mandate the choice of matching substring? This would appear to prohibit compliance to POSIX rules, where the length of overall match counts. Richard.
From unicode at unicode.org Fri Oct 11 16:35:33 2019 From: unicode at unicode.org (Markus Scherer via Unicode) Date: Fri, 11 Oct 2019 14:35:33 -0700 Subject: Pure Regular Expression Engines and Literal Clusters In-Reply-To: <20191011200158.41a948f4@JRWUBU2> References: <20191008152534.2068db6c@JRWUBU2> <20191011200158.41a948f4@JRWUBU2> Message-ID: On Fri, Oct 11, 2019 at 12:05 PM Richard Wordingham via Unicode < unicode at unicode.org> wrote: > On Thu, 10 Oct 2019 15:23:00 -0700 > Markus Scherer via Unicode wrote: > > > [c \q{ch}]h should work like (ch|c)h. Note that the order matters in > > the alternation -- so this works equivalently if longer strings are > > sorted first. > > Thanks for answering the question. > > Does conformance UTS#18 to level 2 mandate the choice of matching > substring? This would appear to prohibit compliance to POSIX rules, > where the length of overall match counts. > We just had a discussion this week. Mark will revise the proposed update. The idea is currently to specify properties-of-strings (and I think a range/class with "clusters") behaving like an alternation where the longest strings are first, and leaving it up to the regex engine exactly what that means. In general, UTS #18 offers a lot of things that regex implementers may or may not adopt. If you have specific ideas, please send them as PRI feedback. (Discussion on the list is good and useful, but does not guarantee that it gets looked at when it counts.) Best regards, markus -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Fri Oct 11 17:04:54 2019 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Fri, 11 Oct 2019 15:04:54 -0700 Subject: Will TAGALOG LETTER RA, currently in the pipeline, be in the next version of =?UTF-8?Q?Unicode=3F?= Message-ID: <20191011150454.665a7a7059d7ee80bb4d670165c8327d.151fa3f752.wbe@email03.godaddy.com> Ken Whistler wrote: > So, in general, no, you can *never* assume that once the UTC has just > approved a new character that it will be in the next version of > Unicode. I got quite a few messages like this when UTC approved the legacy computing characters in L2/19-025 last January. Great, that means I'll be able to start using and exchanging them in March, when Unicode 12.1 is released, right? Uh, no: 1. What Ken said above. 2. Unicode 12.1 was always just about the Reiwa sign. 3. Even when 13 comes out, fonts won't be immediately and magically updated to include them. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Fri Oct 11 17:28:01 2019 From: unicode at unicode.org (Rebecca Bettencourt via Unicode) Date: Fri, 11 Oct 2019 15:28:01 -0700 Subject: Will TAGALOG LETTER RA, currently in the pipeline, be in the next version of Unicode? In-Reply-To: <20191011150454.665a7a7059d7ee80bb4d670165c8327d.151fa3f752.wbe@email03.godaddy.com> References: <20191011150454.665a7a7059d7ee80bb4d670165c8327d.151fa3f752.wbe@email03.godaddy.com> Message-ID: > 3. Even when 13 comes out, fonts won't be immediately and magically > updated to include them. In this case, though, several fonts actually already include TAGALOG LETTER RA. :) "This spot, U+170D, has become a *de facto* standard among *baybayin* writers in the Philippines and the Filipino diaspora. Several modern fonts, including the one that appears on Philippine currency to write the word *Pilipino*, use U+170D as a "ra". (See §0.13) Software, if it can output ['ra'], uses U+170D.
(See §0.14) Documents online, if they include ['ra'], most often have it encoded as U+170D." (L2/19-258R, page 6) This proposal was special in that it was asking the Unicode Consortium to recognize a character that was already being used unofficially, so that organizations like the Google Noto team who are sticklers for Unicode compliance would include it. :) -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Oct 11 19:05:23 2019 From: unicode at unicode.org (=?utf-8?B?TWFydGluIEouIETDvHJzdA==?= via Unicode) Date: Sat, 12 Oct 2019 00:05:23 +0000 Subject: Unicode website glitches. (was The Most Frequent Emoji) In-Reply-To: References: <5D9DF525.5070300@unicode.org> Message-ID: Hello Mark, On 2019/10/12 02:35, Mark Davis ☕️ wrote: > There was a caching problem with WordPress, where you have to do a hard > reload in some browsers. See if the problem still exists, and if the hard > reload fixes it. If anyone else is having trouble with that, let us know. I can confirm that a hard reload fixed the problem. > BTW, if you want to comment on the format as opposed to glitches, please > change the subject line. I think it's less the format and much more the split personality of the Unicode Web site(s?) that I have problems with. Regards, Martin. > Mark > > > On Thu, Oct 10, 2019 at 11:50 PM Martin J. Dürst via Unicode < > unicode at unicode.org> wrote: > >> I had a look at the page with the frequencies. Many emoji didn't >> display, but that's my browser's problem. What was worse was that the >> sidebar and the stuff at the bottom was all looking weird. I hope this >> can be fixed. >> >> Regards, Martin. >> The new Unicode Emoji Frequency >> page shows a list of >> the Unicode v12.0 emoji ranked in order of how frequently they are used.
From unicode at unicode.org Fri Oct 11 20:02:12 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 12 Oct 2019 02:02:12 +0100 Subject: Pure Regular Expression Engines and Literal Clusters In-Reply-To: References: <20191008152534.2068db6c@JRWUBU2> <20191011200158.41a948f4@JRWUBU2> Message-ID: <20191012020212.6db1634a@JRWUBU2> On Fri, 11 Oct 2019 14:35:33 -0700 Markus Scherer via Unicode wrote: > > > [c \q{ch}]h should work like (ch|c)h. Note that the order matters > > > in the alternation -- so this works equivalently if longer > > > strings are sorted first. > > Does conformance to UTS #18 level 2 mandate the choice of matching > > substring? This would appear to prohibit compliance to POSIX rules, > > where the length of overall match counts. > The idea is currently to specify properties-of-strings (and I think a > range/class with "clusters") behaving like an alternation where the > longest strings are first, and leaving it up to the regex engine > exactly what that means. > > In general, UTS #18 offers a lot of things that regex implementers > may or may not adopt. > If you have specific ideas, please send them as PRI feedback. > (Discussion on the list is good and useful, but does not guarantee > that it gets looked at when it counts.) You claimed the order of alternatives mattered. That is an important issue for anyone rash enough to think that the standard is fit to be used as a specification. I'm still not entirely clear what a regular expression /[\u00c1\u00e1]/ can mean. If the system uses NFD to simulate Unicode conformance, shall the expression then be converted to /[{A\u0301}{a\u0301}]/? Or should it simply fail to match any NFD string? I've been implementing the view that all or none of the canonical equivalents of a string match. (I therefore support mildly discontiguous substrings, though I don't support splitting undecomposable characters.) Richard.
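Richard's /[\u00c1\u00e1]/ question is easy to reproduce with any code-point-level engine. A small Python sketch (Python's `re` here merely stands in for such an engine; the normalize-both-sides workaround shown at the end is the usual practice, not something proposed in the thread):

```python
import re
import unicodedata

# U+00E1 (precomposed a-acute) and "a" + U+0301 (combining acute) are
# canonically equivalent, but a code-point-level engine sees two
# different sequences.
pat = re.compile("[\u00c1\u00e1]")
print(bool(pat.fullmatch("\u00e1")))   # True
print(bool(pat.fullmatch("a\u0301")))  # False: two code points, not one
# The common workaround: normalize pattern text and input to one form.
print(bool(pat.fullmatch(unicodedata.normalize("NFC", "a\u0301"))))  # True
```

This is exactly the behaviour UTS #18 acknowledges when it advises normalizing input rather than expecting engines to implement canonical equivalence internally.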
From unicode at unicode.org Fri Oct 11 20:37:18 2019 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Fri, 11 Oct 2019 18:37:18 -0700 Subject: Pure Regular Expression Engines and Literal Clusters In-Reply-To: <20191012020212.6db1634a@JRWUBU2> References: <20191008152534.2068db6c@JRWUBU2> <20191011200158.41a948f4@JRWUBU2> <20191012020212.6db1634a@JRWUBU2> Message-ID: > > You claimed the order of alternatives mattered. That is an important > issue for anyone rash enough to think that the standard is fit to be > used as a specification. > Regex engines differ in how they handle the interpretation of the matching of alternatives, and it is not possible for us to wave a magic wand to change them. What we can do is specify how the interpretation of the properties of strings works. By specifying that they behave like alternation AND adding the extra constraint of having longer first, we minimize the differences across regex engines. > > I'm still not entirely clear what a regular expression /[\u00c1\u00e1]/ > can mean. If the system uses NFD to simulate Unicode conformance, > shall the expression then be converted to /[{A\u0301}{a\u0301}]/? Or > should it simply fail to match any NFD string? I've been implementing > the view that all or none of the canonical equivalents of a string > match. (I therefore support mildly discontiguous substrings, though I > don't support splitting undecomposable characters.) > We came to the conclusion years ago that regex engines cannot reasonably be expected to implement canonical equivalence; they are really working at a lower level. So you see the advice we give at http://unicode.org/reports/tr18/#Canonical_Equivalents. (Again, no magic wand.) > Richard. > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Sat Oct 12 03:16:30 2019 From: unicode at unicode.org (=?utf-8?Q?Daniel_B=C3=BCnzli?= via Unicode) Date: Sat, 12 Oct 2019 10:16:30 +0200 Subject: Website format (was Re: Unicode website glitches. (was The Most Frequent Emoji)) In-Reply-To: References: <5D9DF525.5070300@unicode.org> Message-ID: On 12 October 2019 at 02:05:23, Martin J. Dürst via Unicode (unicode at unicode.org) wrote: > I think it's less the format and much more the split personality of the > Unicode Web site(s?) that I have problems with. I also do. One thing that is particularly annoying is the fact that the "home" link on the "technical" (unchanged) subpart of the website gets back to the "marketing" home page, which is particularly inefficient (the links you are looking for are not above the fold on a laptop screen) and confusing (the whole layout shifts and the theme changes) for perusing the technical part of the website. With all due respect for the work that has been done on the new website, I think that the new structure significantly decreased the usability of the website for technical users. Best, Daniel From unicode at unicode.org Sat Oct 12 03:23:36 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Sat, 12 Oct 2019 01:23:36 -0700 Subject: Website format (was Re: Unicode website glitches. (was The Most Frequent Emoji)) In-Reply-To: References: <5D9DF525.5070300@unicode.org> Message-ID: <9222b376-02cf-2530-5777-21120a4917da@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Oct 12 05:15:38 2019 From: unicode at unicode.org (Fred Brennan via Unicode) Date: Sat, 12 Oct 2019 18:15:38 +0800 Subject: Will TAGALOG LETTER RA, currently in the pipeline, be in the next version of Unicode?
In-Reply-To: References: <20191011150454.665a7a7059d7ee80bb4d670165c8327d.151fa3f752.wbe@email03.godaddy.com> Message-ID: <4306889.x27fmyrm67@pc> On Saturday, October 12, 2019 6:28:01 AM PST Rebecca Bettencourt via Unicode wrote: > This proposal was special in that it was asking the Unicode Consortium to > recognize a character that was already being used unofficially, so that > organizations like the Google Noto team who are sticklers for Unicode > compliance would include it. :) Indeed - it is extremely unfortunate that users will need to wait until 2021(!) to get it into Unicode so Google will finally add it to the Noto fonts. There seems to be no conscionable reason for such a long delay after the approval. If that's just how things are done, fine, I certainly can't change the whole system. But imagine if you had to wait two years to even have a chance of using a letter you desperately need to write your language? Imagine if the letter "Q" was unencoded and Noto refused to add it for two more years? From unicode at unicode.org Sat Oct 12 07:17:55 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 12 Oct 2019 13:17:55 +0100 Subject: Pure Regular Expression Engines and Literal Clusters In-Reply-To: References: <20191008152534.2068db6c@JRWUBU2> <20191011200158.41a948f4@JRWUBU2> <20191012020212.6db1634a@JRWUBU2> Message-ID: <20191012131755.7749a622@JRWUBU2> On Fri, 11 Oct 2019 18:37:18 -0700 Mark Davis ☕️ via Unicode wrote: > > > > You claimed the order of alternatives mattered. That is an > > important issue for anyone rash enough to think that the standard > > is fit to be used as a specification. > > > > Regex engines differ in how they handle the interpretation of the > matching of alternatives, and it is not possible for us to wave a > magic wand to change them. But you are close to waving a truncheon to deprecate some of them. And even if you do not wave the truncheon, you will provide other people a stick to beat them with.
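The point about alternation order can be seen directly in a backtracking, leftmost-first engine. A Python sketch (illustrative only; UTS #18's \q{ch} notation is modelled as a plain alternation):

```python
import re

# Leftmost-first engines (Perl, Python) try alternatives in written
# order, which is why the "longer strings first" convention matters:
# [c \q{ch}] should be modelled as (?:ch|c), not (?:c|ch).
print(re.match("(ch|c)", "ch").group())  # ch
print(re.match("(c|ch)", "ch").group())  # c

# With more pattern following, a backtracking engine recovers either
# way -- the engines differ mainly in what the alternation captures.
assert re.fullmatch("(?:ch|c)h", "ch")
assert re.fullmatch("(?:c|ch)h", "ch")
```

A POSIX leftmost-longest engine would pick the longer alternative regardless of order, which is the divergence the ordering guidance tries to minimize.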
> What we can do is specify how the interpretation of the properties of > strings works. By specifying that they behave like alternation AND > adding the extra constraint of having longer first, we minimize the > differences across regex engines. But remember that 'having longer first' is meaningless for a non-deterministic finite automaton that does a single pass through the string to be searched. > > I'm still not entirely clear what a regular > > expression /[\u00c1\u00e1]/ can mean. If the system uses NFD to > > simulate Unicode conformance, shall the expression then be > > converted to /[{A\u0301}{a\u0301}]/? Or should it simply fail to > > match any NFD string? I've been implementing the view that all or > > none of the canonical equivalents of a string match. (I therefore > > support mildly discontiguous substrings, though I don't support > > splitting undecomposable characters.) > > We came to the conclusion years ago that regex engines cannot > reasonably be expected to implement canonical equivalence; they are > really working at a lower level. So does a lot of text processing. The issue should simply be that the change is too complicated for straightforward implementation: (1) One winds up with slightly discontiguous substrings: the non-starters at the beginning and end may not be contiguous. (2) If one does not work with NFD, one ends up with parts of characters in substrings. (3) If one does not work with NFD (thereby formally avoiding the issue of Unicode equivalence), replacing a non-starter by a character of a different ccc is in general not a Unicode-compliant process. (This avoidance technique can be necessary for the Unicode Collation Algorithm.) (4) The algorithm for recognising concatenation and iteration (more precisely, their closures under canonical equivalence) need to be significantly rewritten. One needs to be careful with optimisation - some approaches could lead to reducing an FSM with over 2^54 states. 
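Points (1)-(3) above can be made concrete with the character Richard brings up further on, U+0F73, whose canonical decomposition is two non-starters. A Python sketch using the standard `unicodedata` module:

```python
import unicodedata

# U+0F73 TIBETAN VOWEL SIGN II decomposes canonically into two
# non-starters: U+0F71 (ccc 129) and U+0F72 (ccc 130).
nfd = unicodedata.normalize("NFD", "\u0f73")
print([f"U+{ord(c):04X}" for c in nfd])         # ['U+0F71', 'U+0F72']
print([unicodedata.combining(c) for c in nfd])  # [129, 130]

# Because both halves are non-starters, other marks with intermediate
# combining classes can legally sit between them in an equivalent
# string -- hence the discontiguous substrings discussed above.
```

This is also why U+0F73 serves later in the thread as the simplest problem case for iterating (Kleene-starring) an expression under canonical equivalence.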
The issue of concatenation and iteration is largely solved in the theory of traces and regular expressions, though there is still the issue of when the iteration (Kleene star) of a regular expression (for traces) is itself regular. In the literature, this issue is called the 'star problem'. One practical answer is that the Kleene star is itself regular if it is generated from the set of strings matching the regular expression that either contain NFD non-starters or all of whose characters have the same ccc. An unchecked requirement that Kleene stars all be of this form would probably not be too great a problem - one could probably dress this up by 'only fully supporting Kleene star that is the same as the "concurrent star"'. Another one is that recognition algorithms do not need to restrict themselves to *regular* expressions - back references are not 'regular' either. /\u0F73*/ is probably the simplest example of a non-regular Kleene star in the Unicode strings under canonical equivalence. (That character is a problem character for ICU collation.) However, /[[:Tibetan:]&[:insc=vowel_dependent:]]*/ is regular, as removing U+0F73 from the Unicode set does not change its iteration. Contrariwise, there might be a formal issue with giving preference over if one used the iteration algorithm for regular-only Kleene star. > So you see the advice we give at > http://unicode.org/reports/tr18/#Canonical_Equivalents. (Again, no > magic wand.) So who's got tools for converting the USE's expression for a 'standard cluster' into a regular expression that catches all NFD equivalents of the original expression? There may be perfection issues - the star problem may be unsolved for sequences of Unicode strings under canonical equivalence. Annoyingly, I can't find any text but my own that relates traces to Unicode! The trick of converting strings to NFD before searching them is certainly useful. 
Even with an engine respecting canonical equivalence, it cuts the 2^54 I mentioned down to 54, the number of non-zero canonical combining classes currently in use. Of course, such a reduction is not fully consistent with the spirit of a finite state machine. Richard. From unicode at unicode.org Sat Oct 12 07:50:57 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 12 Oct 2019 13:50:57 +0100 Subject: Will TAGALOG LETTER RA, currently in the pipeline, be in the next version of Unicode? In-Reply-To: <4306889.x27fmyrm67@pc> References: <20191011150454.665a7a7059d7ee80bb4d670165c8327d.151fa3f752.wbe@email03.godaddy.com> <4306889.x27fmyrm67@pc> Message-ID: <20191012135057.4fd93a51@JRWUBU2> On Sat, 12 Oct 2019 18:15:38 +0800 Fred Brennan via Unicode wrote: > Indeed - it is extremely unfortunate that users will need to wait > until 2021(!) to get it into Unicode so Google will finally add it to > the Noto fonts. > There seems to be no conscionable reason for such a long delay after > the approval. The UTC's accepting a character does not mean it will make it into Unicode. In the ISO process it may yet be rejected, renumbered or renamed. These things have certainly happened for new scripts. Richard. From unicode at unicode.org Sat Oct 12 10:06:25 2019 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Sat, 12 Oct 2019 08:06:25 -0700 Subject: Will TAGALOG LETTER RA, currently in the pipeline, be in the next version of Unicode? In-Reply-To: <4306889.x27fmyrm67@pc> References: <20191011150454.665a7a7059d7ee80bb4d670165c8327d.151fa3f752.wbe@email03.godaddy.com> <4306889.x27fmyrm67@pc> Message-ID: On 10/12/2019 3:15 AM, Fred Brennan via Unicode wrote: > There seems to be no conscionable reason for such a long delay after the > approval. > > If that's just how things are done, fine, I certainly can't change the whole > system.
But imagine if you had to wait two years to even have a chance of > using a letter you desperately need to write your language? Imagine if the > letter "Q" was unencoded and Noto refused to add it for two more years? Well, as long as we are imagining things, then consider a scenario where the UTC is presented a proposal for encoding a writing system which is reported as an historic artifact of the 18th century, "fallen out of normal use", yet encodes it anyway based on the proposal provided in 1999: https://www.unicode.org/L2/L1999/n1933.pdf and publishes it in Unicode 3.2 in 2002: https://www.unicode.org/standard/supported.html Then imagine that a community works to revive use of that script (now known as Baybayin) and extends character use in it based on similar characters in related, more contemporaneous scripts, but that the first time the UTC actually formally hears about that extension is on July 18, 2019: https://www.unicode.org/L2/L2019/19258r-baybayin-ra.pdf And then imagine that despite a 17 year gap before this supposedly urgent defect in an encoding is reported to the UTC, that the UTC in fact approves encoding of U+170D TAGALOG LETTER RA at its very *first* opportunity, eight days later, on July 26, 2019. Further imagine that the UTC immediately publishes what amounts to a "letter of intent" to publish this character when it can: https://www.unicode.org/alloc/Pipeline.html#future It may then be understandable that some UTC participants might be puzzled to be accused of unconscionable delays in this case. I understand the frustration that you are expressing, but it simply isn't feasible for every proposal's advocates to get their particular candidates pushed to the front of the line for publication. 
Unicode 13.0 is creaking down the track towards its March 10, 2020 publication, but it already is contending with 5930 new characters (as well as additional emoji sequences beyond that), every one of which was approved by the UTC *prior* to July 26, 2019 and all of which are already in some advanced stage of ISO ballot consideration. In the meantime, Baybayin users are inconvenienced, sure, but it is unlikely that the interim solutions will just break, because nobody is opposed to U+170D TAGALOG LETTER RA, and it is exceedingly unlikely that that code point would be moved before its eventual publication in the standard in March, 2021. --Ken From unicode at unicode.org Sat Oct 12 13:28:02 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 12 Oct 2019 19:28:02 +0100 Subject: Will TAGALOG LETTER RA, currently in the pipeline, be in the next version of Unicode? In-Reply-To: <4306889.x27fmyrm67@pc> References: <20191011150454.665a7a7059d7ee80bb4d670165c8327d.151fa3f752.wbe@email03.godaddy.com> <4306889.x27fmyrm67@pc> Message-ID: <20191012192802.01bd24b3@JRWUBU2> On Sat, 12 Oct 2019 18:15:38 +0800 Fred Brennan via Unicode wrote: > Indeed - it is extremely unfortunate that users will need to wait > until 2021(!) to get it into Unicode so Google will finally add it to > the Noto fonts. > If that's just how things are done, fine, I certainly can't change > the whole system. But imagine if you had to wait two years to even > have a chance of using a letter you desperately need to write your > language? Update me on what the problem with using the character *now* is. If the character is so important, why do you need to wait for Noto fonts? I can imagine a much bigger problem - you could have the problem that the Baybayin script is 'supported'. This could result in dotted circles between RA and the combining marks. It took ages between the addition of U+0BB6 TAMIL LETTER SHA to Unicode and obtaining a renderer that acknowledged it as a Tamil letter.
You should be (or are you?) badgering HarfBuzz to speculatively support it. (There may be other problems in the system.) > Imagine if the letter "Q" was unencoded and Noto refused to > add it for two more years? On private PCs, having Noto support for a script can actually be an unmitigated disaster. Richard. From unicode at unicode.org Sat Oct 12 14:36:45 2019 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Sat, 12 Oct 2019 21:36:45 +0200 Subject: Pure Regular Expression Engines and Literal Clusters In-Reply-To: <20191012131755.7749a622@JRWUBU2> References: <20191008152534.2068db6c@JRWUBU2> <20191011200158.41a948f4@JRWUBU2> <20191012020212.6db1634a@JRWUBU2> <20191012131755.7749a622@JRWUBU2> Message-ID: <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com> > On 12 Oct 2019, at 14:17, Richard Wordingham via Unicode wrote: > > But remember that 'having longer first' is meaningless for a > non-deterministic finite automaton that does a single pass through the > string to be searched. It is possible to identify all submatches deterministically in linear time without backtracking; I made an algorithm for that. A selection among different submatches then requires additional rules.
Perhaps you can put my mind at rest about whether it works at all with scripts that subordinate vowels. If I wanted to find the occurrences of the Pali word _pacati_ 'to cook' in Latin script text using form NFG, I could use a Perl regular expression like /\b(?:a|pa)?p[a?]c(?:\B.)*/. (At least, grep -P '\b(?:a|pa)?p[a?]c\p{Ll}*' file.txt works on text in NFC. I couldn't work out the command-line expression to display a list of matches from Perl, and the PCRE \B is broken beyond ASCII in GNU grep 2.25.) How would I do such a search in an Indic script using form NFG? The main issue is that the single character 'c' would have to expand to a list of all but one of the Pali grapheme clusters whose initial consonant transliterates to 'c'. Have you a notation for such a class? Regards, Richard. From unicode at unicode.org Sat Oct 12 17:37:05 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 12 Oct 2019 23:37:05 +0100 Subject: Pure Regular Expression Engines and Literal Clusters In-Reply-To: <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com> References: <20191008152534.2068db6c@JRWUBU2> <20191011200158.41a948f4@JRWUBU2> <20191012020212.6db1634a@JRWUBU2> <20191012131755.7749a622@JRWUBU2> <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com> Message-ID: <20191012233705.52544fb9@JRWUBU2> On Sat, 12 Oct 2019 21:36:45 +0200 Hans Åberg via Unicode wrote: > > On 12 Oct 2019, at 14:17, Richard Wordingham via Unicode > > wrote: > > > > But remember that 'having longer first' is meaningless for a > > non-deterministic finite automaton that does a single pass through > > the string to be searched. > > It is possible to identify all submatches deterministically in linear > time without backtracking; I made an algorithm for that. That's impressive, as the number of possible submatches for a*(a*)a* is quadratic in the string length. > A selection among different submatches then requires additional rules. Regards, Richard.
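Richard's quadratic count for a*(a*)a* is easy to verify by brute force. A small Python sketch (the closed form C(n+2, 2) is my arithmetic, not from the thread):

```python
from math import comb

# Over a string of n "a"s, the captured (a*) in a*(a*)a* can start at
# any position i and end at any j >= i, so there are C(n+2, 2) distinct
# splits of the string into the three parts -- quadratic in n.
def split_count(n: int) -> int:
    return sum(1 for i in range(n + 1) for j in range(i, n + 1))

print(split_count(4))  # 15
print(all(split_count(n) == comb(n + 2, 2) for n in (0, 1, 10, 50)))  # True
```

So an algorithm that reports all submatches explicitly cannot be linear; reporting them in a compact shared structure, as Hans goes on to describe, is what makes a linear-time pass plausible.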
From unicode at unicode.org Sun Oct 13 03:04:34 2019 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Sun, 13 Oct 2019 10:04:34 +0200 Subject: Pure Regular Expression Engines and Literal Clusters In-Reply-To: <20191012233705.52544fb9@JRWUBU2> References: <20191008152534.2068db6c@JRWUBU2> <20191011200158.41a948f4@JRWUBU2> <20191012020212.6db1634a@JRWUBU2> <20191012131755.7749a622@JRWUBU2> <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com> <20191012233705.52544fb9@JRWUBU2> Message-ID: > On 13 Oct 2019, at 00:37, Richard Wordingham via Unicode wrote: > > On Sat, 12 Oct 2019 21:36:45 +0200 > Hans Åberg via Unicode wrote: > >>> On 12 Oct 2019, at 14:17, Richard Wordingham via Unicode >>> wrote: >>> >>> But remember that 'having longer first' is meaningless for a >>> non-deterministic finite automaton that does a single pass through >>> the string to be searched. >> >> It is possible to identify all submatches deterministically in linear >> time without backtracking; I made an algorithm for that. > > That's impressive, as the number of possible submatches for a*(a*)a* is > quadratic in the string length. That is probably after the possibilities in the matching graph have been expanded, which can even be exponential. As an analogy, think of a polynomial product: I compute the product, not the expansion.
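The product-not-expansion analogy corresponds to the classic set-of-states NFA simulation, where all alternatives advance in parallel in one pass. A Python sketch over a hypothetical hand-built transition table for (ch|c)h (this is an illustration of the general technique, not Hans's actual algorithm):

```python
# One-pass simulation of an NFA for (ch|c)h: every live alternative is
# advanced in parallel, like keeping a polynomial in product form
# instead of expanding it.  States: 0 start, 3 accepting.
delta = {
    (0, "c"): {1, 2},  # "c" starts both the "ch" and the bare "c" branch
    (1, "h"): {2},     # completes the "ch" alternative
    (2, "h"): {3},     # the trailing "h" of the whole pattern
}

def nfa_match(s: str) -> bool:
    states = {0}
    for ch in s:  # single pass, no backtracking
        states = {q2 for q in states for q2 in delta.get((q, ch), set())}
    return 3 in states

print(nfa_match("ch"))   # True  ("c" alternative, then "h")
print(nfa_match("chh"))  # True  ("ch" alternative, then "h")
print(nfa_match("h"))    # False
```

The state set stays bounded by the automaton size, so recognition is linear in the input even when the number of distinct accepting derivations is not.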
From unicode at unicode.org Sun Oct 13 08:00:18 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 13 Oct 2019 14:00:18 +0100 Subject: Pure Regular Expression Engines and Literal Clusters In-Reply-To: References: <20191008152534.2068db6c@JRWUBU2> <20191011200158.41a948f4@JRWUBU2> <20191012020212.6db1634a@JRWUBU2> <20191012131755.7749a622@JRWUBU2> <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com> <20191012233705.52544fb9@JRWUBU2> Message-ID: <20191013140018.5ea512bc@JRWUBU2> On Sun, 13 Oct 2019 10:04:34 +0200 Hans Åberg via Unicode wrote: > > On 13 Oct 2019, at 00:37, Richard Wordingham via Unicode > > wrote: > > > > On Sat, 12 Oct 2019 21:36:45 +0200 > > Hans Åberg via Unicode wrote: > > > >>> On 12 Oct 2019, at 14:17, Richard Wordingham via Unicode > >>> wrote: > >>> > >>> But remember that 'having longer first' is meaningless for a > >>> non-deterministic finite automaton that does a single pass through > >>> the string to be searched. > >> > >> It is possible to identify all submatches deterministically in > >> linear time without backtracking; I made an algorithm for > >> that. > > > > That's impressive, as the number of possible submatches for > > a*(a*)a* is quadratic in the string length. > > That is probably after the possibilities in the matching graph have > been expanded, which can even be exponential. As an analogy, think of > a polynomial product: I compute the product, not the expansion. I'm now beginning to wonder what you are claiming. One thing one can do without backtracking is to determine which capture groups capture something, and which combinations of capturing or not occur. That's a straightforward extension of doing the overall 'recognition' in linear time - at least, linear in length (n) of the searched string. (I say straightforward, but it would mess up my state naming algorithm.)
The time can also depend on the complexity of the regular expression, which can be bounded by the length (m) of the expression if working with mere strings, giving time O(mn) if one doesn't undertake the worst case O(2^m) task of converting the non-deterministic FSM to a deterministic FSM. Using m as a complexity measure for traces may be misleading, and I think plain wrong; for moderate m, the complexity can easily go up as fast as m^10, and I think higher powers are possible. Strings exercising the higher complexities are linguistically implausible. Regards, Richard. From unicode at unicode.org Sun Oct 13 08:29:04 2019 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Sun, 13 Oct 2019 15:29:04 +0200 Subject: Pure Regular Expression Engines and Literal Clusters In-Reply-To: <20191013140018.5ea512bc@JRWUBU2> References: <20191008152534.2068db6c@JRWUBU2> <20191011200158.41a948f4@JRWUBU2> <20191012020212.6db1634a@JRWUBU2> <20191012131755.7749a622@JRWUBU2> <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com> <20191012233705.52544fb9@JRWUBU2> <20191013140018.5ea512bc@JRWUBU2> Message-ID: <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com> > On 13 Oct 2019, at 15:00, Richard Wordingham via Unicode wrote: > >>> On Sat, 12 Oct 2019 21:36:45 +0200 >>> Hans Åberg via Unicode wrote: >>> >>>>> On 12 Oct 2019, at 14:17, Richard Wordingham via Unicode >>>>> wrote: >>>>> >>>>> But remember that 'having longer first' is meaningless for a >>>>> non-deterministic finite automaton that does a single pass through >>>>> the string to be searched. >>>> >>>> It is possible to identify all submatches deterministically in >>>> linear time without backtracking; I made an algorithm for >>>> that. > > I'm now beginning to wonder what you are claiming. I start with an NFA with no empty transitions and apply the subset DFA construction dynamically for a given string along with some reverse NFA-data that is enough to traverse backwards when a final state arrives.
The result is an NFA where every traversal is a match of the string at that position. From unicode at unicode.org Sun Oct 13 14:17:54 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 13 Oct 2019 20:17:54 +0100 Subject: Pure Regular Expression Engines and Literal Clusters In-Reply-To: <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com> References: <20191008152534.2068db6c@JRWUBU2> <20191011200158.41a948f4@JRWUBU2> <20191012020212.6db1634a@JRWUBU2> <20191012131755.7749a622@JRWUBU2> <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com> <20191012233705.52544fb9@JRWUBU2> <20191013140018.5ea512bc@JRWUBU2> <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com> Message-ID: <20191013201754.6597fdd0@JRWUBU2> On Sun, 13 Oct 2019 15:29:04 +0200 Hans Åberg via Unicode wrote: > > On 13 Oct 2019, at 15:00, Richard Wordingham via Unicode > > I'm now beginning to wonder what you are claiming. > I start with an NFA with no empty transitions and apply the subset DFA > construction dynamically for a given string along with some reverse > NFA-data that is enough to traverse backwards when a final state > arrives. The result is an NFA where every traversal is a match of the > string at that position. And then the speed comparison depends on how quickly one can extract the match information required from that data structure. Incidentally, at least some of the sizes and timings I gave seem to be wrong even for strings. They won't work with numeric quantifiers, as in /[ab]{0,20}[ac]{10,20}[ad]{0,20}e/. One gets lesser issues in quantifying complexity if one wants "?" to match \p{Lu} when working in NFD - potentially a different state for each prefix of the capital letters. (It's also the case except for UTF-32 if characters are treated as sequences of code units.) Perhaps 'upper case letter that Unicode happens to have encoded as a single character' isn't a concept that regular expressions need to support concisely. What's needed is to have a set somewhere between
What's needed is to have a set somewhere between [\p{Lu}&\p{isNFD}] and [\p{Lu}],though perhaps it should be extended to include "ff" - there are English surnames like "ffrench". Regards, Richard. From unicode at unicode.org Sun Oct 13 15:14:10 2019 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Sun, 13 Oct 2019 22:14:10 +0200 Subject: Pure Regular Expression Engines and Literal Clusters In-Reply-To: <20191013201754.6597fdd0@JRWUBU2> References: <20191008152534.2068db6c@JRWUBU2> <20191011200158.41a948f4@JRWUBU2> <20191012020212.6db1634a@JRWUBU2> <20191012131755.7749a622@JRWUBU2> <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com> <20191012233705.52544fb9@JRWUBU2> <20191013140018.5ea512bc@JRWUBU2> <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com> <20191013201754.6597fdd0@JRWUBU2> Message-ID: > On 13 Oct 2019, at 21:17, Richard Wordingham via Unicode wrote: > > On Sun, 13 Oct 2019 15:29:04 +0200 > Hans ?berg via Unicode wrote: > >>> On 13 Oct 2019, at 15:00, Richard Wordingham via Unicode >>> I'm now beginning to wonder what you are claiming. > >> I start with a NFA with no empty transitions and apply the subset DFA >> construction dynamically for a given string along with some reverse >> NFA-data that is enough to transverse backwards when a final state >> arrives. The result is a NFA where all transverses is a match of the >> string at that position. > > And then the speed comparison depends on how quickly one can extract > the match information required from that data structure. Yes. For example, one should match the saved DFA in constant time, if matched as dynamic sets which is linear in set size, then one can get quadratic time complexity in string size. Even though one can iterate through each match NFA in linear time, it could have say two choices at each character position each leading to the next, which would give an exponential size relative the string length. 
Normally one is not interested in all matches; the disambiguation rules take care of that. > Incidentally, at least some of the sizes and timings I gave seem to be > wrong even for strings. They won't work with numeric quantifiers, as > in /[ab]{0,20}[ac]{10,20}[ad]{0,20}e/. For those, one normally implements a loop iteration. I did not do that. I mentioned this method to Tim Shen on the libstdc++ list, so perhaps he might have implemented something. > One gets lesser issues in quantifying complexity if one wants "?" to > match \p{Lu} when working in NFD - potentially a different state for > each prefix of the capital letters. (It's also the case except for > UTF-32 if characters are treated as sequences of code units.) Perhaps > 'upper case letter that Unicode happens to have encoded as a single > character' isn't a concept that regular expressions need to support > concisely. What's needed is to have a set somewhere between > [\p{Lu}&\p{isNFD}] and [\p{Lu}], though perhaps it should be extended to > include "ff" - there are English surnames like "ffrench". I made some C++ templates that translate Unicode code point character classes into UTF-8/32 regular expressions. So anything that can be reduced to actual regular expressions would work.
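The idea of compiling code-point classes down to byte-level patterns, which Hans implemented with C++ templates, can be sketched in Python. This toy version handles only two-byte UTF-8 ranges whose continuation bytes span the full 0x80-0xBF (true for the range shown) and is not a general translator:

```python
import re

# Compile a code-point range into a byte-level UTF-8 character-class
# pattern.  Demo only: both endpoints must encode to two bytes, and the
# lead-byte range must fully cover its continuation bytes.
def utf8_range(lo: int, hi: int) -> bytes:
    lo_b = chr(lo).encode("utf-8")
    hi_b = chr(hi).encode("utf-8")
    assert len(lo_b) == len(hi_b) == 2, "demo covers 2-byte sequences only"
    return b"[" + lo_b[:1] + b"-" + hi_b[:1] + b"][\x80-\xbf]"

# U+0100..U+017F (Latin Extended-A) is C4 80 .. C5 BF in UTF-8.
pat = re.compile(utf8_range(0x0100, 0x017F))
print(bool(pat.match("\u0101".encode("utf-8"))))  # True (small a-macron)
print(bool(pat.match(b"abc")))                    # False
```

A real translator must split a range at every lead-byte boundary and emit an alternation of such byte classes, which is exactly the bookkeeping templates (or a code generator) are good at.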
From unicode at unicode.org Sun Oct 13 16:54:12 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 13 Oct 2019 22:54:12 +0100 Subject: Pure Regular Expression Engines and Literal Clusters In-Reply-To: References: <20191008152534.2068db6c@JRWUBU2> <20191011200158.41a948f4@JRWUBU2> <20191012020212.6db1634a@JRWUBU2> <20191012131755.7749a622@JRWUBU2> <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com> <20191012233705.52544fb9@JRWUBU2> <20191013140018.5ea512bc@JRWUBU2> <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com> <20191013201754.6597fdd0@JRWUBU2> Message-ID: <20191013225412.4f1772ca@JRWUBU2> On Sun, 13 Oct 2019 22:14:10 +0200 Hans Åberg via Unicode wrote: > > On 13 Oct 2019, at 21:17, Richard Wordingham via Unicode > > wrote: > > Incidentally, at least some of the sizes and timings I gave seem to > > be wrong even for strings. They won't work with numeric > > quantifiers, as in /[ab]{0,20}[ac]{10,20}[ad]{0,20}e/. > > One gets lesser issues in quantifying complexity if one wants "?" to > > match \p{Lu} when working in NFD - potentially a different state for > > each prefix of the capital letters. (It's also the case except for > > UTF-32 if characters are treated as sequences of code units.) > > Perhaps 'upper case letter that Unicode happens to have encoded as > > a single character' isn't a concept that regular expressions need > > to support concisely. What's needed is to have a set somewhere > > between [\p{Lu}&\p{isNFD}] and [\p{Lu}], though perhaps it should be > > extended to include "ff" - there are English surnames like > > "ffrench". The point about these examples is that the estimate of one state per character becomes a severe underestimate. For example, after processing 20 a's, the NFA for /[ab]{0,20}[ac]{10,20}[ad]{0,20}e/ can be in any of about 50 states. The number of possible states is not linear in the length of the expression.
While a 'loop iteration' can keep the size of the compiled regex down, it doesn't prevent the proliferation of states - just add zeroes to my example.

> I made some C++ templates that translate Unicode code point character
> classes into UTF-8/32 regular expressions. So anything that can be
> reduced to actual regular expressions would work.

Besides invalidating complexity metrics, the issue was what \p{Lu} should match. For example, with PCRE syntax, GNU grep Version 2.25 \p{Lu} matches U+0100 but not <U+0041, U+0304>. When I'm respecting canonical equivalence, I want both to match [:Lu:], and that's what I do. [:Lu:] can then match a sequence of up to 4 NFD characters.

Regards,
Richard.

From unicode at unicode.org Sun Oct 13 17:22:36 2019
From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode)
Date: Mon, 14 Oct 2019 00:22:36 +0200
Subject: Pure Regular Expression Engines and Literal Clusters
In-Reply-To: <20191013225412.4f1772ca@JRWUBU2>
References: <20191008152534.2068db6c@JRWUBU2> <20191011200158.41a948f4@JRWUBU2> <20191012020212.6db1634a@JRWUBU2> <20191012131755.7749a622@JRWUBU2> <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com> <20191012233705.52544fb9@JRWUBU2> <20191013140018.5ea512bc@JRWUBU2> <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com> <20191013201754.6597fdd0@JRWUBU2> <20191013225412.4f1772ca@JRWUBU2>
Message-ID: <7DFB381E-EAE3-4B04-AA50-2E18E764E731@telia.com>

> On 13 Oct 2019, at 23:54, Richard Wordingham via Unicode wrote:
>
> The point about these examples is that the estimate of one state per
> character becomes a severe underestimate. For example, after
> processing 20 a's, the NFA for /[ab]{0,20}[ac]{10,20}[ad]{0,20}e/ can
> be in any of about 50 states. The number of possible states is not
> linear in the length of the expression. While a 'loop iteration' can
> keep the size of the compiled regex down, it doesn't prevent the
> proliferation of states - just add zeroes to my example.
Formally, only the expansion of such ranges is an NFA, and I haven't seen anyone considering the complexity with them included. So to me, it seems just a hack.

>> I made some C++ templates that translate Unicode code point character
>> classes into UTF-8/32 regular expressions. So anything that can be
>> reduced to actual regular expressions would work.
>
> Besides invalidating complexity metrics, the issue was what \p{Lu}
> should match. For example, with PCRE syntax, GNU grep Version 2.25
> \p{Lu} matches U+0100 but not <U+0041, U+0304>. When I'm respecting
> canonical equivalence, I want both to match [:Lu:], and that's what I
> do. [:Lu:] can then match a sequence of up to 4 NFD characters.

Hopefully some experts here can tune in, explaining exactly what regular expressions they have in mind.

From unicode at unicode.org Sun Oct 13 19:10:45 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Mon, 14 Oct 2019 01:10:45 +0100
Subject: Pure Regular Expression Engines and Literal Clusters
In-Reply-To: <7DFB381E-EAE3-4B04-AA50-2E18E764E731@telia.com>
References: <20191008152534.2068db6c@JRWUBU2> <20191011200158.41a948f4@JRWUBU2> <20191012020212.6db1634a@JRWUBU2> <20191012131755.7749a622@JRWUBU2> <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com> <20191012233705.52544fb9@JRWUBU2> <20191013140018.5ea512bc@JRWUBU2> <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com> <20191013201754.6597fdd0@JRWUBU2> <20191013225412.4f1772ca@JRWUBU2> <7DFB381E-EAE3-4B04-AA50-2E18E764E731@telia.com>
Message-ID: <20191014011045.35c851e9@JRWUBU2>

On Mon, 14 Oct 2019 00:22:36 +0200
Hans Åberg via Unicode wrote:

> > On 13 Oct 2019, at 23:54, Richard Wordingham via Unicode
> > wrote:
>
>>> Besides invalidating complexity metrics, the issue was what \p{Lu}
>>> should match. For example, with PCRE syntax, GNU grep Version 2.25
>>> \p{Lu} matches U+0100 but not <U+0041, U+0304>. When I'm respecting
>>> canonical equivalence, I want both to match [:Lu:], and that's what
>>> I do.
[:Lu:] can then match a sequence of up to 4 NFD characters.

> Hopefully some experts here can tune in, explaining exactly what
> regular expressions they have in mind.

The best indication lies at https://www.unicode.org/reports/tr18/tr18-13.html#Canonical_Equivalents (2008), which is the last version before support for canonical equivalence was dropped as a requirement.

It's not entirely coherent, as the authors don't seem to find an expression like

\p{L}\p{gcb=extend}*

a natural thing to use, as the second factor is mostly sequences of non-starters. At that point, I would say they weren't expecting \p{Lu} to not match <U+0041, U+0304>, as they were still expecting [ä] to match both "ä" and "a\u0308".

They hadn't given any thought to [\p{L}&&\p{isNFD}]\p{gcb=extend}*, and were expecting normalisation (even to NFC) to be a possible cure. They had begun to realise that converting expressions to match all or none of a set of canonical equivalents was hard; the issue of non-contiguous matches wasn't mentioned.

When I say 'hard', I'm thinking of the problem that concatenation may require dissolution of the two constituent expressions and involve the temporary creation of 54-fold (if text is handled as NFD) or 2^54-fold (no normalisation) sets of extra states. That's what's driven me to write my own regular expression engine for traces.

Regards,
Richard.
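The "up to 4 NFD characters" behaviour can be checked with Python's unicodedata (a sketch; U+01DE, which decomposes to three code points, is just a convenient example):

```python
import unicodedata as ud

# U+01DE LATIN CAPITAL LETTER A WITH DIAERESIS AND MACRON is a single
# Lu character whose NFD form is a starter plus two non-starters.
ch = "\u01DE"
assert ud.category(ch) == "Lu"

nfd = ud.normalize("NFD", ch)        # 'A' + U+0308 + U+0304
assert len(nfd) == 3
assert ud.category(nfd[0]) == "Lu"   # the starter is still Lu
assert all(ud.category(c) == "Mn" for c in nfd[1:])  # trailed by marks

# A per-code-point [:Lu:] matches the precomposed form but not its NFD
# equivalent; closing the class under canonical equivalence means letting
# it consume the whole starter-plus-marks sequence.
```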
From unicode at unicode.org Sun Oct 13 19:13:28 2019
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Sun, 13 Oct 2019 17:13:28 -0700
Subject: Pure Regular Expression Engines and Literal Clusters
In-Reply-To: <20191013225412.4f1772ca@JRWUBU2>
References: <20191008152534.2068db6c@JRWUBU2> <20191011200158.41a948f4@JRWUBU2> <20191012020212.6db1634a@JRWUBU2> <20191012131755.7749a622@JRWUBU2> <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com> <20191012233705.52544fb9@JRWUBU2> <20191013140018.5ea512bc@JRWUBU2> <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com> <20191013201754.6597fdd0@JRWUBU2> <20191013225412.4f1772ca@JRWUBU2>
Message-ID: <7b42cd5c-d78f-3583-39c1-65ee209eefa5@ix.netcom.com>

An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Sun Oct 13 20:38:58 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Mon, 14 Oct 2019 02:38:58 +0100
Subject: Pure Regular Expression Engines and Literal Clusters
In-Reply-To: <7b42cd5c-d78f-3583-39c1-65ee209eefa5@ix.netcom.com>
References: <20191008152534.2068db6c@JRWUBU2> <20191011200158.41a948f4@JRWUBU2> <20191012020212.6db1634a@JRWUBU2> <20191012131755.7749a622@JRWUBU2> <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com> <20191012233705.52544fb9@JRWUBU2> <20191013140018.5ea512bc@JRWUBU2> <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com> <20191013201754.6597fdd0@JRWUBU2> <20191013225412.4f1772ca@JRWUBU2> <7b42cd5c-d78f-3583-39c1-65ee209eefa5@ix.netcom.com>
Message-ID: <20191014023858.5d2be8ae@JRWUBU2>

On Sun, 13 Oct 2019 17:13:28 -0700
Asmus Freytag via Unicode wrote:

> On 10/13/2019 2:54 PM, Richard Wordingham via Unicode wrote:
> Besides invalidating complexity metrics, the issue was what \p{Lu}
> should match. For example, with PCRE syntax, GNU grep Version 2.25
> \p{Lu} matches U+0100 but not <U+0041, U+0304>. When I'm respecting
> canonical equivalence, I want both to match [:Lu:], and that's what I
> do. [:Lu:] can then match a sequence of up to 4 NFD characters.
> > Formally, wouldn't that be rewriting \p{Lu} to match \p{Lu}\p{Mn}*;
> instead of formally handling NFD, you could extend the syntax to
> handle "inherited" properties across combining sequences.
>
> Am I missing anything?

Yes. There is no precomposed LATIN LETTER M WITH CIRCUMFLEX, so [:Lu:] should not match <U+004D LATIN CAPITAL LETTER M, U+0302 COMBINING CIRCUMFLEX ACCENT>. Now, I could invent a string property \p{xLu} that meant (?:\p{Lu}\p{Mn}*).

I don't entirely understand what you said; you may have missed the distinction between "[:Lu:] can then match" and "[:Lu:] will then match". I think only Greek letters expand to 4 characters in NFD.

When I'm respecting canonical equivalence/working with traces, I want [:insc=vowel_dependent:][:insc=tone_mark:] to match both <U+0E39 THAI CHARACTER SARA UU, U+0E49 THAI CHARACTER MAI THO> and its canonical equivalent <U+0E49 THAI CHARACTER MAI THO, U+0E39 THAI CHARACTER SARA UU>. The canonical closure of that sequence can be messy even within scripts. Some pairs commute: others don't, usually for good reasons.

Regards,
Richard.

From unicode at unicode.org Sun Oct 13 22:25:25 2019
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Sun, 13 Oct 2019 20:25:25 -0700
Subject: Pure Regular Expression Engines and Literal Clusters
In-Reply-To: <20191014023858.5d2be8ae@JRWUBU2>
References: <20191008152534.2068db6c@JRWUBU2> <20191011200158.41a948f4@JRWUBU2> <20191012020212.6db1634a@JRWUBU2> <20191012131755.7749a622@JRWUBU2> <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com> <20191012233705.52544fb9@JRWUBU2> <20191013140018.5ea512bc@JRWUBU2> <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com> <20191013201754.6597fdd0@JRWUBU2> <20191013225412.4f1772ca@JRWUBU2> <7b42cd5c-d78f-3583-39c1-65ee209eefa5@ix.netcom.com> <20191014023858.5d2be8ae@JRWUBU2>
Message-ID: <3dbebbdf-04fa-e9b2-9780-cc40f4e5d15e@ix.netcom.com>

An HTML attachment was scrubbed...
URL: From unicode at unicode.org Sun Oct 13 23:28:34 2019 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Sun, 13 Oct 2019 21:28:34 -0700 Subject: Pure Regular Expression Engines and Literal Clusters In-Reply-To: <3dbebbdf-04fa-e9b2-9780-cc40f4e5d15e@ix.netcom.com> References: <20191008152534.2068db6c@JRWUBU2> <20191011200158.41a948f4@JRWUBU2> <20191012020212.6db1634a@JRWUBU2> <20191012131755.7749a622@JRWUBU2> <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com> <20191012233705.52544fb9@JRWUBU2> <20191013140018.5ea512bc@JRWUBU2> <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com> <20191013201754.6597fdd0@JRWUBU2> <20191013225412.4f1772ca@JRWUBU2> <7b42cd5c-d78f-3583-39c1-65ee209eefa5@ix.netcom.com> <20191014023858.5d2be8ae@JRWUBU2> <3dbebbdf-04fa-e9b2-9780-cc40f4e5d15e@ix.netcom.com> Message-ID: The problem is that most regex engines are not written to handle some "interesting" features of canonical equivalence, like discontinuity. Suppose that X is canonically equivalent to AB. - A query /X/ can match the separated A and C in the target string "AbC". So if I have code do [replace /X/ in "AbC" by "pq"], how should it behave? "pqb", "pbq", "bpq"? If the input was in NFD (for example), should the output be rearranged/decomposed so that it is NFD? and so on. - A query /A/ can match *part* of the X in the target string "aXb". So if I have code to do [replace /A/ in "aXb" by "pq"], what should result: "apqBb"? The syntax and APIs for regex engines are not built to handle these features. It introduces a enough complications in the code, syntax, and semantics that no major implementation has seen fit to do it. We used to have a section in the spec about this, but were convinced that it was better off handled at a higher level. 
Mark On Sun, Oct 13, 2019 at 8:31 PM Asmus Freytag via Unicode < unicode at unicode.org> wrote: > On 10/13/2019 6:38 PM, Richard Wordingham via Unicode wrote: > > On Sun, 13 Oct 2019 17:13:28 -0700 > Asmus Freytag via Unicode wrote: > > > On 10/13/2019 2:54 PM, Richard Wordingham via Unicode wrote: > Besides invalidating complexity metrics, the issue was what \p{Lu} > should match. For example, with PCRE syntax, GNU grep Version 2.25 > \p{Lu} matches U+0100 but not . When I'm respecting > canonical equivalence, I want both to match [:Lu:], and that's what I > do. [:Lu:] can then match a sequence of up to 4 NFD characters. > > Formally, wouldn't that be rewriting \p{Lu} to match \p{Lu}\p{Mn}*; > instead of formally handling NFD, you could extend the syntax to > handle "inherited" properties across combining sequences. > > Am I missing anything? > > Yes. There is no precomposed LATIN LETTER M WITH CIRCUMFLEX, so [:Lu:] > should not match CIRCUMFLEX ACCENT>. > > Why does it matter if it is precomposed? Why should it? (For anyone other > than a character coding maven). > > Now, I could invent a string property so > that \p{xLu} that meant (:?\p{Lu}\p{Mn}*). > > I don't entirely understand what you said; you may have missed the > distinction between "[:Lu:] can then match" and "[:Lu:] will then > match". I think only Greek letters expand to 4 characters in NFD. > > When I'm respecting canonical equivalence/working with traces, I want > [:insc=vowel_dependent:][:insc=tone_mark:] to match both CHARACTER SARA UU, U+0E49 THAI CHARACTER MAI THO> and its canonical > equivalent . The canonical closure of that > sequence can be messy even within scripts. Some pairs commute: others > don't, usually for good reasons. > > Some models may be more natural for different scripts. Certainly, in SEA > or Indic scripts, most combining marks are not best modeled with properties > as "inherited". But for L/G/C etc. it would be a different matter. 
> > For general recommendations, such as UTS#18, it would be good to move the > state of the art so that the "primitives" are in line with the way typical > writing systems behave, so that people can write "linguistically correct" > regexes. > > A./ > > > Regards, > > Richard. > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Oct 14 02:05:49 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Mon, 14 Oct 2019 10:05:49 +0300 Subject: Pure Regular Expression Engines and Literal Clusters In-Reply-To: <20191014011045.35c851e9@JRWUBU2> (message from Richard Wordingham via Unicode on Mon, 14 Oct 2019 01:10:45 +0100) References: <20191008152534.2068db6c@JRWUBU2> <20191011200158.41a948f4@JRWUBU2> <20191012020212.6db1634a@JRWUBU2> <20191012131755.7749a622@JRWUBU2> <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com> <20191012233705.52544fb9@JRWUBU2> <20191013140018.5ea512bc@JRWUBU2> <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com> <20191013201754.6597fdd0@JRWUBU2> <20191013225412.4f1772ca@JRWUBU2> <7DFB381E-EAE3-4B04-AA50-2E18E764E731@telia.com> <20191014011045.35c851e9@JRWUBU2> Message-ID: <83mue3kdrm.fsf@gnu.org> > Date: Mon, 14 Oct 2019 01:10:45 +0100 > From: Richard Wordingham via Unicode > > >> Besides invalidating complexity metrics, the issue was what \p{Lu} > >> should match. For example, with PCRE syntax, GNU grep Version 2.25 > >> \p{Lu} matches U+0100 but not . When I'm respecting > >> canonical equivalence, I want both to match [:Lu:], and that's what > >> I do. [:Lu:] can then match a sequence of up to 4 NFD characters. > > > Hopefully some experts here can tune in, explaining exactly what > > regular expressions they have in mind. > > The best indication lies at > https://www.unicode.org/reports/tr18/tr18-13.html#Canonical_Equivalents > (2008), which is the last version before support for canonical > equivalence was dropped as a requirement. 
> > It's not entirely coherent, as the authors don't seem to find an > expression like > > \p{L}\p{gcb=extend}* > > a natural thing to use, as the second factor is mostly sequences of > non-starters. At that point, I would say they weren't expecting > \p{Lu} to not match , as they were still expecting [?] to > match both "?" and "a\u0308". > > They hadn't given any thought to [\p{L}&&\p{isNFD}]\p{gcb=extend}*, and > were expecting normalisation (even to NFC) to be a possible cure. They > had begun to realise that converting expressions to match all or none > of a set of canonical equivalents was hard; the issue of non-contiguous > matches wasn't mentioned. I think these are two separate issues: whether search should normalize (a.k.a. performs character folding) should be a user option. You are talking only about canonical equivalence, but there's also compatibility decomposition, so, for example, searching for "1" should perhaps match ? and ?. From unicode at unicode.org Mon Oct 14 02:18:54 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 14 Oct 2019 08:18:54 +0100 Subject: Pure Regular Expression Engines and Literal Clusters In-Reply-To: <3dbebbdf-04fa-e9b2-9780-cc40f4e5d15e@ix.netcom.com> References: <20191008152534.2068db6c@JRWUBU2> <20191011200158.41a948f4@JRWUBU2> <20191012020212.6db1634a@JRWUBU2> <20191012131755.7749a622@JRWUBU2> <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com> <20191012233705.52544fb9@JRWUBU2> <20191013140018.5ea512bc@JRWUBU2> <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com> <20191013201754.6597fdd0@JRWUBU2> <20191013225412.4f1772ca@JRWUBU2> <7b42cd5c-d78f-3583-39c1-65ee209eefa5@ix.netcom.com> <20191014023858.5d2be8ae@JRWUBU2> <3dbebbdf-04fa-e9b2-9780-cc40f4e5d15e@ix.netcom.com> Message-ID: <20191014081854.020a0f2d@JRWUBU2> On Sun, 13 Oct 2019 20:25:25 -0700 Asmus Freytag via Unicode wrote: > On 10/13/2019 6:38 PM, Richard Wordingham via Unicode wrote: > On Sun, 13 Oct 2019 17:13:28 -0700 >> Yes. 
There is no precomposed LATIN LETTER M WITH CIRCUMFLEX, so >> [:Lu:] should not match > COMBINING CIRCUMFLEX ACCENT>. > Why does it matter if it is precomposed? Why should it? (For anyone > other than a character coding maven). Because general_category is a property of characters, not strings. It matters to anyone who intends to conform to a standard. >> Now, I could invent a string >> property so that \p{xLu} that meant (:?\p{Lu}\p{Mn}*). No, I shouldn't! \m{xLu} is infinite, which would not be allowed for a Unicode set. I'd have to resort to a wordy definition for it to be a property. Richard. From unicode at unicode.org Mon Oct 14 02:46:07 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 14 Oct 2019 08:46:07 +0100 Subject: Pure Regular Expression Engines and Literal Clusters In-Reply-To: References: <20191008152534.2068db6c@JRWUBU2> <20191011200158.41a948f4@JRWUBU2> <20191012020212.6db1634a@JRWUBU2> <20191012131755.7749a622@JRWUBU2> <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com> <20191012233705.52544fb9@JRWUBU2> <20191013140018.5ea512bc@JRWUBU2> <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com> <20191013201754.6597fdd0@JRWUBU2> <20191013225412.4f1772ca@JRWUBU2> <7b42cd5c-d78f-3583-39c1-65ee209eefa5@ix.netcom.com> <20191014023858.5d2be8ae@JRWUBU2> <3dbebbdf-04fa-e9b2-9780-cc40f4e5d15e@ix.netcom.com> Message-ID: <20191014084607.6d133fd6@JRWUBU2> On Sun, 13 Oct 2019 21:28:34 -0700 Mark Davis ?? via Unicode wrote: > The problem is that most regex engines are not written to handle some > "interesting" features of canonical equivalence, like discontinuity. > Suppose that X is canonically equivalent to AB. > > - A query /X/ can match the separated A and C in the target string > "AbC". So if I have code do [replace /X/ in "AbC" by "pq"], how > should it behave? "pqb", "pbq", "bpq"? If A contains a non-starter, pqbC. If C contains a non-starter, Abpq. 
Otherwise, if the results are canonically inequivalent, it should raise an exception for attempting a process that is either ill-defined or not Unicode-compliant. > If the input was in NFD (for > example), should the output be rearranged/decomposed so that it is > NFD? and so on. That is not a new issue. It exists already. > - A query /A/ can match *part* of the X in the target string > "aXb". So if I have code to do [replace /A/ in "aXb" by "pq"], what > should result: "apqBb"? Yes, unless raising an exception is appropriate (see above). > The syntax and APIs for regex engines are not built to handle these > features. It introduces a enough complications in the code, syntax, > and semantics that no major implementation has seen fit to do it. We > used to have a section in the spec about this, but were convinced > that it was better off handled at a higher level. What higher level? If anything, I would say that the handler is at a lower level (character fragments and the like). The potential requirement should be restored, but not subsumed in Levels 1 to 3. It is a sufficiently different level of endeavour. Richard. 
From unicode at unicode.org Mon Oct 14 08:08:01 2019
From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode)
Date: Mon, 14 Oct 2019 15:08:01 +0200
Subject: Pure Regular Expression Engines and Literal Clusters
In-Reply-To: <20191014011045.35c851e9@JRWUBU2>
References: <20191008152534.2068db6c@JRWUBU2> <20191011200158.41a948f4@JRWUBU2> <20191012020212.6db1634a@JRWUBU2> <20191012131755.7749a622@JRWUBU2> <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com> <20191012233705.52544fb9@JRWUBU2> <20191013140018.5ea512bc@JRWUBU2> <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com> <20191013201754.6597fdd0@JRWUBU2> <20191013225412.4f1772ca@JRWUBU2> <7DFB381E-EAE3-4B04-AA50-2E18E764E731@telia.com> <20191014011045.35c851e9@JRWUBU2>
Message-ID: <6EFD1B4D-BF9F-4586-8E9C-878B59B61FC4@telia.com>

> On 14 Oct 2019, at 02:10, Richard Wordingham via Unicode wrote:
>
> On Mon, 14 Oct 2019 00:22:36 +0200
> Hans Åberg via Unicode wrote:
>
>>> On 13 Oct 2019, at 23:54, Richard Wordingham via Unicode
>>> wrote:
>
>>> Besides invalidating complexity metrics, the issue was what \p{Lu}
>>> should match. For example, with PCRE syntax, GNU grep Version 2.25
>>> \p{Lu} matches U+0100 but not <U+0041, U+0304>. When I'm respecting
>>> canonical equivalence, I want both to match [:Lu:], and that's what
>>> I do. [:Lu:] can then match a sequence of up to 4 NFD characters.
>
>> Hopefully some experts here can tune in, explaining exactly what
>> regular expressions they have in mind.
>
> The best indication lies at
> https://www.unicode.org/reports/tr18/tr18-13.html#Canonical_Equivalents

The browser refuses to load it: the certificate expired one day ago, it says, risking theft of personal and financial information. So one has to load the totally insecure HTTP page, at the risk of creating mayhem on the computer. :-)

> (2008), which is the last version before support for canonical
> equivalence was dropped as a requirement.

As said there, one might add all the equivalents if one can find them.
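The "add all the equivalents" approach can be sketched by brute force for a short string (the hand-picked candidate alphabet is an assumption for brevity; a real implementation would derive candidates from the decomposition mappings):

```python
import itertools
import unicodedata as ud

# Brute-force the canonical-equivalence class of a + dot below + macron
# over a small alphabet including the relevant precomposed forms.
target = ud.normalize("NFD", "a\u0323\u0304")
alphabet = ["a", "\u0323", "\u0304", "\u1EA1", "\u0101"]  # incl. ạ and ā

equivalents = set()
for n in range(1, 4):
    for combo in itertools.product(alphabet, repeat=n):
        s = "".join(combo)
        if ud.normalize("NFD", s) == target:
            equivalents.add(s)

# Four spellings of one equivalence class: a+0323+0304, a+0304+0323,
# U+1EA1+0304, and U+0101+0323.
assert len(equivalents) == 4
```

Even this three-code-point example has four spellings, which hints at how quickly an "expand everything" regex grows for longer combining sequences.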
Alternatively, one could normalize the regex and the string, keeping track of the translation boundaries on the string so that it can be translated back to a match on the original string if called for.

From unicode at unicode.org Mon Oct 14 13:29:39 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Mon, 14 Oct 2019 19:29:39 +0100
Subject: Pure Regular Expression Engines and Literal Clusters
In-Reply-To: <83mue3kdrm.fsf@gnu.org>
References: <20191008152534.2068db6c@JRWUBU2> <20191011200158.41a948f4@JRWUBU2> <20191012020212.6db1634a@JRWUBU2> <20191012131755.7749a622@JRWUBU2> <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com> <20191012233705.52544fb9@JRWUBU2> <20191013140018.5ea512bc@JRWUBU2> <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com> <20191013201754.6597fdd0@JRWUBU2> <20191013225412.4f1772ca@JRWUBU2> <7DFB381E-EAE3-4B04-AA50-2E18E764E731@telia.com> <20191014011045.35c851e9@JRWUBU2> <83mue3kdrm.fsf@gnu.org>
Message-ID: <20191014192939.34ea39ce@JRWUBU2>

On Mon, 14 Oct 2019 10:05:49 +0300
Eli Zaretskii via Unicode wrote:

> > Date: Mon, 14 Oct 2019 01:10:45 +0100
> > From: Richard Wordingham via Unicode
>
> > They hadn't given any thought to [\p{L}&&\p{isNFD}]\p{gcb=extend}*,
> > and were expecting normalisation (even to NFC) to be a possible
> > cure. They had begun to realise that converting expressions to
> > match all or none of a set of canonical equivalents was hard; the
> > issue of non-contiguous matches wasn't mentioned.
>
> I think these are two separate issues: whether search should normalize
> (a.k.a. perform character folding) should be a user option. You are
> talking only about canonical equivalence, but there's also
> compatibility decomposition, so, for example, searching for "1" should
> perhaps match ① and ¹.
For these, it is useful to have tools that treat them differently. However, the normal presumption should be that canonically equivalent text is the same. The party line seems to be that most searching should actually be done using a 'collation', which brings with it different levels of 'folding'. In multilingual use, a collation used for searching should be quite different to one used for sorting. Now, there is a case for being able to switch off normalisation and canonical equivalence generally, e.g. when dealing with ISO 10646 text instead of Unicode text. This of course still leaves the question of what character classes defined by Unicode properties then mean. If one converts the regular expression so that what it matches is closed under canonical equivalence, then visibly normalising the searched text becomes irrelevant. For working with Unicode traces, I actually do both. I convert the text to NFD but report matches in terms of the original code point sequence; working this way simplifies the conversion of the regular expression, which I do as part of its compilation. For traces, it seems only natural to treat precomposed characters as syntactic sugar for the NFD decompositions. (They have no place in the formal theory of traces.) However, I go further and convert the decomposed text to NFD. (Recall that conversion to NFD can change the stored order of combining marks.) One of the simplifications I get is that straight runs of text in the regular expression then match in the middle just by converting that run and the searched strings. For the concatenation of expressions A and B, once I am looking at the possible interleaving of two traces, I am dealing with NFA states of the form states(A) ? {1..254} ? states(B), so that for an element (a, n, b), a corresponds to starts of words with a match in A, b corresponds to starts of _words_ with a match in B, and n is the ccc of the last character used to advance to b. 
The element n blocks non-starters that can't belong to a word matching A. If I didn't (internally) convert the searched text to NFD, the element n would have to be a set of blocked canonical combining classes, changing the number of possible values from 54 to 2^54 - 1.

While aficionados of regular languages may object that converting the searched text to NFD is cheating, there is a theorem that if I have a finite automaton that recognises a family of NFD strings, there is another finite automaton that will recognise all their canonical equivalents.

Richard.

From unicode at unicode.org Mon Oct 14 13:41:19 2019
From: unicode at unicode.org (Eli Zaretskii via Unicode)
Date: Mon, 14 Oct 2019 21:41:19 +0300
Subject: Pure Regular Expression Engines and Literal Clusters
In-Reply-To: <20191014192939.34ea39ce@JRWUBU2> (message from Richard Wordingham via Unicode on Mon, 14 Oct 2019 19:29:39 +0100)
References: <20191008152534.2068db6c@JRWUBU2> <20191011200158.41a948f4@JRWUBU2> <20191012020212.6db1634a@JRWUBU2> <20191012131755.7749a622@JRWUBU2> <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com> <20191012233705.52544fb9@JRWUBU2> <20191013140018.5ea512bc@JRWUBU2> <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com> <20191013201754.6597fdd0@JRWUBU2> <20191013225412.4f1772ca@JRWUBU2> <7DFB381E-EAE3-4B04-AA50-2E18E764E731@telia.com> <20191014011045.35c851e9@JRWUBU2> <83mue3kdrm.fsf@gnu.org> <20191014192939.34ea39ce@JRWUBU2>
Message-ID: <83bluji300.fsf@gnu.org>

> Date: Mon, 14 Oct 2019 19:29:39 +0100
> From: Richard Wordingham via Unicode
>
> On Mon, 14 Oct 2019 10:05:49 +0300
> Eli Zaretskii via Unicode wrote:
>
> > I think these are two separate issues: whether search should normalize
> > (a.k.a. perform character folding) should be a user option. You are
> > talking only about canonical equivalence, but there's also
> > compatibility decomposition, so, for example, searching for "1" should
> > perhaps match ① and ¹.
>
> HERETIC!
> The official position is that text that is canonically
> equivalent is the same. There are problem areas where traditional
> modes of expression require that canonically equivalent text be treated
> differently. For these, it is useful to have tools that treat them
> differently. However, the normal presumption should be that
> canonically equivalent text is the same.

I'm well aware of the official position. However, when we attempted to implement it unconditionally in Emacs, some people objected, and brought up good reasons. You can, of course, elect to disregard this experience, and instead learn it from your own.

> The party line seems to be that most searching should actually be done
> using a 'collation', which brings with it different levels of
> 'folding'. In multilingual use, a collation used for searching should
> be quite different to one used for sorting.

Alas, collation is locale- and language-dependent. And, if you are going to use your search in a multilingual application (Emacs is such an application), you will have a hard time even knowing which tailoring to apply for each potential match, because you will need to support the use case of working with text that mixes languages.

Leaving the conundrum to the user to resolve seems to be a good compromise, and might actually teach us something that is useful for future modifications of the "party line".
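A user-selectable folding of the kind discussed in this exchange can be approximated with compatibility normalization plus case folding (an illustration only; a searching collation per UTS #10 handles far more, including the language-specific tailorings mentioned above):

```python
import unicodedata as ud

def fold(s, compat=True):
    # Optional folding: NFKC erases compatibility distinctions, casefold
    # erases case; with compat=False only canonical differences collapse.
    s = ud.normalize("NFKC" if compat else "NFC", s)
    return s.casefold()

assert fold("\u2460") == "1"                 # ① CIRCLED DIGIT ONE
assert fold("\u00B9") == "1"                 # ¹ SUPERSCRIPT ONE
assert fold("\u00B9", compat=False) != "1"   # user opted out of the folding
assert fold("A\u0308") == fold("\u00E4")     # ä in either spelling, either case
```

Making `compat` (and similar knobs) a user option is exactly the compromise proposed here: the engine offers the foldings, and the user decides which distinctions matter.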
From unicode at unicode.org Mon Oct 14 18:23:59 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 15 Oct 2019 00:23:59 +0100 Subject: Annoyances from Implementation of Canonical Equivalence (was: Pure Regular Expression Engines and Literal Clusters) In-Reply-To: <83bluji300.fsf@gnu.org> References: <20191008152534.2068db6c@JRWUBU2> <20191011200158.41a948f4@JRWUBU2> <20191012020212.6db1634a@JRWUBU2> <20191012131755.7749a622@JRWUBU2> <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com> <20191012233705.52544fb9@JRWUBU2> <20191013140018.5ea512bc@JRWUBU2> <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com> <20191013201754.6597fdd0@JRWUBU2> <20191013225412.4f1772ca@JRWUBU2> <7DFB381E-EAE3-4B04-AA50-2E18E764E731@telia.com> <20191014011045.35c851e9@JRWUBU2> <83mue3kdrm.fsf@gnu.org> <20191014192939.34ea39ce@JRWUBU2> <83bluji300.fsf@gnu.org> Message-ID: <20191015002359.700a5df0@JRWUBU2> On Mon, 14 Oct 2019 21:41:19 +0300 Eli Zaretskii via Unicode wrote: > > Date: Mon, 14 Oct 2019 19:29:39 +0100 > > From: Richard Wordingham via Unicode > > The official position is that text that is canonically > > equivalent is the same. There are problem areas where traditional > > modes of expression require that canonically equivalent text be > > treated differently. For these, it is useful to have tools that > > treat them differently. However, the normal presumption should be > > that canonically equivalent text is the same. > I'm well aware of the official position. However, when we attempted > to implement it unconditionally in Emacs, some people objected, and > brought up good reasons. You can, of course, elect to disregard this > experience, and instead learn it from your own. Is there a good record of these complaints anywhere? It is annoying when a text entry function does not keep the text as one enters it, but it would be interesting to know what the other complaints were. 
(It would occasionally be useful to have an easily issued command like 'delete preceding NFD codepoint'.) I did mention above that occasionally one needs to know what codepoints were used and in what order. Richard. From unicode at unicode.org Tue Oct 15 01:43:23 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Tue, 15 Oct 2019 09:43:23 +0300 Subject: Annoyances from Implementation of Canonical Equivalence (was: Pure Regular Expression Engines and Literal Clusters) In-Reply-To: <20191015002359.700a5df0@JRWUBU2> (message from Richard Wordingham via Unicode on Tue, 15 Oct 2019 00:23:59 +0100) References: <20191008152534.2068db6c@JRWUBU2> <20191011200158.41a948f4@JRWUBU2> <20191012020212.6db1634a@JRWUBU2> <20191012131755.7749a622@JRWUBU2> <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com> <20191012233705.52544fb9@JRWUBU2> <20191013140018.5ea512bc@JRWUBU2> <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com> <20191013201754.6597fdd0@JRWUBU2> <20191013225412.4f1772ca@JRWUBU2> <7DFB381E-EAE3-4B04-AA50-2E18E764E731@telia.com> <20191014011045.35c851e9@JRWUBU2> <83mue3kdrm.fsf@gnu.org> <20191014192939.34ea39ce@JRWUBU2> <83bluji300.fsf@gnu.org> <20191015002359.700a5df0@JRWUBU2> Message-ID: <83tv8ah5kk.fsf@gnu.org> > Date: Tue, 15 Oct 2019 00:23:59 +0100 > From: Richard Wordingham via Unicode > > > I'm well aware of the official position. However, when we attempted > > to implement it unconditionally in Emacs, some people objected, and > > brought up good reasons. You can, of course, elect to disregard this > > experience, and instead learn it from your own. > > Is there a good record of these complaints anywhere? You could look up these discussions: https://lists.gnu.org/archive/html/emacs-devel/2016-02/msg00189.html https://lists.gnu.org/archive/html/emacs-devel/2016-02/msg00506.html > (It would occasionally be useful to have an easily issued command > like 'delete preceding NFD codepoint'.) I agree. 
Emacs commands that delete characters backward (usually invoked by the Backspace key) do that automatically, if the text before cursor was produced by composing several codepoints. > I did mention above that occasionally one needs to know what > codepoints were used and in what order. Sure. There's an Emacs command (C-u C-x =) which shows that information for the text at a given position. From unicode at unicode.org Tue Oct 15 14:52:15 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 15 Oct 2019 20:52:15 +0100 Subject: Annoyances from Implementation of Canonical Equivalence (was: Pure Regular Expression Engines and Literal Clusters) In-Reply-To: <83tv8ah5kk.fsf@gnu.org> References: <20191008152534.2068db6c@JRWUBU2> <20191011200158.41a948f4@JRWUBU2> <20191012020212.6db1634a@JRWUBU2> <20191012131755.7749a622@JRWUBU2> <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com> <20191012233705.52544fb9@JRWUBU2> <20191013140018.5ea512bc@JRWUBU2> <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com> <20191013201754.6597fdd0@JRWUBU2> <20191013225412.4f1772ca@JRWUBU2> <7DFB381E-EAE3-4B04-AA50-2E18E764E731@telia.com> <20191014011045.35c851e9@JRWUBU2> <83mue3kdrm.fsf@gnu.org> <20191014192939.34ea39ce@JRWUBU2> <83bluji300.fsf@gnu.org> <20191015002359.700a5df0@JRWUBU2> <83tv8ah5kk.fsf@gnu.org> Message-ID: <20191015205215.773ac298@JRWUBU2> On Tue, 15 Oct 2019 09:43:23 +0300 Eli Zaretskii via Unicode wrote: > > Date: Tue, 15 Oct 2019 00:23:59 +0100 > > From: Richard Wordingham via Unicode > > > > > I'm well aware of the official position. However, when we > > > attempted to implement it unconditionally in Emacs, some people > > > objected, and brought up good reasons. You can, of course, elect > > > to disregard this experience, and instead learn it from your > > > own. > > > > Is there a good record of these complaints anywhere? 
> > You could look up these discussions: > > https://lists.gnu.org/archive/html/emacs-devel/2016-02/msg00189.html > https://lists.gnu.org/archive/html/emacs-devel/2016-02/msg00506.html These are complaints about primary-level searches, not canonical equivalence. > > (It would occasionally be useful to have an easily issued command > > like 'delete preceding NFD codepoint'.) > > I agree. Emacs commands that delete characters backward (usually > invoked by the Backspace key) do that automatically, if the text > before cursor was produced by composing several codepoints. That's pretty standard, though it looks as though GTK has chosen to reject the principle that backwards deletion deletes the last character entered. > Sure. There's an Emacs command (C-u C-x =) which shows that > information for the text at a given position. Or commands what-cursor-position and describe-char if an emulator gets in the way. Having forward-char-intrusive would make it perfect. Richard, From unicode at unicode.org Wed Oct 16 01:33:38 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Wed, 16 Oct 2019 09:33:38 +0300 Subject: Annoyances from Implementation of Canonical Equivalence (was: Pure Regular Expression Engines and Literal Clusters) In-Reply-To: <20191015205215.773ac298@JRWUBU2> (message from Richard Wordingham via Unicode on Tue, 15 Oct 2019 20:52:15 +0100) References: <20191008152534.2068db6c@JRWUBU2> <20191011200158.41a948f4@JRWUBU2> <20191012020212.6db1634a@JRWUBU2> <20191012131755.7749a622@JRWUBU2> <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com> <20191012233705.52544fb9@JRWUBU2> <20191013140018.5ea512bc@JRWUBU2> <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com> <20191013201754.6597fdd0@JRWUBU2> <20191013225412.4f1772ca@JRWUBU2> <7DFB381E-EAE3-4B04-AA50-2E18E764E731@telia.com> <20191014011045.35c851e9@JRWUBU2> <83mue3kdrm.fsf@gnu.org> <20191014192939.34ea39ce@JRWUBU2> <83bluji300.fsf@gnu.org> <20191015002359.700a5df0@JRWUBU2> <83tv8ah5kk.fsf@gnu.org> 
<20191015205215.773ac298@JRWUBU2> Message-ID: <83imopfbct.fsf@gnu.org> > Date: Tue, 15 Oct 2019 20:52:15 +0100 > From: Richard Wordingham via Unicode > > > > > I'm well aware of the official position. However, when we > > > > attempted to implement it unconditionally in Emacs, some people > > > > objected, and brought up good reasons. You can, of course, elect > > > > to disregard this experience, and instead learn it from your > > > > own. > > > > > > Is there a good record of these complaints anywhere? > > > > You could look up these discussions: > > > > https://lists.gnu.org/archive/html/emacs-devel/2016-02/msg00189.html > > https://lists.gnu.org/archive/html/emacs-devel/2016-02/msg00506.html > > These are complaints about primary-level searches, not canonical > equivalence. Not sure what you call primary-level searches, but if you deduced the complaints were only about searches for base characters, then that's not so. They are long discussions with many sub-threads, so it might be hard to find the specific details you are looking for. However, the conclusion was very firm, and since we made the folding optional 3 years ago, we had no complaints. 
From unicode at unicode.org Wed Oct 16 20:26:35 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 17 Oct 2019 02:26:35 +0100 Subject: Annoyances from Implementation of Canonical Equivalence In-Reply-To: <83imopfbct.fsf@gnu.org> References: <20191008152534.2068db6c@JRWUBU2> <20191011200158.41a948f4@JRWUBU2> <20191012020212.6db1634a@JRWUBU2> <20191012131755.7749a622@JRWUBU2> <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com> <20191012233705.52544fb9@JRWUBU2> <20191013140018.5ea512bc@JRWUBU2> <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com> <20191013201754.6597fdd0@JRWUBU2> <20191013225412.4f1772ca@JRWUBU2> <7DFB381E-EAE3-4B04-AA50-2E18E764E731@telia.com> <20191014011045.35c851e9@JRWUBU2> <83mue3kdrm.fsf@gnu.org> <20191014192939.34ea39ce@JRWUBU2> <83bluji300.fsf@gnu.org> <20191015002359.700a5df0@JRWUBU2> <83tv8ah5kk.fsf@gnu.org> <20191015205215.773ac298@JRWUBU2> <83imopfbct.fsf@gnu.org> Message-ID: <20191017022635.301df2b7@JRWUBU2> On Wed, 16 Oct 2019 09:33:38 +0300 Eli Zaretskii via Unicode wrote: > > These are complaints about primary-level searches, not canonical > > equivalence. > > Not sure what you call primary-level searches, but if you deduced the > complaints were only about searches for base characters, then that's > not so. They are long discussions with many sub-threads, so it might > be hard to find the specific details you are looking for. The nearest I've found to complaints about including canonical equivalences are: (a) an observation that very occasionally one would need to switch canonical equivalence off. In such cases, one is not concerned with the text as such, but rather with how Unicode non-compliant processes will handle it. Compliant processes are often built out of non-compliant processes. (b) just possibly "What we have seen is that the behavior that comes from that Unicode data does not please the users very much. 
Users seem to have many different ideas of what folding is useful, and disagree with each other greatly." - https://lists.gnu.org/archive/html/emacs-devel/2016-02/msg01359.html I can't tell what (b) was talking about; it may well have been about folding or asymmetric search, as opposed to supporting canonical equivalence. (c) A search for 'n' finding 'ñ'. When it comes to canonical equivalence, one answer to (c) is that as soon as one adds the next letter, e.g. 'na', the search will no longer match 'ñ'. (This doesn't apply to diacritic-ignoring folding.) That argument doesn't work with the Polish letter 'ń', though, as it can be word-final. In programming, one might be able to prevent the issue by using 'n\b{g}', but that is a requirement of RL2.2, which doesn't seem to be high on the list of implementers' priorities, especially as it depends on properties outwith the UCD, defined in a non-ASCII file to boot. A better-supported solution is probably 'n\P{Mn}'. In many cases, the answer might be a search by collation graphemes, but that has other issues besides language sensitivity. Richard.
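[The 'n\P{Mn}' idea above can be sketched without a regex engine at all, using only the character categories in Python's standard unicodedata module. This is an illustrative approximation, not anyone's actual tooling; it uses a lookahead-style test (the next character, if any, must not be a nonspacing mark) so that a word-final bare 'n' still matches:]

```python
import unicodedata

def find_bare_n(text):
    """Return offsets (in NFD) of 'n' not followed by a nonspacing mark.

    A rough stand-in for the pattern n\\P{Mn}: after NFD normalization,
    'n' + U+0303 COMBINING TILDE is the decomposition of U+00F1, so
    requiring the next character to be a non-mark skips hits inside it.
    """
    nfd = unicodedata.normalize("NFD", text)
    hits = []
    for i, ch in enumerate(nfd):
        if ch == "n":
            nxt = nfd[i + 1] if i + 1 < len(nfd) else ""
            if nxt == "" or unicodedata.category(nxt) != "Mn":
                hits.append(i)
    return hits
```

[On "nana ñata" this reports the two bare 'n's and skips the one inside the 'ñ', whichever normalization form the input arrived in.]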
From unicode at unicode.org Thu Oct 17 02:42:19 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Thu, 17 Oct 2019 10:42:19 +0300 Subject: Annoyances from Implementation of Canonical Equivalence In-Reply-To: <20191017022635.301df2b7@JRWUBU2> (message from Richard Wordingham on Thu, 17 Oct 2019 02:26:35 +0100) References: <20191008152534.2068db6c@JRWUBU2> <20191011200158.41a948f4@JRWUBU2> <20191012020212.6db1634a@JRWUBU2> <20191012131755.7749a622@JRWUBU2> <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com> <20191012233705.52544fb9@JRWUBU2> <20191013140018.5ea512bc@JRWUBU2> <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com> <20191013201754.6597fdd0@JRWUBU2> <20191013225412.4f1772ca@JRWUBU2> <7DFB381E-EAE3-4B04-AA50-2E18E764E731@telia.com> <20191014011045.35c851e9@JRWUBU2> <83mue3kdrm.fsf@gnu.org> <20191014192939.34ea39ce@JRWUBU2> <83bluji300.fsf@gnu.org> <20191015002359.700a5df0@JRWUBU2> <83tv8ah5kk.fsf@gnu.org> <20191015205215.773ac298@JRWUBU2> <83imopfbct.fsf@gnu.org> <20191017022635.301df2b7@JRWUBU2> Message-ID: <838spjddic.fsf@gnu.org> > Date: Thu, 17 Oct 2019 02:26:35 +0100 > From: Richard Wordingham > Cc: Eli Zaretskii > > (c) A search for 'n' finding 'ñ'. > > When it comes to canonical equivalence, one answer to (c) is that as > soon as one adds the next letter, e.g. 'na', the search will no > longer match 'ñ'. Sounds arbitrary to me. How do we know that all the users will want that? > (This doesn't apply to diacritic-ignoring folding.) But the issue _was_ diacritic-ignoring folding. > That argument doesn't work with the Polish letter 'ń' though, as it can > be word-final. It actually doesn't work in general, and one factor is indeed different languages. The problem with ñ was raised by Spanish-speaking users, and only they were very much against folding in this case. Users of other languages didn't consider that a problem, and many considered it a welcome feature.
> In many cases, the answer might be a search by collation graphemes, but > that has other issues besides language sensitivity. It is also unworkable, because search has to work in contexts where the text is not displayed at all, and graphemes only exist at display time. From unicode at unicode.org Thu Oct 17 15:58:50 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 17 Oct 2019 21:58:50 +0100 Subject: Annoyances from Implementation of Canonical Equivalence In-Reply-To: <838spjddic.fsf@gnu.org> References: <20191008152534.2068db6c@JRWUBU2> <20191012020212.6db1634a@JRWUBU2> <20191012131755.7749a622@JRWUBU2> <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com> <20191012233705.52544fb9@JRWUBU2> <20191013140018.5ea512bc@JRWUBU2> <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com> <20191013201754.6597fdd0@JRWUBU2> <20191013225412.4f1772ca@JRWUBU2> <7DFB381E-EAE3-4B04-AA50-2E18E764E731@telia.com> <20191014011045.35c851e9@JRWUBU2> <83mue3kdrm.fsf@gnu.org> <20191014192939.34ea39ce@JRWUBU2> <83bluji300.fsf@gnu.org> <20191015002359.700a5df0@JRWUBU2> <83tv8ah5kk.fsf@gnu.org> <20191015205215.773ac298@JRWUBU2> <83imopfbct.fsf@gnu.org> <20191017022635.301df2b7@JRWUBU2> <838spjddic.fsf@gnu.org> Message-ID: <20191017215850.106b0475@JRWUBU2> On Thu, 17 Oct 2019 10:42:19 +0300 Eli Zaretskii via Unicode wrote: > > Date: Thu, 17 Oct 2019 02:26:35 +0100 > > From: Richard Wordingham > > Cc: Eli Zaretskii > > > > (c) A search for 'n' finding 'ñ'. > > > > When it comes to canonical equivalence, one answer to (c) is that as > > soon as one adds the next letter, e.g. 'na', the search will > > no longer match 'ñ'. > > Sounds arbitrary to me. How do we know that all the users will want > that? If the change from codepoint-by-codepoint matching is just canonical equivalence, then there is no way that the 'n' of 'na' will be matched by the 'n' within 'ñ'. > > (This doesn't apply to diacritic-ignoring folding.) > But the issue _was_ diacritic-ignoring folding.
Then we don't seem to have any evidence of user discontent arising from supporting canonical equivalence. > > That argument doesn't work with the Polish letter 'ń' though, as it > > can be word-final. > It actually doesn't work in general, and one factor is indeed > different languages. The problem with ñ was raised by > Spanish-speaking users, and only they were very much against folding > in this case. I'm not talking about folding. I'm talking about canonical equivalence, which largely but not solely consists of treating precomposed characters as the same as their *canonical* decompositions. > > In many cases, the answer might be a search by collation graphemes, > > but that has other issues besides language sensitivity. > It is also unworkable, because search has to work in contexts where > the text is not displayed at all, and graphemes only exist at display > time. The definition of a collation grapheme cluster is given in Section 9.9 of UTS#10, which is currently at Version 12.1.0. It is only connected to display at a deep level, so display time is irrelevant. Formally, it depends on a collation, though the sorting aspect is irrelevant and is removed for many 'search' collations in the CLDR. So, if one were using a Spanish collation, on typing 'n' into the incremental search string (and having it committed), the search wouldn't consider a match with 'ñ'. Then, on further typing the combining tilde, it would reject the matches it had found and choose those matches with 'ñ', whether one codepoint or two. Would that behaviour cause serious grief for incremental search? As I use an XSAMPA-based input implemented in quail that attempts to generate text in form NFC, I would type 'n~' to get the Spanish character, and so would never get an intermediate state where the incremental search was searching for 'n'. (At least, not in Emacs 25.3.1.) Richard.
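[The distinction drawn above between folding and canonical equivalence is easy to demonstrate: canonical equivalence is exactly the relation that normalization exposes. A minimal sketch in Python's standard library:]

```python
import unicodedata

precomposed = "\u00f1"   # U+00F1 LATIN SMALL LETTER N WITH TILDE
decomposed = "n\u0303"   # 'n' + U+0303 COMBINING TILDE

# Distinct codepoint sequences...
assert precomposed != decomposed
# ...but canonically equivalent: normalization maps one onto the other.
assert unicodedata.normalize("NFD", precomposed) == decomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed
```

[Any search that normalizes pattern and haystack to the same form before comparing treats the one-codepoint and two-codepoint spellings identically, which is all that "supporting canonical equivalence" asks for here; diacritic-ignoring folding is a further, separate step.]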
From unicode at unicode.org Thu Oct 17 17:11:55 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 17 Oct 2019 23:11:55 +0100 Subject: Collation Grapheme Clusters and Canonical Equivalence Message-ID: <20191017231155.0f447b26@JRWUBU2> There seems to be a Unicode non-compliance (C6) issue in the definition of collation grapheme clusters (defined in UTS#10 Section 9.9). Using the DUCET collation, the canonically equivalent strings ??? and ??? decompose into collation grapheme clusters in two different ways. The first decomposes into and and the second decomposes into and . Thus UTS#18 RL3.2 'Tailored Grapheme Clusters', namely "To meet this requirement, an implementation shall provide for collation grapheme clusters matches based on a locale's collation order", requires canonically equivalent sequences to be interpreted differently. Is this a known issue? Should I report it against UTS#10 or UTS#18? Is the phrase 'collation order' intended to preclude the use of search collations? Search collations allow one to find a collation grapheme cluster starting with U+0E15 THAI CHARACTER TO TAO in its exemplifying word ???? . DUCET splits it into , , but most (all?) CLDR search collations split it into , , , matching the division into grapheme clusters. If we accept that in the Latin script Vietnamese tone marks have primary weights (this only shows up with strings more than one syllable long), I can produce more egregious examples based on the various sequences canonically equivalent to U+1EAD LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW or to U+1EDB LATIN SMALL LETTER O WITH HORN AND ACUTE. The root of the problem is the desire to match only contiguous substrings. This does not play nicely with canonical equivalence. Richard. 
From unicode at unicode.org Fri Oct 18 01:45:14 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Fri, 18 Oct 2019 09:45:14 +0300 Subject: Annoyances from Implementation of Canonical Equivalence In-Reply-To: <20191017215850.106b0475@JRWUBU2> (message from Richard Wordingham via Unicode on Thu, 17 Oct 2019 21:58:50 +0100) References: <20191008152534.2068db6c@JRWUBU2> <20191012020212.6db1634a@JRWUBU2> <20191012131755.7749a622@JRWUBU2> <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com> <20191012233705.52544fb9@JRWUBU2> <20191013140018.5ea512bc@JRWUBU2> <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com> <20191013201754.6597fdd0@JRWUBU2> <20191013225412.4f1772ca@JRWUBU2> <7DFB381E-EAE3-4B04-AA50-2E18E764E731@telia.com> <20191014011045.35c851e9@JRWUBU2> <83mue3kdrm.fsf@gnu.org> <20191014192939.34ea39ce@JRWUBU2> <83bluji300.fsf@gnu.org> <20191015002359.700a5df0@JRWUBU2> <83tv8ah5kk.fsf@gnu.org> <20191015205215.773ac298@JRWUBU2> <83imopfbct.fsf@gnu.org> <20191017022635.301df2b7@JRWUBU2> <838spjddic.fsf@gnu.org> <20191017215850.106b0475@JRWUBU2> Message-ID: <83y2xi8scl.fsf@gnu.org> > Date: Thu, 17 Oct 2019 21:58:50 +0100 > From: Richard Wordingham via Unicode > > > Sounds arbitrary to me. How do we know that all the users will want > > that? > > If the change from codepoint by codepoint matching is just canonical > equivalence, then there is no way that the ?n? of ?na? will be matched > by the ?n? within ???. "Just canonical equivalence" is also quite arbitrary, for the user's POV. At least IME. > > > (This doesn't apply to diacritic-ignoring folding.) > > But the issue _was_ diacritic-ignoring folding. > > Then we don't seem to have any evidence of user discontent arising from > supporting canonical equivalence. Again, these are very closely related from user's POV. Most users don't understand the difference, in fact. They are not Unicode experts. So maybe I was replying on a very different level, in which case apologies for taking your time. 
From unicode at unicode.org Fri Oct 18 06:21:20 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 18 Oct 2019 12:21:20 +0100 Subject: Collation Grapheme Clusters and Canonical Equivalence In-Reply-To: <20191017231155.0f447b26@JRWUBU2> References: <20191017231155.0f447b26@JRWUBU2> Message-ID: <20191018122120.0e1f3f05@JRWUBU2> On Thu, 17 Oct 2019 23:11:55 +0100 Richard Wordingham via Unicode wrote: > There seems to be a Unicode non-compliance (C6) issue in the > definition of collation grapheme clusters (defined in UTS#10 Section > 9.9). Using the DUCET collation, the canonically equivalent strings > ??? U+0E49 THAI CHARACTER MAI THO> and ??? > decompose into collation grapheme clusters in two different ways. > The first decomposes into and and the > second decomposes into and . Correction: One has to take the collating elements in NFD order, so the tone mark (secondary weight) and the vowel (primary weight) also form a cluster, so the division into clusters is , . This split respects canonical equivalence. Replacement: Now, one form of typo one may see in Thai is where the vowel is typed twice. Thai fonts often lack mark-to-mark positioning for sequences that should not occur, so the two copies of the vowel may be overlaid. Proof-reading will not spot the mistake if the font or layout engine does not assist. Thus we can get (417,000 raw Google hits, the first 10 all good). That splits into *three* collation grapheme clusters - , and . Its canonical equivalence splits into two grapheme clusters, for to form a sequence of collating elements without skipping starting at the U+0E49, one must take all three characters. Overall, we end up with *two* collation grapheme clusters, and . 
> Thus UTS#18 RL3.2 'Tailored Grapheme Clusters', namely "To meet this > requirement, an implementation shall provide for collation grapheme > clusters matches based on a locale's collation order", requires > canonically equivalent sequences to be interpreted differently. Richard. From unicode at unicode.org Fri Oct 18 07:44:31 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 18 Oct 2019 13:44:31 +0100 Subject: Annoyances from Implementation of Canonical Equivalence In-Reply-To: <83y2xi8scl.fsf@gnu.org> References: <20191008152534.2068db6c@JRWUBU2> <20191012020212.6db1634a@JRWUBU2> <20191012131755.7749a622@JRWUBU2> <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com> <20191012233705.52544fb9@JRWUBU2> <20191013140018.5ea512bc@JRWUBU2> <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com> <20191013201754.6597fdd0@JRWUBU2> <20191013225412.4f1772ca@JRWUBU2> <7DFB381E-EAE3-4B04-AA50-2E18E764E731@telia.com> <20191014011045.35c851e9@JRWUBU2> <83mue3kdrm.fsf@gnu.org> <20191014192939.34ea39ce@JRWUBU2> <83bluji300.fsf@gnu.org> <20191015002359.700a5df0@JRWUBU2> <83tv8ah5kk.fsf@gnu.org> <20191015205215.773ac298@JRWUBU2> <83imopfbct.fsf@gnu.org> <20191017022635.301df2b7@JRWUBU2> <838spjddic.fsf@gnu.org> <20191017215850.106b0475@JRWUBU2> <83y2xi8scl.fsf@gnu.org> Message-ID: <20191018134431.13ff0238@JRWUBU2> On Fri, 18 Oct 2019 09:45:14 +0300 Eli Zaretskii via Unicode wrote: > > Date: Thu, 17 Oct 2019 21:58:50 +0100 > > From: Richard Wordingham via Unicode > > > > > Sounds arbitrary to me. How do we know that all the users will > > > want that? > > > > If the change from codepoint by codepoint matching is just canonical > > equivalence, then there is no way that the ?n? of ?na? will be > > matched by the ?n? within ???. > > "Just canonical equivalence" is also quite arbitrary, for the user's > POV. At least IME. Here's a similar issue. 
If I do an incremental search in Welsh text, entering bac (on the way to entering bach) will find words like "bach" and "bachgen" even though their third letter is 'ch', not 'c'. 'Canonical equivalence' is 'DTRT', unless you're working with systems too lazy or too primitive to DTRT. It involves treating character sequences declared to be identical in signification identically. The only pleasant justification for treating canonically equivalent sequences differently that I can think of is to treat the difference as a way of recording how the text was typed. Quite a few editing systems erase that information, and I doubt people care how someone else typed the text. Richard. From unicode at unicode.org Mon Oct 21 04:21:03 2019 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Mon, 21 Oct 2019 11:21:03 +0200 Subject: Coding for Emoji: how to modify programs to work with emoji Message-ID: FYI, here is my presentation from the IUC43: http://bit.ly/iuc43davis Mark From unicode at unicode.org Tue Oct 22 02:37:22 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 22 Oct 2019 08:37:22 +0100 Subject: Coding for Emoji: how to modify programs to work with emoji In-Reply-To: References: Message-ID: <20191022083722.6b456367@JRWUBU2> On Mon, 21 Oct 2019 11:21:03 +0200 Mark Davis ☕️ via Unicode wrote: > FYI, here is my presentation from the IUC43: http://bit.ly/iuc43davis When it comes to the second sentence of the text of Slide 7 'Grapheme Clusters', my overwhelming reaction is one of extreme anger. Slide 8 does nothing to lessen the offence. The problem is that it gives the impression that in general it is acceptable for backspace to delete the whole grapheme cluster. Richard.
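[The Welsh behaviour described earlier in this message - 'bac' still finding 'bach', because 'c' is a prefix of the collating element 'ch' - can be mimicked with a toy contraction-aware prefix match. This is a hypothetical sketch, not a real tailored collation; a production implementation would use a collation library such as ICU:]

```python
# Hypothetical: treat 'ch' as a single collating element, as Welsh does.
CONTRACTIONS = {"ch"}

def collating_elements(s):
    """Split s into collating elements, greedily taking contractions."""
    out, i = [], 0
    while i < len(s):
        if s[i:i + 2] in CONTRACTIONS:
            out.append(s[i:i + 2])
            i += 2
        else:
            out.append(s[i])
            i += 1
    return out

def prefix_matches(query, word):
    """True if the query could still grow into a match for word: every
    complete collating element of the query must match exactly, and the
    trailing, possibly partial, element may match the start of the
    word's element at that position."""
    q, w = collating_elements(query), collating_elements(word)
    if len(q) > len(w):
        return False
    for qe, we in zip(q[:-1], w):
        if qe != we:
            return False
    return q == [] or w[len(q) - 1].startswith(q[-1])
```

[With this, an incremental search for 'bac' keeps 'bachgen' alive as a candidate, while 'bad' rules it out, matching the behaviour Richard describes.]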
From unicode at unicode.org Tue Oct 22 04:04:01 2019 From: unicode at unicode.org (=?utf-8?Q?Daniel_B=C3=BCnzli?= via Unicode) Date: Tue, 22 Oct 2019 11:04:01 +0200 Subject: Grapheme clusters and backspace (was Re: Coding for Emoji: how to modify programs to work with emoji) In-Reply-To: <20191022083722.6b456367@JRWUBU2> References: <20191022083722.6b456367@JRWUBU2> Message-ID: On 22 October 2019 at 09:37:22, Richard Wordingham via Unicode (unicode at unicode.org) wrote: > When it comes to the second sentence of the text of Slide 7 'Grapheme > Clusters', my overwhelming reaction is one of extreme anger. Slide 8 > does nothing to lessen the offence. The problem is that it gives the > impression that in general it is acceptable for backspace to delete the > whole grapheme cluster. Let's turn extreme anger into knowledge. I'm not very knowledgeable in ligature-heavy scripts (I suspect that's what you refer to), and what you describe is the first thing I went with for a readline editor data structure. Would you maybe care to expand on when exactly you think it's not acceptable, and what kind of tools or standards I can find in the Unicode toolbox to implement an acceptable behaviour for backspace on general Unicode text? Best, Daniel From unicode at unicode.org Tue Oct 22 05:18:06 2019 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Tue, 22 Oct 2019 12:18:06 +0200 Subject: Coding for Emoji: how to modify programs to work with emoji In-Reply-To: <20191022083722.6b456367@JRWUBU2> References: <20191022083722.6b456367@JRWUBU2> Message-ID: That sentence is specific to Emoji sequences. I added a note to make it clear that the behavior of backspace for combining marks may be language or script dependent. BTW, the speaker notes were added quickly; feedback on them is welcome. Mark On Tue, Oct 22, 2019 at 9:41 AM Richard Wordingham via Unicode < unicode at unicode.org> wrote: > On Mon, 21 Oct 2019 11:21:03 +0200 > Mark Davis ☕️
via Unicode wrote: > > FYI, here is my presentation from the IUC43: http://bit.ly/iuc43davis > > When it comes to the second sentence of the text of Slide 7 'Grapheme > Clusters', my overwhelming reaction is one of extreme anger. Slide 8 > does nothing to lessen the offence. The problem is that it gives the > impression that in general it is acceptable for backspace to delete the > whole grapheme cluster. > > Richard. From unicode at unicode.org Tue Oct 22 15:44:10 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 22 Oct 2019 21:44:10 +0100 Subject: Grapheme clusters and backspace (was Re: Coding for Emoji: how to modify programs to work with emoji) In-Reply-To: References: <20191022083722.6b456367@JRWUBU2> Message-ID: <20191022214410.6020c96b@JRWUBU2> On Tue, 22 Oct 2019 11:04:01 +0200 Daniel Bünzli via Unicode wrote: > On 22 October 2019 at 09:37:22, Richard Wordingham via Unicode > (unicode at unicode.org) wrote: > > > When it comes to the second sentence of the text of Slide 7 > > 'Grapheme Clusters', my overwhelming reaction is one of extreme > > anger. Slide 8 does nothing to lessen the offence. The problem is > > that it gives the impression that in general it is acceptable for > > backspace to delete the whole grapheme cluster. > > Let's turn extreme anger into knowledge. > > I'm not very knowledgeable in ligature-heavy scripts (I suspect that's > what you refer to) and what you describe is the first thing I went > with for a readline editor data structure. Not necessarily ligature-heavy, but heavy in combining characters. Examples at the light end include IPA and pointed Hebrew. The Thai script is another fairly well-known one, but Siamese itself doesn't use more than two marks on a consonant. (The vowel marks before and after don't count - they work like letters.)
> Would maybe care to expand when exactly you think it's not acceptable > and what kind of tools or standard I can find the Unicode toolbox to > implement an acceptable behaviour for backspace on general Unicode > text.? The compromise that has generally been reached is that 'delete' deletes a grapheme cluster and 'backspace' deletes a scalar value. (There are good editors like Emacs that delete only a single character.) The rationale for this is that backspace undoes the effect of a keystroke. For a perfect match, the keyboard would need to handle the backspace - and everyone editing the text would have to use compatible keyboards! That's not a very plausible scenario for a Wikipedia article. Now, deleting the last character is not very Unicode compliant; there is a family of keyboard designs in development that by default deletes the last character in NFC form if it is precomposed and otherwise the last character in NFD forms. UTS#35 Issue 36 Part 7 Section 5.21 allows for more elaborate behaviours. I would contend that deleting the last character is the best simple approximation. However, it's not impossible for a dead key implementation to decide that dead acute plus 'e' should be emitted as two characters, even though its more usual for it to be emitted as a single character. Now, there are cases where one may be unlikely to type a single character. I can imagine a variation sequence or being implemented as a 'ligature', i.e. a single stroke (or IME selection action) yielding the entry of a base character plus variation selector. Emoji may be another, though I must say I would probably enter a regional indicator pair as two characters, and expect to be able to delete just the last if I made an error, contra Davis 2019. While stacker + consonant might be expected to be a unit, the original designs envisaged them being a sequence. Additionally, I would expect an edit to change the subscripted consonant rather than remove it. 
In this case, delete last character and delete grapheme cluster agree for the language-independent rules. Richard. From unicode at unicode.org Tue Oct 22 16:27:27 2019 From: unicode at unicode.org (=?utf-8?Q?Daniel_B=C3=BCnzli?= via Unicode) Date: Tue, 22 Oct 2019 23:27:27 +0200 Subject: Grapheme clusters and backspace (was Re: Coding for Emoji: how to modify programs to work with emoji) In-Reply-To: <20191022214410.6020c96b@JRWUBU2> References: <20191022083722.6b456367@JRWUBU2> <20191022214410.6020c96b@JRWUBU2> Message-ID: Thanks for your answer. > The compromise that has generally been reached is that 'delete' > deletes a grapheme cluster and 'backspace' deletes a scalar value. > (There are good editors like Emacs that delete only a single > character.) Just to make things clear. When you say character in your message, you consistently mean scalar value, right? Best, Daniel From unicode at unicode.org Tue Oct 22 17:32:31 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 22 Oct 2019 23:32:31 +0100 Subject: Grapheme clusters and backspace (was Re: Coding for Emoji: how to modify programs to work with emoji) In-Reply-To: References: <20191022083722.6b456367@JRWUBU2> <20191022214410.6020c96b@JRWUBU2> Message-ID: <20191022233231.441d2af1@JRWUBU2> On Tue, 22 Oct 2019 23:27:27 +0200 Daniel Bünzli via Unicode wrote: > Thanks for your answer. > > > The compromise that has generally been reached is that 'delete' > > deletes a grapheme cluster and 'backspace' deletes a scalar value. > > (There are good editors like Emacs that delete only a single > > character.) > > Just to make things clear. When you say character in your message, > you consistently mean scalar value, right? Yes. I find it hard to imagine that having to type them doesn't endow them with some sort of reality in the users' minds, though some, such as invisible stackers, are probably envisaged as control characters.
One does come across some odd entry methods, such as typing an Indic akshara using the Latin script and then entering it as a whole. That is no more conducive to seeing the constituents as characters than is typing wab- to get the hieroglyph ??. Richard. From unicode at unicode.org Tue Oct 22 18:15:57 2019 From: unicode at unicode.org (=?utf-8?B?TWFydGluIEouIETDvHJzdA==?= via Unicode) Date: Tue, 22 Oct 2019 23:15:57 +0000 Subject: Grapheme clusters and backspace (was Re: Coding for Emoji: how to modify programs to work with emoji) In-Reply-To: <20191022233231.441d2af1@JRWUBU2> References: <20191022083722.6b456367@JRWUBU2> <20191022214410.6020c96b@JRWUBU2> <20191022233231.441d2af1@JRWUBU2> Message-ID: <3e07968d-33a5-9864-570c-c365022883ce@it.aoyama.ac.jp> Hello Richard, others, On 2019/10/23 07:32, Richard Wordingham via Unicode wrote: > On Tue, 22 Oct 2019 23:27:27 +0200 > Daniel Bünzli via Unicode wrote: >> Just to make things clear. When you say character in your message, >> you consistently mean scalar value, right? > > Yes. > > I find it hard to imagine that having to type them doesn't endow them > with some sort of reality in the users' minds, though some, such as > invisible stackers, are probably envisaged as control characters. I think this to some extent is a question of "reality in the users' minds". But to a very large extent, this is an issue of muscle memory. If a user works with a keyboard/input method that deletes a whole combination, their muscles will get used to that the same way they will get used to the other case. Users are perfectly capable of talking about characters and in the same sentence use that word once for something like individual codepoints and later for a whole combination. > One does come across some odd entry methods, such as typing an Indic > akshara using the Latin script and then entering it as a whole. That > is no more conducive to seeing the constituents as characters than is > typing wab- to get the hieroglyph ??.
The input of Japanese Kana is usually done from a Latin keyboard. As an example, to input the syllable "ka" (か), one presses the keys for 'k' and 'a'. In all the IMEs I have used, a backspace deletes the whole "か", not only the 'a'. One has to get used to it (I still occasionally want to press two backspaces when realizing I made a typo), but one gets used to it. There are also cases such as "kya" → "きゃ", where the three Latin keyboard presses cannot be allocated 2-1 or 1-2 to the two resulting Hiragana. In a sophisticated implementation, a backspace could go from "きゃ" to "ky", but that would only work immediately after input. Of course, for Japanese input, Latin → Kana is only the first layer, the second layer is Kana → Kanji. Regards, Martin. From unicode at unicode.org Tue Oct 22 21:31:09 2019 From: unicode at unicode.org (Ben Morphett via Unicode) Date: Wed, 23 Oct 2019 02:31:09 +0000 Subject: Unicode Digest, Vol 70, Issue 17 In-Reply-To: References: Message-ID: It totally depends on the editor. In Notepad++, when I backspace over "Man Teacher: Dark Skin Tone", I get "Man Teacher: Dark Skin Tone" => "Man: Dark Skin Tone" => gone. In the Outlook e-mail editor, I get ??????? ????? ?? -- Cheers, Ben Morphett -----Original Message----- From: Richard Wordingham via Unicode Sent: Tuesday, 22 October 2019 6:37 PM To: unicode at unicode.org Subject: Re: Coding for Emoji: how to modify programs to work with emoji On Mon, 21 Oct 2019 11:21:03 +0200 Mark Davis ?? via Unicode wrote: > FYI, here is my presentation from the IUC43: http://bit.ly/iuc43davis When it comes to the second sentence of the text of Slide 7 'Grapheme Clusters', my overwhelming reaction is one of extreme anger. Slide 8 does nothing to lessen the offence. The problem is that it gives the impression that in general it is acceptable for backspace to delete the whole grapheme cluster. Richard.
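Martin's "kya" example above can be sketched with a toy greedy romaji-to-hiragana converter. The table is an illustrative subset only, and real IMEs convert incrementally and statefully rather than in one pass.

```python
# Illustrative subset of a romaji-to-hiragana table.
ROMAJI = {"a": "あ", "ka": "か", "ki": "き", "kya": "きゃ"}

def to_kana(romaji: str) -> str:
    """Greedy longest-match conversion of a romaji string to hiragana."""
    out, i = "", 0
    while i < len(romaji):
        for n in (3, 2, 1):                 # try longest candidates first
            chunk = romaji[i:i + n]
            if chunk in ROMAJI:
                out += ROMAJI[chunk]
                i += len(chunk)
                break
        else:
            out += romaji[i]                # pass unconvertible input through
            i += 1
    return out

# Three key presses become two kana: no 2-1 or 1-2 split is possible.
assert to_kana("kya") == "きゃ"
assert len(to_kana("kya")) == 2
```

This is why an IME's backspace has to pick its own unit: the keystroke history and the kana output don't line up one-to-one.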
From unicode at unicode.org Wed Oct 23 03:02:44 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 23 Oct 2019 09:02:44 +0100 Subject: Grapheme clusters and backspace (was Re: Coding for Emoji: how to modify programs to work with emoji) In-Reply-To: <3e07968d-33a5-9864-570c-c365022883ce@it.aoyama.ac.jp> References: <20191022083722.6b456367@JRWUBU2> <20191022214410.6020c96b@JRWUBU2> <20191022233231.441d2af1@JRWUBU2> <3e07968d-33a5-9864-570c-c365022883ce@it.aoyama.ac.jp> Message-ID: <20191023090244.05b22cf4@JRWUBU2> On Tue, 22 Oct 2019 23:15:57 +0000 Martin J. Dürst via Unicode wrote: > I think this to some extent is a question of "reality in the users' > minds". But to a very large extent, this is an issue of muscle > memory. If a user works with a keyboard/input method that deletes a > whole combination, their muscles will get used to that the same way > they will get used to the other case. The issue is one of being able to edit the cluster. Large clusters call out for editing rather than replacement. Richard. From unicode at unicode.org Wed Oct 23 11:39:04 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 23 Oct 2019 17:39:04 +0100 Subject: Grapheme clusters & backspace (was: Unicode Digest, Vol 70, Issue 17) In-Reply-To: References: Message-ID: <20191023173904.4be31e5e@JRWUBU2> On Wed, 23 Oct 2019 02:31:09 +0000 Ben Morphett via Unicode wrote: > It totally depends on the editor. In Notepad++, when I backspace > over "Man Teacher: Dark Skin Tone", I get "Man Teacher: Dark Skin > Tone" => "Man: Dark Skin Tone" => gone. In MS Word 2016 on Windows 10, I get an intermediate stage of “Man: Dark Skin ZWJ”, which is comparable to my suggestion that only the consonant be deleted from a sequence of Indic stacker + consonant, even though it be very similar to a unitary consonant sign. The main difference in the Indic pair is that there is a (misplaced) grapheme cluster boundary in the former.
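The intermediate stage Richard describes falls straight out of scalar-value backspace over the ZWJ sequence; a quick sketch (the comments describe rendering behavior, which of course varies by application):

```python
# U+1F468 MAN + U+1F3FF DARK SKIN TONE + U+200D ZWJ + U+1F3EB SCHOOL:
# the ZWJ sequence for "man teacher: dark skin tone".
seq = "\U0001F468\U0001F3FF\u200d\U0001F3EB"

states = []
while seq:
    seq = seq[:-1]        # one scalar-value backspace per key press
    states.append(seq)

# The first press strips the school, leaving man + skin tone + a dangling
# ZWJ -- the "Man: Dark Skin ZWJ" intermediate stage, joiner included.
assert states[0] == "\U0001F468\U0001F3FF\u200d"
```

A cluster-deleting backspace would instead jump from the full sequence to the empty string in one press, since the whole sequence is a single extended grapheme cluster.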
Mark Davis has proclaimed that all these emoji behaviours are WRONG. What is wrong is that the ZWJ may go missing with copy and paste, as I found between Word and plain Notepad. Richard. From unicode at unicode.org Tue Oct 29 17:36:21 2019 From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode) Date: Tue, 29 Oct 2019 22:36:21 +0000 (GMT) Subject: New Public Review on QID emoji Message-ID: Hello everyone I have recently learned that there is a new Public Review Issue on QID emoji. https://www.unicode.org/review/pri408/ Also the closure date for PRI 405 has been given an extension. http://www.unicode.org/review/pri405/ https://www.unicode.org/review/ William Overington Tuesday 29 October 2019 From unicode at unicode.org Wed Oct 30 12:41:16 2019 From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode) Date: Wed, 30 Oct 2019 17:41:16 +0000 (GMT) Subject: New Public Review on QID emoji Message-ID: <5eff0ea4.a4d.16e1dc1e66b.Webtop.52@btinternet.com> Hello everyone I have been reading about QID emoji and what is proposed. At present I have a question to which I cannot find the answer. Is the QID emoji format, if approved by the Unicode Technical Committee going to be sent to the ISO/IEC 10646 committee for consideration by that committee? As the QID emoji format is in a Unicode Technical Standard and does not include the encoding of any new _atomic_ characters, I am concerned that the answer to the above question may well be along the lines of "No" maybe with some reasoning as to why not. Yet will a QID emoji essentially be _de facto_ a character even if not _de jure_ a character? For a QID emoji will not just be "markup using existing characters from the ISO/IEC 10646 standard that is synchronized with Unicode", such as would be a markup that anyone could devise for use in his or her research and experimentation or indeed some public use, it will be a Unicode Inc. 
endorsed "whatever" that is very closely linked to The Unicode Standard even if not deemed to be part of it. As I understand the situation, in some countries people take no (formal) notice as such of The Unicode Standard but rely solely on ISO/IEC 10646. Often this may well present no practical problems in information technology and its applications because the two standards are synchronized each with the other. Yet if QID emoji are implemented by Unicode Inc. without also being implemented by ISO/IEC 10646 then that could lead to future problems, notwithstanding any _de jure_ situation that QID emoji are not characters, because they will be much more than Private Use characters yet less than characters that are in ISO/IEC 10646. I am in favour of the encoding of the QID emoji mechanism and its practical application. However I wonder about what are the consequences for interoperability and communication if QID emoji become used - maybe quite widely - and yet the tag sequences are not discernable in meaning from ISO/IEC 10646 or any related ISO/IEC documents. William Overington Wednesday 30 October 2019 From unicode at unicode.org Wed Oct 30 14:18:44 2019 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Wed, 30 Oct 2019 12:18:44 -0700 Subject: New Public Review on QID emoji In-Reply-To: <5eff0ea4.a4d.16e1dc1e66b.Webtop.52@btinternet.com> References: <5eff0ea4.a4d.16e1dc1e66b.Webtop.52@btinternet.com> Message-ID: <3d02402e-6ab0-2417-8c23-c958f3ae9092@sonic.net> On 10/30/2019 10:41 AM, wjgo_10009 at btinternet.com via Unicode wrote: > > At present I have a question to which I cannot find the answer. > > Is the QID emoji format, if approved by the Unicode Technical > Committee going to be sent to the ISO/IEC 10646 committee for > consideration by that committee? No. 
> > As the QID emoji format is in a Unicode Technical Standard and does > not include the encoding of any new _atomic_ characters, I am > concerned that the answer to the above question may well be along the > lines of "No" maybe with some reasoning as to why not. As you surmised. > > Yet will a QID emoji essentially be _de facto_ a character even if not > _de jure_ a character? That distinction is effectively meaningless. There are any number of entities that end users perceive as "characters", which are not represented by a single code point in the Unicode Standard (or 10646) -- and this has been the case now for decades. > > > Yet if QID emoji are implemented by Unicode Inc. without also being > implemented by ISO/IEC 10646 then that could lead to future problems, > notwithstanding any _de jure_ situation that QID emoji are not > characters, because they will be much more than Private Use characters > yet less than characters that are in ISO/IEC 10646. What you are missing is that *many* emoji are already represented by sequences of characters. See emoji modifier sequences, emoji flag sequences, emoji ZWJ sequences. *None* of those are specified in 10646, have not been for years now, and never will be. And yet, there is no de jure standardization crisis here, or any interoperability issue for emoji arising from that situation. > > I am in favour of the encoding of the QID emoji mechanism and its > practical application. However I wonder about what are the > consequences for interoperability and communication if QID emoji > become used - maybe quite widely - and yet the tag sequences are not > discernable in meaning from ISO/IEC 10646 or any related ISO/IEC > documents. There may well be interoperability concerns specifically for the QID emoji mechanism, but that would be an issue pertaining to the architecture of that mechanism specifically. It isn't anything to do with the relationship between the Unicode Standard (and UTS #51) and ISO/IEC 10646. --Ken
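Ken's point that many emoji are already sequences can be illustrated with an emoji tag sequence from UTS #51: the flag of Scotland is a single emoji for the user, but seven scalar values on the wire, and no single code point in either the Unicode Standard or 10646.

```python
# RGI emoji tag sequence for a subdivision flag (UTS #51): U+1F3F4 WAVING
# BLACK FLAG, TAG characters spelling the subdivision code, U+E007F CANCEL TAG.
BASE, CANCEL_TAG = "\U0001F3F4", "\U000E007F"

def subdivision_flag(code: str) -> str:
    # each ASCII letter maps to the TAG character at U+E0000 + its code point
    return BASE + "".join(chr(0xE0000 + ord(c)) for c in code.lower()) + CANCEL_TAG

scotland = subdivision_flag("gbsct")
assert len(scotland) == 7   # seven scalar values, one emoji for the user
```

The proposed QID emoji mechanism under review in PRI #408 would reuse this same tag-character machinery with Wikidata QIDs as the tag payload, which is why it likewise would not add atomic characters to either standard.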