From unicode at unicode.org  Thu Oct  3 10:12:51 2019
From: unicode at unicode.org (Johannes Bergerhausen via Unicode)
Date: Thu, 3 Oct 2019 17:12:51 +0200
Subject: worldswritingsystems.org
Message-ID: <8295C9C0-05BB-44F6-88D9-21B1B34AD902@bergerhausen.com>

Dear list,

FYI: we?ve updated http://worldswritingsystems.org to Unicode 12.1 and fixed a few little bugs and errors.

All the best,
Johannes


? Helmig Bergerhausen
Gladbacher Stra?e 40, D-50672 K?ln, Germany
www.helmigbergerhausen.de

? Prof. Bergerhausen
Hochschule Mainz, School of Design
Holzstra?e 36, D-55116 Mainz, Germany
www.worldswritingsystems.org
www.decodeunicode.org
www.designlabor-gutenberg.de
www.hs-mainz.de/gestaltung

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20191003/07311ce1/attachment.html>

From unicode at unicode.org  Thu Oct  3 11:53:37 2019
From: unicode at unicode.org (Markus Scherer via Unicode)
Date: Thu, 3 Oct 2019 09:53:37 -0700
Subject: Manipuri/Meitei customary writing system
Message-ID: <CAN49p6qScAk99iUvxSj3Gv=XpGQ-X5oC1aa-aS1cHx+v=gr+jA@mail.gmail.com>

Dear Unicoders,

Is Manipuri/Meitei customarily written in Bangla/Bengali script or
in Meitei script?

I am looking at
https://en.wikipedia.org/wiki/Meitei_language#Writing_systems which seems
to describe writing practice in transition, and I can't quite tell where it
stands.

Is the use of the Meitei script aspirational or customary?
Which script is being used for major newspapers, popular books, and video
captions?

Thanks,
markus
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20191003/8187b85b/attachment.html>

From unicode at unicode.org  Fri Oct  4 01:35:09 2019
From: unicode at unicode.org (=?utf-8?B?TWFydGluIEouIETDvHJzdA==?= via Unicode)
Date: Fri, 4 Oct 2019 06:35:09 +0000
Subject: Manipuri/Meitei customary writing system
In-Reply-To: <CAN49p6qScAk99iUvxSj3Gv=XpGQ-X5oC1aa-aS1cHx+v=gr+jA@mail.gmail.com>
References: <CAN49p6qScAk99iUvxSj3Gv=XpGQ-X5oC1aa-aS1cHx+v=gr+jA@mail.gmail.com>
Message-ID: <86f9a05f-8b00-1992-5afd-d07ec834df47@it.aoyama.ac.jp>

Hello Markus,

On 2019/10/04 01:53, Markus Scherer via Unicode wrote:
> Dear Unicoders,
> 
> Is Manipuri/Meitei customarily written in Bangla/Bengali script or
> in Meitei script?
> 
> I am looking at
> https://en.wikipedia.org/wiki/Meitei_language#Writing_systems which seems
> to describe writing practice in transition, and I can't quite tell where it
> stands.
> 
> Is the use of the Meitei script aspirational or customary?
> Which script is being used for major newspapers, popular books, and video
> captions?

This may give you some more information:
https://www.atypi.org/conferences/tokyo-2019/programme/activity?a=906https://www.youtube.com/watch?v=S8XxVZkfUkk

It's a recent talk at ATypI in Tokyo (sponsored by Google, among others).

Regards,   Martin.


From unicode at unicode.org  Fri Oct  4 02:12:59 2019
From: unicode at unicode.org (=?utf-8?B?TWFydGluIEouIETDvHJzdA==?= via Unicode)
Date: Fri, 4 Oct 2019 07:12:59 +0000
Subject: Manipuri/Meitei customary writing system
In-Reply-To: <86f9a05f-8b00-1992-5afd-d07ec834df47@it.aoyama.ac.jp>
References: <CAN49p6qScAk99iUvxSj3Gv=XpGQ-X5oC1aa-aS1cHx+v=gr+jA@mail.gmail.com>
 <86f9a05f-8b00-1992-5afd-d07ec834df47@it.aoyama.ac.jp>
Message-ID: <37a2689d-50f0-84f9-40a4-92e97e670325@it.aoyama.ac.jp>

On 2019/10/04 15:35, Martin J. D?rst via Unicode wrote:
> Hello Markus,
> 
> On 2019/10/04 01:53, Markus Scherer via Unicode wrote:
>> Dear Unicoders,
>>
>> Is Manipuri/Meitei customarily written in Bangla/Bengali script or
>> in Meitei script?
>>
>> I am looking at
>> https://en.wikipedia.org/wiki/Meitei_language#Writing_systems which seems
>> to describe writing practice in transition, and I can't quite tell where it
>> stands.
>>
>> Is the use of the Meitei script aspirational or customary?
>> Which script is being used for major newspapers, popular books, and video
>> captions?
> 
> This may give you some more information:
> https://www.atypi.org/conferences/tokyo-2019/programme/activity?a=906

Sorry, this should have been two separate URIs (about the same talk).

> https://www.youtube.com/watch?v=S8XxVZkfUkk
> 
> It's a recent talk at ATypI in Tokyo (sponsored by Google, among others).
> 
> Regards,   Martin.
> 


From unicode at unicode.org  Fri Oct  4 16:02:57 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Fri, 4 Oct 2019 22:02:57 +0100
Subject: Manipuri/Meitei customary writing system
In-Reply-To: <37a2689d-50f0-84f9-40a4-92e97e670325@it.aoyama.ac.jp>
References: <CAN49p6qScAk99iUvxSj3Gv=XpGQ-X5oC1aa-aS1cHx+v=gr+jA@mail.gmail.com>
 <86f9a05f-8b00-1992-5afd-d07ec834df47@it.aoyama.ac.jp>
 <37a2689d-50f0-84f9-40a4-92e97e670325@it.aoyama.ac.jp>
Message-ID: <20191004220257.2ea735df@JRWUBU2>

On Fri, 4 Oct 2019 07:12:59 +0000
Martin J. D?rst via Unicode <unicode at unicode.org> wrote:

> On 2019/10/04 15:35, Martin J. D?rst via Unicode wrote:
> > Hello Markus,
> > 
> > On 2019/10/04 01:53, Markus Scherer via Unicode wrote:  
> >> Dear Unicoders,
> >>
> >> Is Manipuri/Meitei customarily written in Bangla/Bengali script or
> >> in Meitei script?
> >>
> >> I am looking at
> >> https://en.wikipedia.org/wiki/Meitei_language#Writing_systems
> >> which seems to describe writing practice in transition, and I
> >> can't quite tell where it stands.
> >>
> >> Is the use of the Meitei script aspirational or customary?
> >> Which script is being used for major newspapers, popular books,
> >> and video captions?  
> > 
> > This may give you some more information:
> > https://www.atypi.org/conferences/tokyo-2019/programme/activity?a=906  
> 
> Sorry, this should have been two separate URIs (about the same talk).
> 
> > https://www.youtube.com/watch?v=S8XxVZkfUkk
> > 
> > It's a recent talk at ATypI in Tokyo (sponsored by Google, among
> > others).

So newspaper sales tell us that the Bengali script is still the *usual*
script for the language.  Is that a different question to
what the 'customary' script is?

Richard.


From unicode at unicode.org  Fri Oct  4 18:11:50 2019
From: unicode at unicode.org (Markus Scherer via Unicode)
Date: Fri, 4 Oct 2019 16:11:50 -0700
Subject: Manipuri/Meitei customary writing system
In-Reply-To: <20191004220257.2ea735df@JRWUBU2>
References: <CAN49p6qScAk99iUvxSj3Gv=XpGQ-X5oC1aa-aS1cHx+v=gr+jA@mail.gmail.com>
 <86f9a05f-8b00-1992-5afd-d07ec834df47@it.aoyama.ac.jp>
 <37a2689d-50f0-84f9-40a4-92e97e670325@it.aoyama.ac.jp>
 <20191004220257.2ea735df@JRWUBU2>
Message-ID: <CAN49p6qPcBEeUjh0KAs4E2u=FsphotGJRBWTMtYmdEqV_xNkdA@mail.gmail.com>

On Fri, Oct 4, 2019 at 2:05 PM Richard Wordingham via Unicode <
unicode at unicode.org> wrote:

> > >> Is the use of the Meitei script aspirational or customary?
> > >> Which script is being used for major newspapers, popular books,
> > >> and video captions?
> > >
> > > This may give you some more information:
> > > https://www.atypi.org/conferences/tokyo-2019/programme/activity?a=906
>
> >
> > Sorry, this should have been two separate URIs (about the same talk).
> >
> > > https://www.youtube.com/watch?v=S8XxVZkfUkk
> > >
> > > It's a recent talk at ATypI in Tokyo (sponsored by Google, among
> > > others).
>
> So newspaper sales tell us that the Bengali script is still the *usual*
> script for the language.


Yes. FYI in the video, the relevant part is at 14:04-14:34.
My transcription:

"Due to the lack of readership of Meetei Mayek, local newspapers continue
to use Bengali script. On 21st September 2008, Hueiyen Lanpao, a newspaper
company, published the first Meetei Mayek newspaper set entirely using
Meetei Mayek script. Although there have been small columns for Meetei
Mayek in other newspapers, Hueiyen Lanpao is still the only local newspaper
in all of Manipur to be printed using Meetei Mayek script till date."


Earlier the presenter says that Bengali is starting to disappear from
public signage.

Is that a different question to what the 'customary' script is?
>

To me, things like newspapers are among the most indicative of customary
use.

>From what I understand, someone who wants to support this language should
prepare to support both Beng and Mtei, with emphasis on Beng now and Mtei
later.

markus
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20191004/08a41aad/attachment.html>

From unicode at unicode.org  Sun Oct  6 13:57:36 2019
From: unicode at unicode.org (=?utf-8?B?5qKB5rW3IExpYW5nIEhhaQ==?= via Unicode)
Date: Sun, 6 Oct 2019 13:57:36 -0500
Subject: =?utf-8?Q?Alternative_encodings_for_Malayalam_=E2=80=9Cnta?=
 =?utf-8?Q?=E2=80=9D?=
Message-ID: <49FE1395-9719-45A4-B3F6-0DACF22D313D@gmail.com>

Folks,

(Microsoft Peter and Andrew, search for ?Windows? in the document.)

(Asmus, in the document there?s a section 5, ICANN RZ-LGR situation?let me know if there?s some news.)

This is a pretty straightforward document about the notoriously problematic encoding of Malayalam <chillu n, bottom-side sign of rra>. I always wanted to properly document this, so finally here it is:

L2/19-345 <http://www.unicode.org/cgi-bin/GetMatchingDocs.pl?L2/19-345>
Alternative encodings for Malayalam "nta"
Liang Hai
2019-10-06

Unfortunately, as <NA, VIRAMA, RRA> has already become the de facto standard encoding, now we have to recognize it in the Core Spec. It?s a bit like another Tamil sr? situation.

An excerpt of the proposal:

Document the following widely used encoding in the Core Specification as an alternative representation for Malayalam [glyph] (<chillu n, bottom-side sign of rra>) that is a special case and does not suggest any productive rule in the encoding model:

<U+0D28 ? MALAYALAM LETTER NA, U+0D4D ?? MALAYALAM SIGN VIRAMA, U+0D31 ? MALAYALAM LETTER RRA>

Best,
?? Liang Hai
https://lianghai.github.io

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20191006/4401b876/attachment.html>

From unicode at unicode.org  Sun Oct  6 16:08:13 2019
From: unicode at unicode.org (Cibu via Unicode)
Date: Sun, 6 Oct 2019 22:08:13 +0100
Subject: =?UTF-8?Q?Re=3A_Alternative_encodings_for_Malayalam_=E2=80=9Cnta=E2=80=9D?=
In-Reply-To: <49FE1395-9719-45A4-B3F6-0DACF22D313D@gmail.com>
References: <49FE1395-9719-45A4-B3F6-0DACF22D313D@gmail.com>
Message-ID: <CAD8TiP6tEs6OCUrSu9jR=zZCrZ1arY8tRdy-T43yXVRm-NNycQ@mail.gmail.com>

Thanks for addressing this. Here is my response:
https://docs.google.com/document/d/1K6L82VRmCGc9Fb4AOitNk4MT7Nu4V8aKUJo_1mW5X1o/

In summary, my take is:

The sequence <NA, VIRAMA, RRA> for ??? (<<chillu N, subscript RRA>>) should
not be legitimized as an alternate encoding; but should be recognized as a
prevailing non-standard legacy encoding.


On Sun, Oct 6, 2019 at 7:57 PM ?? Liang Hai <lianghai at gmail.com> wrote:

> Folks,
>
> (Microsoft Peter and Andrew, search for ?Windows? in the document.)
>
> (Asmus, in the document there?s a section 5, *ICANN RZ-LGR situation*?let
> me know if there?s some news.)
>
> This is a pretty straightforward document about the notoriously
> problematic encoding of Malayalam <*chillu n*, bottom-side sign of *rra*>.
> I always wanted to properly document this, so finally here it is:
>
> L2/19-345 <http://www.unicode.org/cgi-bin/GetMatchingDocs.pl?L2/19-345>
> *Alternative encodings for Malayalam "nta"*
> Liang Hai
> 2019-10-06
>
>
> Unfortunately, as <NA, VIRAMA, RRA> has already become the de facto
> standard encoding, now we have to recognize it in the Core Spec. It?s a bit
> like another Tamil *sr?* situation.
>
> An excerpt of the proposal:
>
> Document the following widely used encoding in the Core Specification as
> an alternative representation for Malayalam [glyph] (<chillu n, bottom-side
> sign of rra>) that is a special case and does not suggest any productive
> rule in the encoding model:
>
> <U+0D28 ? MALAYALAM LETTER NA, U+0D4D ?? MALAYALAM SIGN
> VIRAMA, U+0D31 ? MALAYALAM LETTER RRA>
>
>
> Best,
> ?? Liang Hai
> https://lianghai.github.io
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20191006/544f2b2e/attachment.html>

From unicode at unicode.org  Sun Oct  6 17:03:01 2019
From: unicode at unicode.org (Asmus Freytag (c) via Unicode)
Date: Sun, 6 Oct 2019 15:03:01 -0700
Subject: =?UTF-8?Q?Re=3a_Alternative_encodings_for_Malayalam_=e2=80=9cnta?=
 =?UTF-8?B?4oCd?=
In-Reply-To: <CAD8TiP6tEs6OCUrSu9jR=zZCrZ1arY8tRdy-T43yXVRm-NNycQ@mail.gmail.com>
References: <49FE1395-9719-45A4-B3F6-0DACF22D313D@gmail.com>
 <CAD8TiP6tEs6OCUrSu9jR=zZCrZ1arY8tRdy-T43yXVRm-NNycQ@mail.gmail.com>
Message-ID: <a99ee135-ebc1-10a7-1de8-6c8a8ed46558@ix.netcom.com>

Have you submitted that response as a UTC document?
A./

On 10/6/2019 2:08 PM, Cibu wrote:
> Thanks for addressing this. Here is my response: 
> https://docs.google.com/document/d/1K6L82VRmCGc9Fb4AOitNk4MT7Nu4V8aKUJo_1mW5X1o/
>
> In summary, my take is:
>
> The sequence <NA, VIRAMA, RRA> for ??? (<<chillu N, subscript RRA>>) 
> should not be legitimized as an alternate encoding; but should be 
> recognized as a prevailing non-standard legacy encoding.
>
>
> On Sun, Oct 6, 2019 at 7:57 PM ?? Liang Hai <lianghai at gmail.com 
> <mailto:lianghai at gmail.com>> wrote:
>
>     Folks,
>
>     (Microsoft Peter and Andrew, search for ?Windows? in the document.)
>
>     (Asmus, in the document there?s a section 5, /ICANN RZ-LGR
>     situation/?let me know if there?s some news.)
>
>     This is a pretty straightforward document about the notoriously
>     problematic encoding of Malayalam </chillu n/, bottom-side sign of
>     /rra/>. I always wanted to properly document this, so finally here
>     it is:
>
>         L2/19-345
>         <http://www.unicode.org/cgi-bin/GetMatchingDocs.pl?L2/19-345>
>         *Alternative encodings?for Malayalam "nta"*
>         Liang?Hai
>         2019-10-06
>
>
>     Unfortunately, as <NA, VIRAMA, RRA> has already become the de
>     facto standard encoding, now we have to recognize it in the Core
>     Spec. It?s a bit like another Tamil /sr?/ situation.
>
>     An excerpt of the proposal:
>
>         Document the following widely used encoding in
>         the?Core?Specification?as an alternative?representation for
>         Malayalam [glyph]?(<chillu n, bottom-side sign of?rra>)?that
>         is a special?case and?does not suggest any productive rule in
>         the encoding model:
>
>         <U+0D28???MALAYALAM LETTER NA, U+0D4D????MALAYALAM SIGN
>         VIRAMA,?U+0D31???MALAYALAM?LETTER RRA>
>
>
>     Best,
>     ?? Liang Hai
>     https://lianghai.github.io
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20191006/53a80e1d/attachment.html>

From unicode at unicode.org  Sun Oct  6 17:06:11 2019
From: unicode at unicode.org (Asmus Freytag (c) via Unicode)
Date: Sun, 6 Oct 2019 15:06:11 -0700
Subject: =?UTF-8?Q?Re=3a_Alternative_encodings_for_Malayalam_=e2=80=9cnta?=
 =?UTF-8?B?4oCd?=
In-Reply-To: <49FE1395-9719-45A4-B3F6-0DACF22D313D@gmail.com>
References: <49FE1395-9719-45A4-B3F6-0DACF22D313D@gmail.com>
Message-ID: <f66abf82-a9f3-8d03-8a17-7f7c605d10ac@ix.netcom.com>

On 10/6/2019 11:57 AM, ?? Liang Hai wrote:
> Folks,
>
> (Microsoft Peter and Andrew, search for ?Windows? in the document.)
>
> (Asmus, in the document there?s a section 5, /ICANN RZ-LGR 
> situation/?let me know if there?s some news.)

The issue, as it affects domain names, has been brought to the authors 
of the Malayalam Root Zone LGR proposal, the Neo-Brahmi Generation 
Panel; however, there is no new status to report at this time. I would 
appreciate if you could keep me updated on any details of the UTC 
decision (particularly those that do not make the rather terse UTC minutes).

A./


>
> This is a pretty straightforward document about the notoriously 
> problematic encoding of Malayalam </chillu n/, bottom-side sign of 
> /rra/>. I always wanted to properly document this, so finally here it is:
>
>     L2/19-345
>     <http://www.unicode.org/cgi-bin/GetMatchingDocs.pl?L2/19-345>
>     *Alternative encodings?for Malayalam "nta"*
>     Liang?Hai
>     2019-10-06
>
>
> Unfortunately, as <NA, VIRAMA, RRA> has already become the de facto 
> standard encoding, now we have to recognize it in the Core Spec. It?s 
> a bit like another Tamil /sr?/ situation.
>
> An excerpt of the proposal:
>
>     Document the following widely used encoding in
>     the?Core?Specification?as an alternative?representation for
>     Malayalam [glyph]?(<chillu n, bottom-side sign of?rra>)?that is a
>     special?case and?does not suggest any productive rule in the
>     encoding model:
>
>     <U+0D28???MALAYALAM LETTER NA, U+0D4D????MALAYALAM SIGN
>     VIRAMA,?U+0D31???MALAYALAM?LETTER RRA>
>
>
> Best,
> ?? Liang Hai
> https://lianghai.github.io
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20191006/f74d1213/attachment.html>

From unicode at unicode.org  Sun Oct  6 17:10:30 2019
From: unicode at unicode.org (Cibu via Unicode)
Date: Sun, 6 Oct 2019 23:10:30 +0100
Subject: =?UTF-8?Q?Re=3A_Alternative_encodings_for_Malayalam_=E2=80=9Cnta=E2=80=9D?=
In-Reply-To: <a99ee135-ebc1-10a7-1de8-6c8a8ed46558@ix.netcom.com>
References: <49FE1395-9719-45A4-B3F6-0DACF22D313D@gmail.com>
 <CAD8TiP6tEs6OCUrSu9jR=zZCrZ1arY8tRdy-T43yXVRm-NNycQ@mail.gmail.com>
 <a99ee135-ebc1-10a7-1de8-6c8a8ed46558@ix.netcom.com>
Message-ID: <CAD8TiP42Jbnr3uaCEJZBW5eG8AFpst+P57ZfWYCJLrrpBAS_qw@mail.gmail.com>

Yes; it is now available as L2/19-348
<http://www.unicode.org/cgi-bin/GetMatchingDocs.pl?L2/19-348>.

On Sun, Oct 6, 2019 at 11:03 PM Asmus Freytag (c) <asmusf at ix.netcom.com>
wrote:

> Have you submitted that response as a UTC document?
> A./
>
> On 10/6/2019 2:08 PM, Cibu wrote:
>
> Thanks for addressing this. Here is my response:
> https://docs.google.com/document/d/1K6L82VRmCGc9Fb4AOitNk4MT7Nu4V8aKUJo_1mW5X1o/
>
> In summary, my take is:
>
> The sequence <NA, VIRAMA, RRA> for ??? (<<chillu N, subscript RRA>>)
> should not be legitimized as an alternate encoding; but should be
> recognized as a prevailing non-standard legacy encoding.
>
>
> On Sun, Oct 6, 2019 at 7:57 PM ?? Liang Hai <lianghai at gmail.com> wrote:
>
>> Folks,
>>
>> (Microsoft Peter and Andrew, search for ?Windows? in the document.)
>>
>> (Asmus, in the document there?s a section 5, *ICANN RZ-LGR situation*?let
>> me know if there?s some news.)
>>
>> This is a pretty straightforward document about the notoriously
>> problematic encoding of Malayalam <*chillu n*, bottom-side sign of *rra*>.
>> I always wanted to properly document this, so finally here it is:
>>
>> L2/19-345 <http://www.unicode.org/cgi-bin/GetMatchingDocs.pl?L2/19-345>
>> *Alternative encodings for Malayalam "nta"*
>> Liang Hai
>> 2019-10-06
>>
>>
>> Unfortunately, as <NA, VIRAMA, RRA> has already become the de facto
>> standard encoding, now we have to recognize it in the Core Spec. It?s a bit
>> like another Tamil *sr?* situation.
>>
>> An excerpt of the proposal:
>>
>> Document the following widely used encoding in the Core Specification as
>> an alternative representation for Malayalam [glyph] (<chillu n, bottom-side
>> sign of rra>) that is a special case and does not suggest any productive
>> rule in the encoding model:
>>
>> <U+0D28 ? MALAYALAM LETTER NA, U+0D4D ?? MALAYALAM SIGN
>> VIRAMA, U+0D31 ? MALAYALAM LETTER RRA>
>>
>>
>> Best,
>> ?? Liang Hai
>> https://lianghai.github.io
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20191006/5705ffc4/attachment.html>

From unicode at unicode.org  Sun Oct  6 18:05:16 2019
From: unicode at unicode.org (Tex via Unicode)
Date: Sun, 6 Oct 2019 16:05:16 -0700
Subject: comma ellipses
Message-ID: <001501d57c9a$8c41ada0$a4c508e0$@xencraft.com>

Now that comma ellipses (,,,) are a thing (at least on social media) do we need a character proposal?

 
Asking for a friend,,, J

 
tex

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20191006/7bac1e09/attachment.html>

From unicode at unicode.org  Sun Oct  6 19:01:04 2019
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Sun, 6 Oct 2019 17:01:04 -0700
Subject: comma ellipses
In-Reply-To: <001501d57c9a$8c41ada0$a4c508e0$@xencraft.com>
References: <001501d57c9a$8c41ada0$a4c508e0$@xencraft.com>
Message-ID: <f0b80760-bd43-3218-8b3e-abc71dd4b1db@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20191006/9bd0130b/attachment.html>

From unicode at unicode.org  Sun Oct  6 22:21:06 2019
From: unicode at unicode.org (Garth Wallace via Unicode)
Date: Sun, 6 Oct 2019 20:21:06 -0700
Subject: comma ellipses
In-Reply-To: <f0b80760-bd43-3218-8b3e-abc71dd4b1db@ix.netcom.com>
References: <001501d57c9a$8c41ada0$a4c508e0$@xencraft.com>
 <f0b80760-bd43-3218-8b3e-abc71dd4b1db@ix.netcom.com>
Message-ID: <CA+p4_H3P3o_f2ruMT7Rxhn4wM9B7Oqzp8t1=5VScHArR_Jc8tg@mail.gmail.com>

It?s deliberately incorrect for humorous effect. It gets used, but making
it ?official? would almost defeat the purpose.

On Sun, Oct 6, 2019 at 5:02 PM Asmus Freytag via Unicode <
unicode at unicode.org> wrote:

> On 10/6/2019 4:05 PM, Tex via Unicode wrote:
>
> Now that comma ellipses (,,,) are a thing (at least on social media) do we
> need a character proposal?
>
>
>
> Asking for a friend,,, J
>
>
>
> tex
>
> I thought the main reason we ended up with the period (dot) one is because
> it was originally needed for CJK-style fixed grid layout purposes. But It
> could be wrong.
>
> What's the current status for 3-dot ellipsis. Does it get used? Do we have
> autocorrect for it? If so, that would argue that implementers have settled
> and any derivative usage (comma) should be kept compatible.
>
>
> A./
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20191006/1b07b7bf/attachment.html>

From unicode at unicode.org  Mon Oct  7 00:20:48 2019
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Sun, 6 Oct 2019 22:20:48 -0700
Subject: comma ellipses
In-Reply-To: <CA+p4_H3P3o_f2ruMT7Rxhn4wM9B7Oqzp8t1=5VScHArR_Jc8tg@mail.gmail.com>
References: <001501d57c9a$8c41ada0$a4c508e0$@xencraft.com>
 <f0b80760-bd43-3218-8b3e-abc71dd4b1db@ix.netcom.com>
 <CA+p4_H3P3o_f2ruMT7Rxhn4wM9B7Oqzp8t1=5VScHArR_Jc8tg@mail.gmail.com>
Message-ID: <af84df60-8b5f-783d-3cf0-9e736d5febf9@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20191006/a5eaa030/attachment.html>

From unicode at unicode.org  Mon Oct  7 00:39:07 2019
From: unicode at unicode.org (Tex via Unicode)
Date: Sun, 6 Oct 2019 22:39:07 -0700
Subject: comma ellipses
In-Reply-To: <af84df60-8b5f-783d-3cf0-9e736d5febf9@ix.netcom.com>
References: <001501d57c9a$8c41ada0$a4c508e0$@xencraft.com>
 <f0b80760-bd43-3218-8b3e-abc71dd4b1db@ix.netcom.com>
 <CA+p4_H3P3o_f2ruMT7Rxhn4wM9B7Oqzp8t1=5VScHArR_Jc8tg@mail.gmail.com>
 <af84df60-8b5f-783d-3cf0-9e736d5febf9@ix.netcom.com>
Message-ID: <000601d57cd1$89d897e0$9d89c7a0$@xencraft.com>

Just for additional info on the subject:

 
https://www.theguardian.com/science/2019/oct/05/linguist-gretchen-mcculloch-interview-because-internet-book

 
??I?ve been spending a fair bit of time recently with the comma ellipsis, which is three commas (,,,) instead of dot-dot-dot. I?ve been looking at it for over a year and I?m still figuring out what?s going on there. There seems to be something but possibly several somethings.

One use is by older people who, in some cases where they would use the classic ellipsis, use commas instead. It?s not quite clear if that?s a typo in some cases, but it seems to be more systematic than that. Maybe they?re preferring the comma because it?s a little bit easier to see if you?re on the older side, and your vision is not what it once was. Or maybe they just see the two as equivalent. It then seems to have jumped the shark into parody form. There?s a Facebook group in which younger people pretend to be to be baby boomers, and one of the features people use there is this comma ellipsis. And then in some circles there also seems to be a use of comma ellipses that is very, very heavily ironic. But what exactly the nature is of that heavy irony is still something that I?m working on figuring out?.?

 
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Asmus Freytag via Unicode
Sent: Sunday, October 6, 2019 10:21 PM
To: unicode at unicode.org
Subject: Re: comma ellipses

 
On 10/6/2019 8:21 PM, Garth Wallace via Unicode wrote:

It?s deliberately incorrect for humorous effect. It gets used, but making it ?official? would almost defeat the purpose.

Well then it should encode a "typographically incorrect" comma ellipsis :)

A./

 
On Sun, Oct 6, 2019 at 5:02 PM Asmus Freytag via Unicode <unicode at unicode.org> wrote:

On 10/6/2019 4:05 PM, Tex via Unicode wrote:

Now that comma ellipses (,,,) are a thing (at least on social media) do we need a character proposal?

 
Asking for a friend,,, J

 
tex

I thought the main reason we ended up with the period (dot) one is because it was originally needed for CJK-style fixed grid layout purposes. But It could be wrong.

What's the current status for 3-dot ellipsis. Does it get used? Do we have autocorrect for it? If so, that would argue that implementers have settled and any derivative usage (comma) should be kept compatible.

 
A./

 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20191006/a92dc320/attachment.html>

From unicode at unicode.org  Mon Oct  7 00:59:17 2019
From: unicode at unicode.org (David Starner via Unicode)
Date: Sun, 6 Oct 2019 22:59:17 -0700
Subject: comma ellipses
In-Reply-To: <000601d57cd1$89d897e0$9d89c7a0$@xencraft.com>
References: <001501d57c9a$8c41ada0$a4c508e0$@xencraft.com>
 <f0b80760-bd43-3218-8b3e-abc71dd4b1db@ix.netcom.com>
 <CA+p4_H3P3o_f2ruMT7Rxhn4wM9B7Oqzp8t1=5VScHArR_Jc8tg@mail.gmail.com>
 <af84df60-8b5f-783d-3cf0-9e736d5febf9@ix.netcom.com>
 <000601d57cd1$89d897e0$9d89c7a0$@xencraft.com>
Message-ID: <CAMZ=zj5_j58hm_joYtswY6iVHhyRPFCLMCgjhR6QRjA4VcE=ww@mail.gmail.com>

I still see the encoding of the original ellipsis as a mistake,
probably for compatibility with some older standard that included it
because the system wasn't smart enough to intelligently handle "..."
as ellipsis.

-- 
Kie ekzistas vivo, ekzistas espero.

From unicode at unicode.org  Mon Oct  7 01:09:50 2019
From: unicode at unicode.org (Asmus Freytag (c) via Unicode)
Date: Sun, 6 Oct 2019 23:09:50 -0700
Subject: comma ellipses
In-Reply-To: <000601d57cd1$89d897e0$9d89c7a0$@xencraft.com>
References: <001501d57c9a$8c41ada0$a4c508e0$@xencraft.com>
 <f0b80760-bd43-3218-8b3e-abc71dd4b1db@ix.netcom.com>
 <CA+p4_H3P3o_f2ruMT7Rxhn4wM9B7Oqzp8t1=5VScHArR_Jc8tg@mail.gmail.com>
 <af84df60-8b5f-783d-3cf0-9e736d5febf9@ix.netcom.com>
 <000601d57cd1$89d897e0$9d89c7a0$@xencraft.com>
Message-ID: <543956a9-6d30-617d-c527-0d7ffd2aa7f2@ix.netcom.com>

Now you are introducing research - that kills all the fun . . . (oops , 
, , )
A./

On 10/6/2019 10:39 PM, Tex wrote:
>
> Just for additional info on the subject:
>
> https://www.theguardian.com/science/2019/oct/05/linguist-gretchen-mcculloch-interview-because-internet-book
>
> ??I?ve been spending a fair bit of time recently with the comma 
> ellipsis, which is three commas (,,,) instead of dot-dot-dot. I?ve 
> been looking at it for over a year and I?m still figuring out what?s 
> going on there. There seems to be something but possibly several 
> somethings.
>
> One use is by older people who, in some cases where they would use the 
> classic ellipsis, use commas instead. It?s not quite clear if that?s a 
> typo in some cases, but it seems to be more systematic than that. 
> Maybe they?re preferring the comma because it?s a little bit easier to 
> see if you?re on the older side, and your vision is not what it once 
> was. Or maybe they just see the two as equivalent. It then seems to 
> have jumped the shark into parody form. There?s a Facebook group in 
> which younger people pretend to be to be baby boomers, and one of the 
> features people use there is this comma ellipsis. And then in some 
> circles there also seems to be a use of comma ellipses that is very, 
> very heavily ironic. But what exactly the nature is of that heavy 
> irony is still something that I?m working on figuring out?.?
>
> *From:*Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of 
> *Asmus Freytag via Unicode
> *Sent:* Sunday, October 6, 2019 10:21 PM
> *To:* unicode at unicode.org
> *Subject:* Re: comma ellipses
>
> On 10/6/2019 8:21 PM, Garth Wallace via Unicode wrote:
>
>     It?s deliberately incorrect for humorous effect. It gets used, but
>     making it ?official? would almost defeat the purpose.
>
> Well then it should encode a "typographically incorrect" comma ellipsis :)
>
> A./
>
>     On Sun, Oct 6, 2019 at 5:02 PM Asmus Freytag via Unicode
>     <unicode at unicode.org <mailto:unicode at unicode.org>> wrote:
>
>         On 10/6/2019 4:05 PM, Tex via Unicode wrote:
>
>             Now that comma ellipses (,,,) are a thing (at least on
>             social media) do we need a character proposal?
>
>             Asking for a friend,,, J
>
>             tex
>
>         I thought the main reason we ended up with the period (dot)
>         one is because it was originally needed for CJK-style fixed
>         grid layout purposes. But It could be wrong.
>
>         What's the current status for 3-dot ellipsis. Does it get
>         used? Do we have autocorrect for it? If so, that would argue
>         that implementers have settled and any derivative usage
>         (comma) should be kept compatible.
>
>         A./
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20191006/57f9233b/attachment.html>

From unicode at unicode.org  Mon Oct  7 01:30:16 2019
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Sun, 6 Oct 2019 23:30:16 -0700
Subject: comma ellipses
In-Reply-To: <CAMZ=zj5_j58hm_joYtswY6iVHhyRPFCLMCgjhR6QRjA4VcE=ww@mail.gmail.com>
References: <001501d57c9a$8c41ada0$a4c508e0$@xencraft.com>
 <f0b80760-bd43-3218-8b3e-abc71dd4b1db@ix.netcom.com>
 <CA+p4_H3P3o_f2ruMT7Rxhn4wM9B7Oqzp8t1=5VScHArR_Jc8tg@mail.gmail.com>
 <af84df60-8b5f-783d-3cf0-9e736d5febf9@ix.netcom.com>
 <000601d57cd1$89d897e0$9d89c7a0$@xencraft.com>
 <CAMZ=zj5_j58hm_joYtswY6iVHhyRPFCLMCgjhR6QRjA4VcE=ww@mail.gmail.com>
Message-ID: <99abf2a4-6195-4ef7-59c5-d3cbf52ed7cd@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20191006/c6093ca4/attachment.html>

From unicode at unicode.org  Mon Oct  7 02:49:05 2019
From: unicode at unicode.org (=?UTF-8?B?V8OhbmcgWWlmw6Fu?= via Unicode)
Date: Mon, 7 Oct 2019 16:49:05 +0900
Subject: worldswritingsystems.org
In-Reply-To: <8295C9C0-05BB-44F6-88D9-21B1B34AD902@bergerhausen.com>
References: <8295C9C0-05BB-44F6-88D9-21B1B34AD902@bergerhausen.com>
Message-ID: <CAF5KyExp1erf2B0P27-muCzbdbsuYan0xdRwDXr79nmXu8+0OQ@mail.gmail.com>

A very comprehensive website, but a couple of details have come to my attention.
- Bronze script: despite its traditional name, it's hard to say that
it is a single consistent system (just jump to WP via your link). They
are various "scripts" at best, only grouped by writing medium which
was the only thing sure at the early stage of study.
- Seal script: while I'm not sure strictly what the dates stand for,
121 CE is when the oldest extant dictionary of it was compiled
(because its usage had declined) and not likely when its usage
started. The dated use goes back to around 200 BCE, and some unearthed
materials apparently predates it by some centuries. The end date is
much more puzzling. AFAIK there's no essential gap between 20c and
today; you can either say it was officially obsolete before 121, or
still has ritual use to this day.

2019?10?4?(?) 0:15 Johannes Bergerhausen via Unicode <unicode at unicode.org>:
>
> Dear list,
>
> FYI: we?ve updated http://worldswritingsystems.org to Unicode 12.1 and fixed a few little bugs and errors.
>
> All the best,
> Johannes
>
>
>
>
> ? Helmig Bergerhausen
>
> Gladbacher Stra?e 40, D-50672 K?ln, Germany
>
> www.helmigbergerhausen.de
>
> ? Prof. Bergerhausen
>
> Hochschule Mainz, School of Design
>
> Holzstra?e 36, D-55116 Mainz, Germany
>
> www.worldswritingsystems.org
> www.decodeunicode.org
> www.designlabor-gutenberg.de
> www.hs-mainz.de/gestaltung
>


From unicode at unicode.org  Mon Oct  7 03:42:56 2019
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Mon, 7 Oct 2019 10:42:56 +0200
Subject: comma ellipses
In-Reply-To: <001501d57c9a$8c41ada0$a4c508e0$@xencraft.com>
References: <001501d57c9a$8c41ada0$a4c508e0$@xencraft.com>
Message-ID: <CAGa7JC2A2bmG08--qdxxn6TevGCdGsCxE22P=78bSNzeutrhgw@mail.gmail.com>

Commas may be used instead of dots by users of French keyboards (it's
easier to type the comma, when the dot/full stop requires pressing the
SHIFT key).
I may be wrong, but I've quire frequently seen commas or semicolons instead
of dot/full stops under normal orthography.
But the web and notably social networks can invent their own "rule":
pretending that the dot/full stop at end of sentence is "aggressive" is
probably a deviation from the English-only designation of the dot as a
"full stop", reinterpreted as "stop talking about this, my sentence is
final, I don't want to give more justification" (when for such case the
user should have better used the exclamation mark!)

Anyway I've never liked the 3-dot ellipsis which just occurs in Unicode for
compatiblity with fixed-width fonts on terminals, just to compact 3 cells
into one (or in CJK styles to replace the "bubble" dots with their 1/2 cell
gap on the right side of each cell, contracting them to three smaller dots
in just one CJK cell).

But another reason could be that using commas instead of dots allows
distinguishing the ellipsis from an abbreviation dot used jut before it. Or
making the distinction to explicitly mark the end of sentence by a regular
dot/full stop after the ellipsis, when the ellipsis could be used in the
middle of a sentence (no clear distinction when what follows the ellipsis
is a proper name starting by a capital or not a word: where is the end of
sentence?) and for which the alternative using comma ellipsis would
explicitly say that the ellipsis does not terminate the sentence as in "I
need to spend $2... $4 to return" (one sentences, the meaning is different
from "I need to spend $2,,, $4 to return" where that comma ellipsis would
be an abbreviation for "between $2 and $4").

Anyway, people of the right to use commas if they prefer it for the
semantics they intend to distinguish. This does not mean that we need to
encode this sequence as a separate unbreakable character like it was done
for the dot ellipsis. Otherwise, we would have to encode "etc." also as a
single character, or we would end up adding also many more leader dots (in
classic metal types regular dots/fullstops were used, but some type
compositors may have liked to use mount a single "..." character to avoid
having to keep them glued or keep them regularly spaced with special
spacers when justifying lines mechanically: this saved them a little time
for compositing rows of metal types). There's no real need for CJK or for
monospaced terminals to get a more compact presentation. And for regular
text, just using multiple separate commas will still render as intended.
And metal types are no longer used.

Personnally I don't like the 3-dot ellipsis character because it plays
badly even in monospaced fonts. And there's no demonstrated use where a
single 3-commas ellipsis character would have to be distinguished
semantically and visually from 3 separate commas.

If people want to use ",,," for their informal speech on social networks,
or in chat sessions, they can do that today without needing any new
character and a new keyboard layout or input method. And nobody will really
know if this ",,," was mistyped instead of "..." to avoid pressing SHIFT on
a French AZERTY keyboard (not extended by a numeric keypad where the
dot/full stop may also be typed easily without SHIFT). As well a French
typist could have used ";;;" with semicolons when forgetting to press the
SHIFT key.

If we encode ",,," as a single character, then why not "???" or "!!!", or
"----", or "**", or and many other variants mixing multiple punctuation
signs or symbols (like "$$" as an "angry" mark or the abbreviation for
"costly", then also "??" or "??"...) Then also why not "eeeeeee" or
"hmmmmmmmm" for noting hesitations? This would become endless, without any
limit: Unicode would ten start encoding millions of whole words of
thousands languages as single characters, much more than the whole existing
set of CJK ideographs (including its extensions in nearly two planes).
Interoperability would worsen.


Le lun. 7 oct. 2019 ? 01:14, Tex via Unicode <unicode at unicode.org> a ?crit :

> Now that comma ellipses (,,,) are a thing (at least on social media) do we
> need a character proposal?
>
>
>
> Asking for a friend,,, J
>
>
>
> tex
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20191007/ba660478/attachment.html>

From unicode at unicode.org  Mon Oct  7 15:05:08 2019
From: unicode at unicode.org (=?utf-8?B?5qKB5rW3IExpYW5nIEhhaQ==?= via Unicode)
Date: Mon, 7 Oct 2019 13:05:08 -0700
Subject: =?utf-8?Q?Re=3A_Alternative_encodings_for_Malayalam_=E2=80=9Cnta?=
 =?utf-8?Q?=E2=80=9D?=
In-Reply-To: <DD53F22C-EA80-40EE-BCE7-366E5A0E6FCB@gmail.com>
References: <49FE1395-9719-45A4-B3F6-0DACF22D313D@gmail.com>
 <CAD8TiP6tEs6OCUrSu9jR=zZCrZ1arY8tRdy-T43yXVRm-NNycQ@mail.gmail.com>
 <a99ee135-ebc1-10a7-1de8-6c8a8ed46558@ix.netcom.com>
 <CAD8TiP42Jbnr3uaCEJZBW5eG8AFpst+P57ZfWYCJLrrpBAS_qw@mail.gmail.com>
 <DD53F22C-EA80-40EE-BCE7-366E5A0E6FCB@gmail.com>
Message-ID: <5E95EE1F-F41E-4407-AE44-BFFD1146DEAD@gmail.com>

[Putting the public mailing list back to the recipient list.]

Cibu,

Thanks for your L2/19-348 <https://www.unicode.org/L2/L2019/19348-malayalam-response.pdf> (Response to L2/19-345). My comments:

> I am curious to know the reference for the phonetic analysis described in section ?A chillu-less analysis in the proposal L2/19-345. How can a phonetic analysis be the basis for an important double encoding decision?

The basis is not the phonetic analysis (the phonetic analysis is only provided in the document as an fyi, so readers understand why many people use it), but the fact of a widespread alternative encoding.

Basically we need to properly recognize the failure of ensuring a single, ideal encoding. It?s not helpful to keep the Core Spec detached from the reality.

> Anycase, the sequence implied by this particular analysis is an artifact of the evolution of Unicode for Malayalam; it is not grounded in any prior writing traditions or academic literature.

We?re not talking about legitimacy of the phonetic encoding.

> In Malayalam, dental /n?/ and alveolar /n/ are not allophones as implied in the proposal.

I actually didn?t suggest any allophone relationship, on purpose. If it?s helpful, I can change the ?~? notation in ?[n?a ~ na]? (and [ra ~ ta]) to ?/? or ?,? in a revision.

> So using <NA, VIRAMA> for CHILLU N is not phonetically accurate.

This is not a valid argument (see the next paragraph), although accuracy is not relevant  anyway (as I said, I was trying to explain why people use <NA, VIRAMA, RRA>, not trying to legitimize it.).

The written form ? is the syllable-coda specific form of the written form ?, and the pronunciation of ? being limited to [n] is a result of Malayalam?s phonology ([n?] not usually appearing in a syllable-coda position, unless preceding another dental sound).

The reason for ?? being used in the phonetic encoding is mostly because ? is not considered to be eligible for conjunct forming, and ?? is the natural fallback. Again, I?m not trying to legitimize the encoding, but only explaining my observation of the widespread encoding.

> Moreover, if you show the visual ???? (<<NA, visual VIRAMA, RRA>>) to a native user (who is unaware of Unicode particulars), they will not identify it as (<<chillu N, subscript RRA>> /nt?/); instead, they would read it as /n?r?/.

Not relevant. I avoided ?????? particularly for this kind of argument. The?? was only there to mark an inherent vowel suppressed ?. I almost avoided?? altogether because of its ambiguity, but didn?t do it, because that would make the document too obscure. The point of an an inherent vowel suppressed ? is used in the phonetic encoding, and?? just happens to be used there.

> This proposal does not address the remaining chillu conjuncts described in ?L2/19-086R?.

The document doesn?t propose any productive encoding rule. Why does it need to address other cases?

> It also does not address the legacy sequence supported by MS Windows <NA, VIRAMA, ZWJ, RRA> for (<<chillu N, subscript RRA>>).

I can make it clearer that <NA, VIRAMA, ZWJ, RRA> is just plainly unacceptable as it clashes with our general rule of chillu not forming a conjunct with its following letter automatically (without a conjoiner), in Section 4, Real-world encodings.

> I am not sure how this proposal is going to solve the issue of inadequate support for <CHILLU N, VIRAMA, RRA>, without explicitly rescinding this sequence. Double encoding for (<<chillu N, subscript RRA>>) is not going to solve any issue, if not, making the issue more acute. Double encoding is never a desirable quality for Unicode. So the decision should not be taken lightly or hastly. It needs to be clearly thought through, probably through a PRI.

Double encoding will not be solved. The proposal is about recognizing the reality of failure. With Windows on the loose for so many years, we?ve already missed the opportunity of ensuring a single encoding for the written form.

Now the standard needs to first recognize the widespread encoding that won?t go away, so implementers are informed. Then we see which direction we should push Microsoft and Apple to converge.

I agree that the Unicode Standard might need to have a clear disposition/preference between the graphic and phonetic encodings, so the two are not considered to be just equal, so we can have a direction for pushing the implementations to converge.

> Prior to Unicode 5.2, the encoding of the cluster [glyph] (<<chillu N, subscript RRA>> /nt?/) was not clearly defined. ?

You mean 5.1, right? The encoding has been specified since 5.1.

> ? and <NA, VIRAMA, ZWJ, RRA> ?

How can implementations support this encoding without breaking the side-by-side form ?? though?

Best,
?? Liang Hai
https://lianghai.github.io <https://lianghai.github.io/>

>> On Oct 6, 2019, at 15:10, Cibu <cibucj at gmail.com <mailto:cibucj at gmail.com>> wrote:
>> 
>> Yes; it is now available as L2/19-348 <http://www.unicode.org/cgi-bin/GetMatchingDocs.pl?L2/19-348>.
>> 
>> On Sun, Oct 6, 2019 at 11:03 PM Asmus Freytag (c) <asmusf at ix.netcom.com <mailto:asmusf at ix.netcom.com>> wrote:
>> Have you submitted that response as a UTC document?
>> A./
>> 
>> On 10/6/2019 2:08 PM, Cibu wrote:
>>> Thanks for addressing this. Here is my response: https://docs.google.com/document/d/1K6L82VRmCGc9Fb4AOitNk4MT7Nu4V8aKUJo_1mW5X1o/ <https://docs.google.com/document/d/1K6L82VRmCGc9Fb4AOitNk4MT7Nu4V8aKUJo_1mW5X1o/>
>>> 
>>> In summary, my take is:
>>> 
>>> The sequence <NA, VIRAMA, RRA> for ??? (<<chillu N, subscript RRA>>) should not be legitimized as an alternate encoding; but should be recognized as a prevailing non-standard legacy encoding.
>>> 
>>> 
>>> On Sun, Oct 6, 2019 at 7:57 PM ?? Liang Hai <lianghai at gmail.com <mailto:lianghai at gmail.com>> wrote:
>>> Folks,
>>> 
>>> (Microsoft Peter and Andrew, search for ?Windows? in the document.)
>>> 
>>> (Asmus, in the document there?s a section 5, ICANN RZ-LGR situation?let me know if there?s some news.)
>>> 
>>> This is a pretty straightforward document about the notoriously problematic encoding of Malayalam <chillu n, bottom-side sign of rra>. I always wanted to properly document this, so finally here it is:
>>> 
>>> L2/19-345 <http://www.unicode.org/cgi-bin/GetMatchingDocs.pl?L2/19-345>
>>> Alternative encodings for Malayalam "nta"
>>> Liang Hai
>>> 2019-10-06
>>> 
>>> Unfortunately, as <NA, VIRAMA, RRA> has already become the de facto standard encoding, now we have to recognize it in the Core Spec. It?s a bit like another Tamil sr? situation.
>>> 
>>> An excerpt of the proposal:
>>> 
>>> Document the following widely used encoding in the Core Specification as an alternative representation for Malayalam [glyph] (<chillu n, bottom-side sign of rra>) that is a special case and does not suggest any productive rule in the encoding model:
>>> 
>>> <U+0D28 ? MALAYALAM LETTER NA, U+0D4D ?? MALAYALAM SIGN VIRAMA, U+0D31 ? MALAYALAM LETTER RRA>
>>> 
>>> Best,
>>> ?? Liang Hai
>>> https://lianghai.github.io <https://lianghai.github.io/>
>> 
> 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20191007/afb69215/attachment.html>

From unicode at unicode.org  Tue Oct  8 09:25:34 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Tue, 8 Oct 2019 15:25:34 +0100
Subject: Pure Regular Expression Engines and Literal Clusters
Message-ID: <20191008152534.2068db6c@JRWUBU2>

I've been puzzling over how a pure regular expression engine that works
via a non-deterministic finite automaton can be bent to accommodate
'literal clusters' as in Requirement RL2.2 'Extended Grapheme Clusters'
of UTS#18 'Unicode Regular Expressions' - "To meet this requirement, an
implementation shall provide a mechanism for matching against an
arbitrary extended grapheme cluster, a literal cluster, and matching
extended grapheme cluster boundaries."  It works from a regular
expression by stitching together the FSMs corresponding to its elements.

An example UTS#18 gives for matching a literal cluster can be simplified
to, in its notation:

[c \q{ch}]

This is interpreted as 'match against "ch" if possible, otherwise
against "c".  Thus the strings "ca" and "cha" would both match the
expression

[c \q{ch}]a

while "chh" but not "ch" would match against

[c \q{ch}]h

Or have I got this wrong?

Thus, while "[c \q{ch}]" may be a regex, it is clearly not any notation
for a regular expression in the mathematical sense.

It seems to me that this expression requires backtracking, which is
totally alien to the design of the regular expression engine.  One
problem then is that the engine supports both the union and
intersection of regular languages.  While algebraic manipulation might
raise union to the highest level, eliminating intersection is an
expensive operation which I have deliberately avoided.  While
backtracking is feasible if state progression has been restricted to
the FSM for a literal cluster, it is far more difficult if multiple
FSMs have been running in parallel.

As the engine fully respects canonical equivalence (with the result
that it can find an accented letter of the Vietnamese alphabet even if
it bears a subscript tone mark), concatenated subexpressions can
divide the input streams between them.  Consequently, the
backtracking mechanism gets complicated.

May I correctly argue instead that matching against literal clusters
would be satisfied by instead supporting, for this example, the regular
subexpression "(c|ch)" or the UnicodeSet expression "[c{ch}]"?

Richard.


From unicode at unicode.org  Wed Oct  9 02:04:35 2019
From: unicode at unicode.org (Cibu via Unicode)
Date: Wed, 9 Oct 2019 08:04:35 +0100
Subject: =?UTF-8?Q?Re=3A_Alternative_encodings_for_Malayalam_=E2=80=9Cnta=E2=80=9D?=
In-Reply-To: <5E95EE1F-F41E-4407-AE44-BFFD1146DEAD@gmail.com>
References: <49FE1395-9719-45A4-B3F6-0DACF22D313D@gmail.com>
 <CAD8TiP6tEs6OCUrSu9jR=zZCrZ1arY8tRdy-T43yXVRm-NNycQ@mail.gmail.com>
 <a99ee135-ebc1-10a7-1de8-6c8a8ed46558@ix.netcom.com>
 <CAD8TiP42Jbnr3uaCEJZBW5eG8AFpst+P57ZfWYCJLrrpBAS_qw@mail.gmail.com>
 <DD53F22C-EA80-40EE-BCE7-366E5A0E6FCB@gmail.com>
 <5E95EE1F-F41E-4407-AE44-BFFD1146DEAD@gmail.com>
Message-ID: <CAD8TiP5GtbCFeOV8Vm4wWSRJguRQTw3Oke6a=jj4LjiO5rGNYQ@mail.gmail.com>

On Mon, Oct 7, 2019 at 9:05 PM ?? Liang Hai <lianghai at gmail.com> wrote:

>
> Prior to Unicode 5.2, the encoding of the cluster [glyph] (<<chillu
> N, subscript RRA>> /nt?/) was not clearly defined. ?
>
>
> You mean 5.1, right? The encoding has been specified since 5.1.
>

I couldn't get the text for 5.1 from
https://www.unicode.org/versions/Unicode5.1.0. So I had to specify 5.2 for
which the text is clear in
https://www.unicode.org/versions/Unicode5.2.0/ch09.pdf


>
> ? and <NA, VIRAMA, ZWJ, RRA> ?
>
>
> How can implementations support this encoding without breaking the
> side-by-side form ?? though?
>

Here is the difference between our approaches. You probably are trying to
say that <NA, VIRAMA, RRA> is a valid sequence and hence the requirement of
being non-conflicting with the rest. I am not recommending that. I just
wanted to document the fact there is significant usage of <NA, VIRAMA, RRA>
for stacked ??? and <NA, VIRAMA, ZWJ, RRA>, to a lesser degree. Fonts may
or may not resolve the conflict of <NA, VIRAMA, ZWJ, RRA> sequence.
However, higher level systems may be able to resolve it by additional
context information. We should also continue to specify that <CHILLU N,
VIRAMA, RRA> is the standard sequence to help the input methods and other
normalisation logic.


>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20191009/9e8fa422/attachment.html>

From unicode at unicode.org  Wed Oct  9 12:00:48 2019
From: unicode at unicode.org (=?utf-8?B?5qKB5rW3IExpYW5nIEhhaQ==?= via Unicode)
Date: Wed, 9 Oct 2019 10:00:48 -0700
Subject: =?utf-8?Q?Re=3A_Alternative_encodings_for_Malayalam_=E2=80=9Cnta?=
 =?utf-8?Q?=E2=80=9D?=
In-Reply-To: <CAD8TiP5GtbCFeOV8Vm4wWSRJguRQTw3Oke6a=jj4LjiO5rGNYQ@mail.gmail.com>
References: <49FE1395-9719-45A4-B3F6-0DACF22D313D@gmail.com>
 <CAD8TiP6tEs6OCUrSu9jR=zZCrZ1arY8tRdy-T43yXVRm-NNycQ@mail.gmail.com>
 <a99ee135-ebc1-10a7-1de8-6c8a8ed46558@ix.netcom.com>
 <CAD8TiP42Jbnr3uaCEJZBW5eG8AFpst+P57ZfWYCJLrrpBAS_qw@mail.gmail.com>
 <DD53F22C-EA80-40EE-BCE7-366E5A0E6FCB@gmail.com>
 <5E95EE1F-F41E-4407-AE44-BFFD1146DEAD@gmail.com>
 <CAD8TiP5GtbCFeOV8Vm4wWSRJguRQTw3Oke6a=jj4LjiO5rGNYQ@mail.gmail.com>
Message-ID: <3E7E6D66-D868-44D6-89C9-432E1AA035E3@gmail.com>

> On Oct 9, 2019, at 00:04, Cibu <cibucj at gmail.com> wrote:
> 
> On Mon, Oct 7, 2019 at 9:05 PM ?? Liang Hai <lianghai at gmail.com <mailto:lianghai at gmail.com>> wrote:
> 
>> Prior to Unicode 5.2, the encoding of the cluster [glyph] (<<chillu N, subscript RRA>> /nt?/) was not clearly defined. ?
> 
> You mean 5.1, right? The encoding has been specified since 5.1.
> 
> I couldn't get the text for 5.1 from https://www.unicode.org/versions/Unicode5.1.0 <https://www.unicode.org/versions/Unicode5.1.0>. So I had to specify 5.2 for which the text is clear in https://www.unicode.org/versions/Unicode5.2.0/ch09.pdf <https://www.unicode.org/versions/Unicode5.2.0/ch09.pdf>
Oh the Core Spec?s 5.0 -> 5.1 delta is presented on the webpage itself, but not incorporated into the PDF:

https://unicode.org/versions/Unicode5.1.0/#Malayalam_Chillu_Characters

>> ? and <NA, VIRAMA, ZWJ, RRA> ?
> 
> How can implementations support this encoding without breaking the side-by-side form ?? though?
> 
> Here is the difference between our approaches. You probably are trying to say that <NA, VIRAMA, RRA> is a valid sequence and hence the requirement of being non-conflicting with the rest. I am not recommending that. I just wanted to document the fact there is significant usage of <NA, VIRAMA, RRA> for stacked ??? and <NA, VIRAMA, ZWJ, RRA>, to a lesser degree. Fonts may or may not resolve the conflict of <NA, VIRAMA, ZWJ, RRA> sequence. However, higher level systems may be able to resolve it by additional context information. We should also continue to specify that <CHILLU N, VIRAMA, RRA> is the standard sequence to help the input methods and other normalisation logic.

Right, I see. This aligns with the comments I received at the plenary discussion too. Gonna include both unideal encodings in a piece of proposed Core Spec edit, in a revised document.

Best,
?? Liang Hai
https://lianghai.github.io

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20191009/5c59d014/attachment.html>

From unicode at unicode.org  Thu Oct 10 11:37:12 2019
From: unicode at unicode.org (Cibu via Unicode)
Date: Thu, 10 Oct 2019 17:37:12 +0100
Subject: =?UTF-8?Q?Re=3A_Alternative_encodings_for_Malayalam_=E2=80=9Cnta=E2=80=9D?=
In-Reply-To: <3E7E6D66-D868-44D6-89C9-432E1AA035E3@gmail.com>
References: <49FE1395-9719-45A4-B3F6-0DACF22D313D@gmail.com>
 <CAD8TiP6tEs6OCUrSu9jR=zZCrZ1arY8tRdy-T43yXVRm-NNycQ@mail.gmail.com>
 <a99ee135-ebc1-10a7-1de8-6c8a8ed46558@ix.netcom.com>
 <CAD8TiP42Jbnr3uaCEJZBW5eG8AFpst+P57ZfWYCJLrrpBAS_qw@mail.gmail.com>
 <DD53F22C-EA80-40EE-BCE7-366E5A0E6FCB@gmail.com>
 <5E95EE1F-F41E-4407-AE44-BFFD1146DEAD@gmail.com>
 <CAD8TiP5GtbCFeOV8Vm4wWSRJguRQTw3Oke6a=jj4LjiO5rGNYQ@mail.gmail.com>
 <3E7E6D66-D868-44D6-89C9-432E1AA035E3@gmail.com>
Message-ID: <CAD8TiP4=x37-74=vDorbJa0GKB1JZUxRtN5z4_RURnsay5hbXA@mail.gmail.com>

>
> Oh the Core Spec?s 5.0 -> 5.1 delta is presented on the webpage itself,
> but not incorporated into the PDF:
>
> https://unicode.org/versions/Unicode5.1.0/#Malayalam_Chillu_Characters
>
>
Thanks for pointing this out. ?? I had missed it.


> Here is the difference between our approaches. You probably are trying to
> say that <NA, VIRAMA, RRA> is a valid sequence and hence the requirement of
> being non-conflicting with the rest. I am not recommending that. I just
> wanted to document the fact there is significant usage of <NA, VIRAMA, RRA>
> for stacked ??? and <NA, VIRAMA, ZWJ, RRA>, to a lesser degree. Fonts may
> or may not resolve the conflict of <NA, VIRAMA, ZWJ, RRA> sequence.
> However, higher level systems may be able to resolve it by additional
> context information. We should also continue to specify that <CHILLU N,
> VIRAMA, RRA> is the standard sequence to help the input methods and other
> normalisation logic.
>
>
> Right, I see. This aligns with the comments I received at the plenary
> discussion too. Gonna include both unideal encodings in a piece of proposed
> Core Spec edit, in a revised document.
>

So I assume the plan is to include this in the Core Spec edits along with
the planned ones corresponding to L2/19-086R (chillu conjuncts) and
L2/18-346 (general historical characters). Please keep me posted. Thanks!
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20191010/5e9f2efc/attachment.html>

From unicode at unicode.org  Thu Oct 10 16:54:35 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Thu, 10 Oct 2019 22:54:35 +0100
Subject: Pure Regular Expression Engines and Literal Clusters
In-Reply-To: <20191008152534.2068db6c@JRWUBU2>
References: <20191008152534.2068db6c@JRWUBU2>
Message-ID: <20191010225435.567382c6@JRWUBU2>

On Tue, 8 Oct 2019 15:25:34 +0100
Richard Wordingham via Unicode <unicode at unicode.org> wrote:

> An example UTS#18 gives for matching a literal cluster can be
> simplified to, in its notation:
> 
> [c \q{ch}]
> 
> This is interpreted as 'match against "ch" if possible, otherwise
> against "c".  Thus the strings "ca" and "cha" would both match the
> expression
> 
> [c \q{ch}]a
> 
> while "chh" but not "ch" would match against
> 
> [c \q{ch}]h
> 
> Or have I got this wrong?

After comparing this with the Perl behaviour of /(:?ch|c)
and /(:?ch|c)h, I've come to the conclusion that I've got the
interpretation wrong.  The former may match "ch" or "c", and I
conclude that the only funny meaning of \q is to indicate a preference
for the sequence of two characters - if the engine yields all matches,
it has no meaning.

This greatly simplifies matters.

Richard.

From unicode at unicode.org  Thu Oct 10 17:23:00 2019
From: unicode at unicode.org (Markus Scherer via Unicode)
Date: Thu, 10 Oct 2019 15:23:00 -0700
Subject: Pure Regular Expression Engines and Literal Clusters
In-Reply-To: <20191008152534.2068db6c@JRWUBU2>
References: <20191008152534.2068db6c@JRWUBU2>
Message-ID: <CAN49p6r2EiCZLoOV1DngbM83d5Vzu7X0dOcjVnXf-CJK4+aR6Q@mail.gmail.com>

On Tue, Oct 8, 2019 at 7:28 AM Richard Wordingham via Unicode <
unicode at unicode.org> wrote:

> An example UTS#18 gives for matching a literal cluster can be simplified
> to, in its notation:
>
> [c \q{ch}]
>
> This is interpreted as 'match against "ch" if possible, otherwise
> against "c".  Thus the strings "ca" and "cha" would both match the
> expression
>
> [c \q{ch}]a
>
> while "chh" but not "ch" would match against
>
> [c \q{ch}]h
>

Right. We just independently discussed this today in the UTC meeting,
connected with the "properties of strings" discussion in the proposed
update.

[c \q{ch}]h should work like (ch|c)h. Note that the order matters in the
alternation -- so this works equivalently if longer strings are sorted
first.

May I correctly argue instead that matching against literal clusters
> would be satisfied by instead supporting, for this example, the regular
> subexpression "(c|ch)" or the UnicodeSet expression "[c{ch}]"?
>

ICU UnicodeSet [c{ch}] is equivalent to UTS #18 [c\q{ch}].

ICU's UnicodeSet syntax is simpler, the UTS #18 syntax is more
backward-compatible.

Best regards,
markus
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20191010/4c43d0de/attachment.html>

From unicode at unicode.org  Fri Oct 11 01:46:21 2019
From: unicode at unicode.org (=?utf-8?B?TWFydGluIEouIETDvHJzdA==?= via Unicode)
Date: Fri, 11 Oct 2019 06:46:21 +0000
Subject: Fwd: The Most Frequent Emoji
In-Reply-To: <5D9DF525.5070300@unicode.org>
References: <5D9DF525.5070300@unicode.org>
Message-ID: <fc9b6809-3a19-62d6-7413-0b71a4fea530@it.aoyama.ac.jp>

I had a look at the page with the frequencies. Many emoji didn't 
display, but that's my browser's problem. What was worse was that the 
sidebar and the stuff at the bottom was all looking weird. I hope this 
can be fixed.

Regards,   Martin.

-------- Forwarded Message --------
Subject: The Most Frequent Emoji
Date: Wed, 09 Oct 2019 07:56:37 -0700
From: announcements at unicode.org
Reply-To: root at unicode.org
To: announcements at unicode.org

Emoji Frequency ImageHow does the Unicode Consortium choose which new 
emoji to add? One important factor is data about how frequently the 
current emoji are used. Patterns of usage help to inform decisions about 
future emoji. The Consortium has been working to assemble this 
information and make it available to the public.

And the two most frequently used emoji in the world are...
?? and ??
The new Unicode Emoji Frequency 
<https://home.unicode.org/emoji/emoji-frequency> page shows a list of 
the Unicode v12.0 emoji ranked in order of how frequently they are used.

?The forecasted frequency of use is a key factor in determining whether 
to encode new emoji, and for that it is important to know the frequency 
of use of existing emoji,? said Mark Davis, President of the Unicode 
Consortium. ?Understanding how frequently emoji are used helps 
prioritize which categories to focus on and which emoji to add to the 
Standard.?

------------------------------------------------------------------------
/Over 136,000 characters are available for adoption 
<http://unicode.org/consortium/adopt-a-character.html>, to help the 
Unicode Consortium?s work on digitally disadvantaged languages./

[badge] <http://unicode.org/consortium/adopt-a-character.html>

http://blog.unicode.org/2019/10/the-most-frequent-emoji.html


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20191011/ef11eb6f/attachment.html>

From unicode at unicode.org  Fri Oct 11 04:14:41 2019
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Fri, 11 Oct 2019 02:14:41 -0700
Subject: Fwd: The Most Frequent Emoji
In-Reply-To: <fc9b6809-3a19-62d6-7413-0b71a4fea530@it.aoyama.ac.jp>
References: <5D9DF525.5070300@unicode.org>
 <fc9b6809-3a19-62d6-7413-0b71a4fea530@it.aoyama.ac.jp>
Message-ID: <a1a2c14c-d7da-c7a1-709b-1306715ec875@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20191011/7608d7ce/attachment.html>

From unicode at unicode.org  Fri Oct 11 05:39:56 2019
From: unicode at unicode.org (Elizabeth Mattijsen via Unicode)
Date: Fri, 11 Oct 2019 12:39:56 +0200
Subject: Pure Regular Expression Engines and Literal Clusters
In-Reply-To: <CAN49p6r2EiCZLoOV1DngbM83d5Vzu7X0dOcjVnXf-CJK4+aR6Q@mail.gmail.com>
References: <20191008152534.2068db6c@JRWUBU2>
 <CAN49p6r2EiCZLoOV1DngbM83d5Vzu7X0dOcjVnXf-CJK4+aR6Q@mail.gmail.com>
Message-ID: <33ADA35B-D882-4F39-8693-83B0C5F9796B@dijkmat.nl>

> On 11 Oct 2019, at 00:23, Markus Scherer via Unicode <unicode at unicode.org> wrote:
> 
> On Tue, Oct 8, 2019 at 7:28 AM Richard Wordingham via Unicode <unicode at unicode.org> wrote:
> An example UTS#18 gives for matching a literal cluster can be simplified
> to, in its notation:
> 
> [c \q{ch}]
> 
> This is interpreted as 'match against "ch" if possible, otherwise
> against "c".  Thus the strings "ca" and "cha" would both match the
> expression
> 
> [c \q{ch}]a
> 
> while "chh" but not "ch" would match against
> 
> [c \q{ch}]h
> 
> Right. We just independently discussed this today in the UTC meeting, connected with the "properties of strings" discussion in the proposed update.
> 
> [c \q{ch}]h should work like (ch|c)h. Note that the order matters in the alternation -- so this works equivalently if longer strings are sorted first.
> 
> May I correctly argue instead that matching against literal clusters
> would be satisfied by instead supporting, for this example, the regular
> subexpression "(c|ch)" or the UnicodeSet expression "[c{ch}]"?
> 
> ICU UnicodeSet [c{ch}] is equivalent to UTS #18 [c\q{ch}].
> 
> ICU's UnicodeSet syntax is simpler, the UTS #18 syntax is more backward-compatible.

Not quite following this discussion, but I got triggered by the use of Perl in this discussion.

In Perl 6 (which is a different language from Perl 5 altogether), regular expressions have been completely revamped.

In Perl 6, the use of "|" indicates alternatives using longest token matching (LTM):
   https://docs.perl6.org/language/regexes#index-entry-regex_|-Longest_alternation:_|

In Perl 6, the use of "||" indicates first matching alternative wins:
    https://docs.perl6.org/language/regexes#index-entry-regex_||-Alternation:_||

Furthermore, Perl 6 uses Normalization Form Grapheme for matching:
    https://docs.perl6.org/type/Cool#index-entry-Grapheme

Hope this has some relevance to this discussion / gives new viewpoints.


Elizabeth Mattijsen


From unicode at unicode.org  Fri Oct 11 06:35:07 2019
From: unicode at unicode.org (Fred Brennan via Unicode)
Date: Fri, 11 Oct 2019 19:35:07 +0800
Subject: Will TAGALOG LETTER RA, currently in the pipeline,
 be in the next version of Unicode?
Message-ID: <1712595.prWeGnbi0f@pc>

Many users are asking me and I'm not sure of the answer (nor how to find it 
out).

The UTC approved it, so it will be in the next version of Unicode, right?

We sure hope so...it is a character needed to write a script in current use. 
Although only a minority of people care about it, that minority is dedicated!

Best,
Fred Brennan


From unicode at unicode.org  Fri Oct 11 11:50:16 2019
From: unicode at unicode.org (Markus Scherer via Unicode)
Date: Fri, 11 Oct 2019 09:50:16 -0700
Subject: Will TAGALOG LETTER RA, currently in the pipeline, be in the next
 version of Unicode?
In-Reply-To: <1712595.prWeGnbi0f@pc>
References: <1712595.prWeGnbi0f@pc>
Message-ID: <CAN49p6p5hjEQmA465ZDU879yp4HGBeqjZeNYP-GMsD=9xcvg0A@mail.gmail.com>

On Fri, Oct 11, 2019 at 4:37 AM Fred Brennan via Unicode <
unicode at unicode.org> wrote:

> Many users are asking me and I'm not sure of the answer (nor how to find
> it
> out).
>

You can find out by looking at the data files that are being developed for
Unicode 13.
Look at the latest UnicodeData.txt in
https://www.unicode.org/Public/13.0.0/ucd/

I don't see a TAGALOG LETTER RA there.

DerivedAge.txt there shows Tagalog characters only from Unicode 3.2.

The next place to check would be the pipeline page:
https://www.unicode.org/alloc/Pipeline.html

It shows TAGALOG LETTER RA in the section "Characters Accepted or In Ballot
for Future Versions".
UTC accepted it just in July of this year, but it's not yet in ISO ballot.

If all goes well, it could go into Unicode 14, March 2021.

Best regards,
markus
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20191011/746e055e/attachment.html>

From unicode at unicode.org  Fri Oct 11 12:17:19 2019
From: unicode at unicode.org (Ken Whistler via Unicode)
Date: Fri, 11 Oct 2019 10:17:19 -0700
Subject: Will TAGALOG LETTER RA, currently in the pipeline, be in the next
 version of Unicode?
In-Reply-To: <1712595.prWeGnbi0f@pc>
References: <1712595.prWeGnbi0f@pc>
Message-ID: <fe452b84-f852-5815-09b0-9caba767e19c@sonic.net>

Short answer is no.

The characters in the pipeline section labeled "Characters Accepted for 
Version 13.0" are what will be in the beta review for 13.0 (look for 
that sometime next month), and then eventually in the published Version 
13.0 next month:

https://www.unicode.org/alloc/Pipeline.html#planned_next_version

Characters listed in the "Characters for Future Versions" table:

https://www.unicode.org/alloc/Pipeline.html#future

are not yet targeted for any particular version. Many of them, including 
the Tagalog letter RA, will end up published in Unicode 14.0, but the 
detailed decisions on what makes it into Unicode 14.0 won't happen until 
sometime next summer.

Production of new versions of the Unicode Standard is a ponderous and 
lengthy operation, involving 4 UTC meetings, uncounted subcommittee 
meetings, dozens of specifications, hundreds of character properties, 
thousands of characters, hundreds of fonts, and intricate charts and QA 
process. It doesn't happen at the drop of a hat, which is why we 
schedule a full year for each new major release.

So, in general, no, you can *never* assume that once the UTC has just 
approved a new character that it will be in the next version of Unicode.

--Ken

On 10/11/2019 4:35 AM, Fred Brennan via Unicode wrote:
> Many users are asking me and I'm not sure of the answer (nor how to find it
> out).
>
> The UTC approved it, so it will be in the next version of Unicode, right?
>
> We sure hope so...it is a character needed to write a script in current use.
> Although only a minority of people care about it, that minority is dedicated!
>
> Best,
> Fred Brennan

From unicode at unicode.org  Fri Oct 11 12:21:51 2019
From: unicode at unicode.org (Ken Whistler via Unicode)
Date: Fri, 11 Oct 2019 10:21:51 -0700
Subject: Will TAGALOG LETTER RA, currently in the pipeline, be in the next
 version of Unicode?
In-Reply-To: <fe452b84-f852-5815-09b0-9caba767e19c@sonic.net>
References: <1712595.prWeGnbi0f@pc>
 <fe452b84-f852-5815-09b0-9caba767e19c@sonic.net>
Message-ID: <29d0aa85-88db-50da-ab40-411c898d407e@sonic.net>

Sorry about the typo there. I meant "the published Version 13.0 next March"

--Ken

On 10/11/2019 10:17 AM, Ken Whistler wrote:
> then eventually in the published Version 13.0 next month: 

From unicode at unicode.org  Fri Oct 11 12:35:45 2019
From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode)
Date: Fri, 11 Oct 2019 10:35:45 -0700
Subject: Unicode website glitches. (was The Most Frequent Emoji)
In-Reply-To: <fc9b6809-3a19-62d6-7413-0b71a4fea530@it.aoyama.ac.jp>
References: <5D9DF525.5070300@unicode.org>
 <fc9b6809-3a19-62d6-7413-0b71a4fea530@it.aoyama.ac.jp>
Message-ID: <CAJ2xs_Fb5TKkqAkhwApHFhBAwhWdfqCYmgs0nrXrNFMjmS7C0w@mail.gmail.com>

There was a caching problem with WordPress, where you have to do a hard
reload in some browsers. See if the problem still exists, and if the hard
reload fixes it. If anyone else is having trouble with that, let us know.

BTW, if you want to comment on the format as opposed to glitches, please
change the subject line.

Mark


On Thu, Oct 10, 2019 at 11:50 PM Martin J. D?rst via Unicode <
unicode at unicode.org> wrote:

> I had a look at the page with the frequencies. Many emoji didn't
> display, but that's my browser's problem. What was worse was that the
> sidebar and the stuff at the bottom was all looking weird. I hope this
> can be fixed.
>
> Regards,   Martin.
>
> -------- Forwarded Message --------
> Subject: The Most Frequent Emoji
> Date: Wed, 09 Oct 2019 07:56:37 -0700
> From: announcements at unicode.org
> Reply-To: root at unicode.org
> To: announcements at unicode.org
>
> Emoji Frequency ImageHow does the Unicode Consortium choose which new
> emoji to add? One important factor is data about how frequently the
> current emoji are used. Patterns of usage help to inform decisions about
> future emoji. The Consortium has been working to assemble this
> information and make it available to the public.
>
> And the two most frequently used emoji in the world are...
> ?? and ??
> The new Unicode Emoji Frequency
> <https://home.unicode.org/emoji/emoji-frequency> page shows a list of
> the Unicode v12.0 emoji ranked in order of how frequently they are used.
>
> ?The forecasted frequency of use is a key factor in determining whether
> to encode new emoji, and for that it is important to know the frequency
> of use of existing emoji,? said Mark Davis, President of the Unicode
> Consortium. ?Understanding how frequently emoji are used helps
> prioritize which categories to focus on and which emoji to add to the
> Standard.?
>
> ------------------------------------------------------------------------
> /Over 136,000 characters are available for adoption
> <http://unicode.org/consortium/adopt-a-character.html>, to help the
> Unicode Consortium?s work on digitally disadvantaged languages./
>
> [badge] <http://unicode.org/consortium/adopt-a-character.html>
>
> http://blog.unicode.org/2019/10/the-most-frequent-emoji.html
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20191011/efe3d492/attachment.html>

From unicode at unicode.org  Fri Oct 11 13:18:46 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Fri, 11 Oct 2019 19:18:46 +0100
Subject: Pure Regular Expression Engines and Literal Clusters
In-Reply-To: <33ADA35B-D882-4F39-8693-83B0C5F9796B@dijkmat.nl>
References: <20191008152534.2068db6c@JRWUBU2>
 <CAN49p6r2EiCZLoOV1DngbM83d5Vzu7X0dOcjVnXf-CJK4+aR6Q@mail.gmail.com>
 <33ADA35B-D882-4F39-8693-83B0C5F9796B@dijkmat.nl>
Message-ID: <20191011191846.39018209@JRWUBU2>

On Fri, 11 Oct 2019 12:39:56 +0200
Elizabeth Mattijsen via Unicode <unicode at unicode.org> wrote:

> Furthermore, Perl 6 uses Normalization Form Grapheme for matching:
>     https://docs.perl6.org/type/Cool#index-entry-Grapheme

I seriously doubt that a Thai considers each combination of consonant
(44), non-spacing vowel (7) and tone mark (4) a different character.
Moreover, if what you say is correct, perl6 will be useless for
finding such combinations in correctly spelled text.  The regular
expression

\p{insc=consonant}\p{insc=vowel_dependent}\p{insc=tone_mark}

would find only misspellings because in correct Thai spelling, matching
sequences constitute grapheme clusters.  I trust perl6 will actually
continue to support analyses of strings as sequences of codepoints. 

Richard.

From unicode at unicode.org  Fri Oct 11 14:01:58 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Fri, 11 Oct 2019 20:01:58 +0100
Subject: Pure Regular Expression Engines and Literal Clusters
In-Reply-To: <CAN49p6r2EiCZLoOV1DngbM83d5Vzu7X0dOcjVnXf-CJK4+aR6Q@mail.gmail.com>
References: <20191008152534.2068db6c@JRWUBU2>
 <CAN49p6r2EiCZLoOV1DngbM83d5Vzu7X0dOcjVnXf-CJK4+aR6Q@mail.gmail.com>
Message-ID: <20191011200158.41a948f4@JRWUBU2>

On Thu, 10 Oct 2019 15:23:00 -0700
Markus Scherer via Unicode <unicode at unicode.org> wrote:

> [c \q{ch}]h should work like (ch|c)h. Note that the order matters in
> the alternation -- so this works equivalently if longer strings are
> sorted first.

Thanks for answering the question.

Does conformance UTS#18 to level 2 mandate the choice of matching
substring? This would appear to prohibit compliance to POSIX rules,
where the length of overall match counts.

Richard.

From unicode at unicode.org  Fri Oct 11 16:35:33 2019
From: unicode at unicode.org (Markus Scherer via Unicode)
Date: Fri, 11 Oct 2019 14:35:33 -0700
Subject: Pure Regular Expression Engines and Literal Clusters
In-Reply-To: <20191011200158.41a948f4@JRWUBU2>
References: <20191008152534.2068db6c@JRWUBU2>
 <CAN49p6r2EiCZLoOV1DngbM83d5Vzu7X0dOcjVnXf-CJK4+aR6Q@mail.gmail.com>
 <20191011200158.41a948f4@JRWUBU2>
Message-ID: <CAN49p6r=D68Tzy1bzaXcvFVpv+RaXQG1GZnozBd4og=XkeRaDQ@mail.gmail.com>

On Fri, Oct 11, 2019 at 12:05 PM Richard Wordingham via Unicode <
unicode at unicode.org> wrote:

> On Thu, 10 Oct 2019 15:23:00 -0700
> Markus Scherer via Unicode <unicode at unicode.org> wrote:
>
> > [c \q{ch}]h should work like (ch|c)h. Note that the order matters in
> > the alternation -- so this works equivalently if longer strings are
> > sorted first.
>
> Thanks for answering the question.
>
> Does conformance UTS#18 to level 2 mandate the choice of matching
> substring? This would appear to prohibit compliance to POSIX rules,
> where the length of overall match counts.
>

We just had a discussion this week. Mark will revise the proposed update.

The idea is currently to specify properties-of-strings (and I think a
range/class with "clusters") behaving like an alternation where the longest
strings are first, and leaving it up to the regex engine exactly what that
means.

In general, UTS #18 offers a lot of things that regex implementers may or
may not adopt.

If you have specific ideas, please send them as PRI feedback.
(Discussion on the list is good and useful, but does not guarantee that it
gets looked at when it counts.)

Best regards,
markus
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20191011/170c7911/attachment.html>

From unicode at unicode.org  Fri Oct 11 17:04:54 2019
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Fri, 11 Oct 2019 15:04:54 -0700
Subject: Will TAGALOG LETTER RA, currently in the pipeline, be in the next
 version of =?UTF-8?Q?Unicode=3F?=
Message-ID: <20191011150454.665a7a7059d7ee80bb4d670165c8327d.151fa3f752.wbe@email03.godaddy.com>

Ken Whistler wrote:
 
> So, in general, no, you can *never* assume that once the UTC has just
> approved a new character that it will be in the next version of
> Unicode.
 
I got quite a few messages like this when UTC approved the legacy
computing characters in L2/19-025 last January. Great, that means I'll
be able to start using and exchanging them in March, when Unicode 12.1
is released, right? Uh, no:
 
1. What Ken said above.
 
2. Unicode 12.1 was always just about the Reiwa sign.
 
3. Even when 13 comes out, fonts won't be immediately and magically
updated to include them.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org
 

From unicode at unicode.org  Fri Oct 11 17:28:01 2019
From: unicode at unicode.org (Rebecca Bettencourt via Unicode)
Date: Fri, 11 Oct 2019 15:28:01 -0700
Subject: Will TAGALOG LETTER RA, currently in the pipeline, be in the next
 version of Unicode?
In-Reply-To: <20191011150454.665a7a7059d7ee80bb4d670165c8327d.151fa3f752.wbe@email03.godaddy.com>
References: <20191011150454.665a7a7059d7ee80bb4d670165c8327d.151fa3f752.wbe@email03.godaddy.com>
Message-ID: <CAH=y87bviXpfiFfMmQ2nVnpqDR_66fc_vMQZZH8NQggPq18Ocw@mail.gmail.com>

> 3. Even when 13 comes out, fonts won't be immediately and magically
> updated to include them.


In this case, though, several fonts actually already include TAGALOG LETTER
RA. :)

"This spot, U+170D, has become a *de facto* standard among *baybayin*
writers in the Philippines and the Filipino diaspora. Several modern fonts,
including the one that appears on Philippine currency to write the word
*Pilipino*, use U+170D as a ?ra?. (See ?0.13) Software, if it can output
['ra'], uses U+170D. (See ?0.14) Documents online, if they include ['ra'],
most often have it encoded as U+170D." (L2/19-258R, page 6)


This proposal was special in that it was asking the Unicode Consortium to
recognize a character that was already being used unofficially, so that
organizations like the Google Noto team who are sticklers for Unicode
compliance would include it. :)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20191011/5caabf06/attachment.html>

From unicode at unicode.org  Fri Oct 11 19:05:23 2019
From: unicode at unicode.org (=?utf-8?B?TWFydGluIEouIETDvHJzdA==?= via Unicode)
Date: Sat, 12 Oct 2019 00:05:23 +0000
Subject: Unicode website glitches. (was The Most Frequent Emoji)
In-Reply-To: <CAJ2xs_Fb5TKkqAkhwApHFhBAwhWdfqCYmgs0nrXrNFMjmS7C0w@mail.gmail.com>
References: <5D9DF525.5070300@unicode.org>
 <fc9b6809-3a19-62d6-7413-0b71a4fea530@it.aoyama.ac.jp>
 <CAJ2xs_Fb5TKkqAkhwApHFhBAwhWdfqCYmgs0nrXrNFMjmS7C0w@mail.gmail.com>
Message-ID: <ca2cae58-0b00-bf98-eee9-280dae35d6a1@it.aoyama.ac.jp>

Hello Mark,

On 2019/10/12 02:35, Mark Davis ?? wrote:
> There was a caching problem with WordPress, where you have to do a hard
> reload in some browsers. See if the problem still exists, and if the hard
> reload fixes it. If anyone else is having trouble with that, let us know.

I can confirm that a hard reload fixed the problem.


> BTW, if you want to comment on the format as opposed to glitches, please
> change the subject line.

I think it's less the format and much more the split personality of the 
Unicode Web site(s?) that I have problems with.

Regards,   Martin.

> Mark
> 
> 
> On Thu, Oct 10, 2019 at 11:50 PM Martin J. D?rst via Unicode <
> unicode at unicode.org> wrote:
> 
>> I had a look at the page with the frequencies. Many emoji didn't
>> display, but that's my browser's problem. What was worse was that the
>> sidebar and the stuff at the bottom was all looking weird. I hope this
>> can be fixed.
>>
>> Regards,   Martin.

>> The new Unicode Emoji Frequency
>> <https://home.unicode.org/emoji/emoji-frequency> page shows a list of
>> the Unicode v12.0 emoji ranked in order of how frequently they are used.


From unicode at unicode.org  Fri Oct 11 20:02:12 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sat, 12 Oct 2019 02:02:12 +0100
Subject: Pure Regular Expression Engines and Literal Clusters
In-Reply-To: <CAN49p6r=D68Tzy1bzaXcvFVpv+RaXQG1GZnozBd4og=XkeRaDQ@mail.gmail.com>
References: <20191008152534.2068db6c@JRWUBU2>
 <CAN49p6r2EiCZLoOV1DngbM83d5Vzu7X0dOcjVnXf-CJK4+aR6Q@mail.gmail.com>
 <20191011200158.41a948f4@JRWUBU2>
 <CAN49p6r=D68Tzy1bzaXcvFVpv+RaXQG1GZnozBd4og=XkeRaDQ@mail.gmail.com>
Message-ID: <20191012020212.6db1634a@JRWUBU2>

On Fri, 11 Oct 2019 14:35:33 -0700
Markus Scherer via Unicode <unicode at unicode.org> wrote:

> > > [c \q{ch}]h should work like (ch|c)h. Note that the order matters
> > > in the alternation -- so this works equivalently if longer
> > > strings are sorted first.  

> > Does conformance UTS#18 to level 2 mandate the choice of matching
> > substring? This would appear to prohibit compliance to POSIX rules,
> > where the length of overall match counts.

> The idea is currently to specify properties-of-strings (and I think a
> range/class with "clusters") behaving like an alternation where the
> longest strings are first, and leaving it up to the regex engine
> exactly what that means.
> 
> In general, UTS #18 offers a lot of things that regex implementers
> may or may not adopt.

> If you have specific ideas, please send them as PRI feedback.
> (Discussion on the list is good and useful, but does not guarantee
> that it gets looked at when it counts.)

You claimed the order of alternatives mattered.  That is an important
issue for anyone rash enough to think that the standard is fit to be
used as a specification.

I'm still not entirely clear what a regular expression /[\u00c1\u00e1]/
can mean.  If the system uses NFD to simulate Unicode conformance,
shall the expression then be converted to /[{A\u0301}{a\u0301}]/?  Or
should it simply fail to match any NFD string?  I've been implementing
the view that all or none of the canonical equivalents of a string
match.  (I therefore support mildly discontiguous substrings, though I
don't support splitting undecomposable characters.)

Richard.

From unicode at unicode.org  Fri Oct 11 20:37:18 2019
From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode)
Date: Fri, 11 Oct 2019 18:37:18 -0700
Subject: Pure Regular Expression Engines and Literal Clusters
In-Reply-To: <20191012020212.6db1634a@JRWUBU2>
References: <20191008152534.2068db6c@JRWUBU2>
 <CAN49p6r2EiCZLoOV1DngbM83d5Vzu7X0dOcjVnXf-CJK4+aR6Q@mail.gmail.com>
 <20191011200158.41a948f4@JRWUBU2>
 <CAN49p6r=D68Tzy1bzaXcvFVpv+RaXQG1GZnozBd4og=XkeRaDQ@mail.gmail.com>
 <20191012020212.6db1634a@JRWUBU2>
Message-ID: <CAJ2xs_GDJMnqTmK9j3RDeygN=ka1FepOikTewmv-nNDewarnCw@mail.gmail.com>

>
> You claimed the order of alternatives mattered.  That is an important
> issue for anyone rash enough to think that the standard is fit to be
> used as a specification.
>

Regex engines differ in how they handle the interpretation of the matching
of alternatives, and it is not possible for us to wave a magic wand to
change them.

What we can do is specify how the interpretation of the properties of
strings works. By specifying that they behave like alternation AND adding
the extra constraint of having longer first, we minimize the differences
across regex engines.

>
> I'm still not entirely clear what a regular expression /[\u00c1\u00e1]/
> can mean.  If the system uses NFD to simulate Unicode conformance,
> shall the expression then be converted to /[{A\u0301}{a\u0301}]/?  Or
> should it simply fail to match any NFD string?  I've been implementing
> the view that all or none of the canonical equivalents of a string
> match.  (I therefore support mildly discontiguous substrings, though I
> don't support splitting undecomposable characters.)
>

We came to the conclusion years ago that regex engines cannot reasonably be
expected to implement canonical equivalence; they are really working at a
lower level. So you see the advice we give at
http://unicode.org/reports/tr18/#Canonical_Equivalents. (Again, no magic
wand.)


> Richard.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20191011/b320e761/attachment.html>

From unicode at unicode.org  Sat Oct 12 03:16:30 2019
From: unicode at unicode.org (=?utf-8?Q?Daniel_B=C3=BCnzli?= via Unicode)
Date: Sat, 12 Oct 2019 10:16:30 +0200
Subject: Website format (was Re: Unicode website glitches. (was The Most
 Frequent Emoji))
In-Reply-To: <ca2cae58-0b00-bf98-eee9-280dae35d6a1@it.aoyama.ac.jp>
References: <5D9DF525.5070300@unicode.org>
 <fc9b6809-3a19-62d6-7413-0b71a4fea530@it.aoyama.ac.jp>
 <CAJ2xs_Fb5TKkqAkhwApHFhBAwhWdfqCYmgs0nrXrNFMjmS7C0w@mail.gmail.com>
 <ca2cae58-0b00-bf98-eee9-280dae35d6a1@it.aoyama.ac.jp>
Message-ID: <etPan.5da18bde.3a5cfa82.14e@erratique.ch>

On 12 October 2019 at 02:05:23, Martin J. D?rst via Unicode (unicode at unicode.org) wrote:

> I think it's less the format and much more the split personality of the
> Unicode Web site(s?) that I have problems with.

I also do.?

One thing that is particulary annoying is the fact that the "home" link on the "technical" (unchanged) subpart of the website gets back to the "marketing" home page which is particularly inefficient (the links you are looking for are not above the fold on a laptop screen) and confusing (the whole layout shifts and the theme changes) for perusing the technical part of the website.

With all due respect for the work that has been done on the new website I think that the new structure significantly decreased the usability of the website for technical users.

Best,?

Daniel


From unicode at unicode.org  Sat Oct 12 03:23:36 2019
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Sat, 12 Oct 2019 01:23:36 -0700
Subject: Website format (was Re: Unicode website glitches. (was The Most
 Frequent Emoji))
In-Reply-To: <etPan.5da18bde.3a5cfa82.14e@erratique.ch>
References: <5D9DF525.5070300@unicode.org>
 <fc9b6809-3a19-62d6-7413-0b71a4fea530@it.aoyama.ac.jp>
 <CAJ2xs_Fb5TKkqAkhwApHFhBAwhWdfqCYmgs0nrXrNFMjmS7C0w@mail.gmail.com>
 <ca2cae58-0b00-bf98-eee9-280dae35d6a1@it.aoyama.ac.jp>
 <etPan.5da18bde.3a5cfa82.14e@erratique.ch>
Message-ID: <9222b376-02cf-2530-5777-21120a4917da@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20191012/76f460a6/attachment.html>

From unicode at unicode.org  Sat Oct 12 05:15:38 2019
From: unicode at unicode.org (Fred Brennan via Unicode)
Date: Sat, 12 Oct 2019 18:15:38 +0800
Subject: Will TAGALOG LETTER RA, currently in the pipeline,
 be in the next version of Unicode?
In-Reply-To: <CAH=y87bviXpfiFfMmQ2nVnpqDR_66fc_vMQZZH8NQggPq18Ocw@mail.gmail.com>
References: <20191011150454.665a7a7059d7ee80bb4d670165c8327d.151fa3f752.wbe@email03.godaddy.com>
 <CAH=y87bviXpfiFfMmQ2nVnpqDR_66fc_vMQZZH8NQggPq18Ocw@mail.gmail.com>
Message-ID: <4306889.x27fmyrm67@pc>

On Saturday, October 12, 2019 6:28:01 AM PST Rebecca Bettencourt via Unicode 
wrote:
> This proposal was special in that it was asking the Unicode Consortium to
> recognize a character that was already being used unofficially, so that
> organizations like the Google Noto team who are sticklers for Unicode
> compliance would include it. :)

Indeed - it is extremely unfortunate that users will need to wait until 
2021(!) to get it into Unicode so Google will finally add it to the Noto 
fonts.

There seems to be no conscionable reason for such a long delay after the 
approval.

If that's just how things are done, fine, I certainly can't change the whole 
system. But imagine if you had to wait two years to even have a chance of 
using a letter you desperately need to write your language? Imagine if the 
letter "Q" was unencoded and Noto refused to add it for two more years?


From unicode at unicode.org  Sat Oct 12 07:17:55 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sat, 12 Oct 2019 13:17:55 +0100
Subject: Pure Regular Expression Engines and Literal Clusters
In-Reply-To: <CAJ2xs_GDJMnqTmK9j3RDeygN=ka1FepOikTewmv-nNDewarnCw@mail.gmail.com>
References: <20191008152534.2068db6c@JRWUBU2>
 <CAN49p6r2EiCZLoOV1DngbM83d5Vzu7X0dOcjVnXf-CJK4+aR6Q@mail.gmail.com>
 <20191011200158.41a948f4@JRWUBU2>
 <CAN49p6r=D68Tzy1bzaXcvFVpv+RaXQG1GZnozBd4og=XkeRaDQ@mail.gmail.com>
 <20191012020212.6db1634a@JRWUBU2>
 <CAJ2xs_GDJMnqTmK9j3RDeygN=ka1FepOikTewmv-nNDewarnCw@mail.gmail.com>
Message-ID: <20191012131755.7749a622@JRWUBU2>

On Fri, 11 Oct 2019 18:37:18 -0700
Mark Davis ?? via Unicode <unicode at unicode.org> wrote:

> >
> > You claimed the order of alternatives mattered.  That is an
> > important issue for anyone rash enough to think that the standard
> > is fit to be used as a specification.
> >  
> 
> Regex engines differ in how they handle the interpretation of the
> matching of alternatives, and it is not possible for us to wave a
> magic wand to change them.

But you are close to waving a truncheon to deprecate some of them.  And
even if you do not wave the truncheon, you will provide other people a
stick to beat them with.

> What we can do is specify how the interpretation of the properties of
> strings works. By specifying that they behave like alternation AND
> adding the extra constraint of having longer first, we minimize the
> differences across regex engines.

But remember that 'having longer first' is meaningless for a
non-deterministic finite automaton that does a single pass through the
string to be searched.

> > I'm still not entirely clear what a regular
> > expression /[\u00c1\u00e1]/ can mean.  If the system uses NFD to
> > simulate Unicode conformance, shall the expression then be
> > converted to /[{A\u0301}{a\u0301}]/?  Or should it simply fail to
> > match any NFD string?  I've been implementing the view that all or
> > none of the canonical equivalents of a string match.  (I therefore
> > support mildly discontiguous substrings, though I don't support
> > splitting undecomposable characters.) 
> 
> We came to the conclusion years ago that regex engines cannot
> reasonably be expected to implement canonical equivalence; they are
> really working at a lower level.

So does a lot of text processing.  The issue should simply be that the
change is too complicated for straightforward implementation:

(1) One winds up with slightly discontiguous substrings: the
non-starters at the beginning and end may not be contiguous.

(2) If one does not work with NFD, one ends up with parts of characters
in substrings.

(3) If one does not work with NFD (thereby formally avoiding the
issue of Unicode equivalence), replacing a non-starter by a character
of a different ccc is in general not a Unicode-compliant process.  (This
avoidance technique can be necessary for the Unicode Collation
Algorithm.)

(4) The algorithm for recognising concatenation and iteration (more
precisely, their closures under canonical equivalence) need to be
significantly rewritten.  One needs to be careful with optimisation -
some approaches could lead to reducing an FSM with over 2^54 states.

The issue of concatenation and iteration is largely solved in the
theory of traces and regular expressions, though there is still the
issue of when the iteration (Kleene star) of a regular expression (for
traces) is itself regular.  In the literature, this issue is called the
'star problem'.  One practical answer is that the Kleene star is itself
regular if it is generated from the set of strings matching the regular
expression that either contain NFD non-starters or all of whose
characters have the same ccc.  An unchecked requirement that
Kleene stars all be of this form would probably not be too great
a problem - one could probably dress this up by 'only fully supporting
Kleene star that is the same as the "concurrent star"'. Another one is
that recognition algorithms do not need to restrict themselves to
*regular* expressions - back references are not 'regular' either.

/\u0F73*/ is probably the simplest example of a non-regular Kleene star
in the Unicode strings under canonical equivalence.  (That character is
a problem character for ICU collation.)
However, /[[:Tibetan:]&[:insc=vowel_dependent:]]*/ is regular, as
removing U+0F73 from the Unicode set does not change its iteration.
Contrariwise, there might be a formal issue with giving <U+0F73 TIBETAN
VOWEL SIGN II, U+0F73, U+0F73> preference over <U+0F7A TIBETAN
VOWEL SIGN E, U+0F7A, U+0F7A> if one used the iteration algorithm for
regular-only Kleene star.

> So you see the advice we give at
> http://unicode.org/reports/tr18/#Canonical_Equivalents. (Again, no
> magic wand.)

So who's got tools for converting the USE's expression for a
'standard cluster' into a regular expression that catches all NFD
equivalents of the original expression?  There may be perfection issues
- the star problem may be unsolved for sequences of Unicode strings
under canonical equivalence.  Annoyingly, I can't find any text but my
own that relates traces to Unicode!

The trick of converting strings to NFD before searching them is
certainly useful.  Even with an engine respecting canonical
equivalence, it cuts the 2^54 I mentioned down to 54, the
number of non-zero canonical combining classes currently in use.  Of
course, such a reduction is not fully consistent with the spirit of a
finite state machine.

Richard.


From unicode at unicode.org  Sat Oct 12 07:50:57 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sat, 12 Oct 2019 13:50:57 +0100
Subject: Will TAGALOG LETTER RA, currently in the pipeline, be in the
 next version of Unicode?
In-Reply-To: <4306889.x27fmyrm67@pc>
References: <20191011150454.665a7a7059d7ee80bb4d670165c8327d.151fa3f752.wbe@email03.godaddy.com>
 <CAH=y87bviXpfiFfMmQ2nVnpqDR_66fc_vMQZZH8NQggPq18Ocw@mail.gmail.com>
 <4306889.x27fmyrm67@pc>
Message-ID: <20191012135057.4fd93a51@JRWUBU2>

On Sat, 12 Oct 2019 18:15:38 +0800
Fred Brennan via Unicode <unicode at unicode.org> wrote:

> Indeed - it is extremely unfortunate that users will need to wait
> until 2021(!) to get it into Unicode so Google will finally add it to
> the Noto fonts.

> There seems to be no conscionable reason for such a long delay after
> the approval.

The UTC's accepting a character does not mean it will make it into
Unicode.  In the ISO process it may yet be rejected, renumbered or
renamed.  These things have certainly happen for new scripts.

Richard.

From unicode at unicode.org  Sat Oct 12 10:06:25 2019
From: unicode at unicode.org (Ken Whistler via Unicode)
Date: Sat, 12 Oct 2019 08:06:25 -0700
Subject: Will TAGALOG LETTER RA, currently in the pipeline, be in the next
 version of Unicode?
In-Reply-To: <4306889.x27fmyrm67@pc>
References: <20191011150454.665a7a7059d7ee80bb4d670165c8327d.151fa3f752.wbe@email03.godaddy.com>
 <CAH=y87bviXpfiFfMmQ2nVnpqDR_66fc_vMQZZH8NQggPq18Ocw@mail.gmail.com>
 <4306889.x27fmyrm67@pc>
Message-ID: <ec0411a9-3577-eb0d-ecbe-0849294b0083@sonic.net>


On 10/12/2019 3:15 AM, Fred Brennan via Unicode wrote:
> There seems to be no conscionable reason for such a long delay after the
> approval.
>
> If that's just how things are done, fine, I certainly can't change the whole
> system. But imagine if you had to wait two years to even have a chance of
> using a letter you desperately need to write your language? Imagine if the
> letter "Q" was unencoded and Noto refused to add it for two more years?

Well, as long as we are imagining things, then consider a scenario where 
the UTC is presented a proposal for encoding a writing system which is 
reported as an historic artifact of the 18th century, "fallen out of 
normal use", yet encodes it anyway based on the proposal provided in 1999:

https://www.unicode.org/L2/L1999/n1933.pdf

and publishes it in Unicode 3.2 in 2002:

https://www.unicode.org/standard/supported.html

Then imagine that a community works to revive use of that script (now 
known as Baybayin) and extends character use in it based on similar 
characters in related, more contemporaneous scripts, but that the first 
time the UTC actually formally hears about that extension is on July 18, 
2019:

https://www.unicode.org/L2/L2019/19258r-baybayin-ra.pdf

And then imagine that despite a 17 year gap before this supposedly 
urgent defect in an encoding is reported to the UTC, that the UTC in 
fact approves encoding of U+170D TAGALOG LETTER RA at its very *first* 
opportunity, eight days later, on July 26, 2019. Further imagine that 
the UTC immediately publishes what amounts to a "letter of intent" to 
publish this character when it can:

https://www.unicode.org/alloc/Pipeline.html#future

It may then be understandable that some UTC participants might be 
puzzled to be accused of unconscionable delays in this case.

I understand the frustration that you are expressing, but it simply 
isn't feasible for every proposal's advocates to get their particular 
candidates pushed to the front of the line for publication. Unicode 13.0 
is creaking down the track towards its March 10, 2020 publication, but 
it already is contending with 5930 new characters (as well as additional 
emoji sequences beyond that), every one of which was approved by the UTC 
*prior* to July 26, 2019 and all of which are already in some advanced 
stage of ISO ballot consideration.

In the meantime, Baybayin users are inconvenienced, sure, but it is 
unlikely that the interim solutions will just break, because nobody is 
opposed to U+170D TAGALOG LETTER RA, and it is exceedingly unlikely that 
that code point would be moved before its eventual publication in the 
standard in March, 2021.

--Ken


From unicode at unicode.org  Sat Oct 12 13:28:02 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sat, 12 Oct 2019 19:28:02 +0100
Subject: Will TAGALOG LETTER RA, currently in the pipeline, be in the
 next version of Unicode?
In-Reply-To: <4306889.x27fmyrm67@pc>
References: <20191011150454.665a7a7059d7ee80bb4d670165c8327d.151fa3f752.wbe@email03.godaddy.com>
 <CAH=y87bviXpfiFfMmQ2nVnpqDR_66fc_vMQZZH8NQggPq18Ocw@mail.gmail.com>
 <4306889.x27fmyrm67@pc>
Message-ID: <20191012192802.01bd24b3@JRWUBU2>

On Sat, 12 Oct 2019 18:15:38 +0800
Fred Brennan via Unicode <unicode at unicode.org> wrote:

> Indeed - it is extremely unfortunate that users will need to wait
> until 2021(!) to get it into Unicode so Google will finally add it to
> the Noto fonts.

> If that's just how things are done, fine, I certainly can't change
> the whole system. But imagine if you had to wait two years to even
> have a chance of using a letter you desperately need to write your
> language?

Update me on what the problem with using the character *now* is.  If the
character is so important, why do you need to wait for Noto fonts?

I can imagine a much bigger problem - you could have the problem that
the Babayin script is 'supported'.  This could result in dotted circles
between RA and the combining marks.  It took ages between the addition
of U+0BB6 TAMI LETTER SHA to Unicode and obtaining a renderer that
acknowledged it as a Tamil letter.  You should be (or are you?)
badgering HarfBuzz to speculatively support it.  (There may be other
problems in the system.)

> Imagine if the letter "Q" was unencoded and Noto refused to
> add it for two more years?

On private PCs, having Noto support for a script can actually be an
unmitigated disaster.

Richard.

From unicode at unicode.org  Sat Oct 12 14:36:45 2019
From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode)
Date: Sat, 12 Oct 2019 21:36:45 +0200
Subject: Pure Regular Expression Engines and Literal Clusters
In-Reply-To: <20191012131755.7749a622@JRWUBU2>
References: <20191008152534.2068db6c@JRWUBU2>
 <CAN49p6r2EiCZLoOV1DngbM83d5Vzu7X0dOcjVnXf-CJK4+aR6Q@mail.gmail.com>
 <20191011200158.41a948f4@JRWUBU2>
 <CAN49p6r=D68Tzy1bzaXcvFVpv+RaXQG1GZnozBd4og=XkeRaDQ@mail.gmail.com>
 <20191012020212.6db1634a@JRWUBU2>
 <CAJ2xs_GDJMnqTmK9j3RDeygN=ka1FepOikTewmv-nNDewarnCw@mail.gmail.com>
 <20191012131755.7749a622@JRWUBU2>
Message-ID: <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com>


> On 12 Oct 2019, at 14:17, Richard Wordingham via Unicode <unicode at unicode.org> wrote:
> 
> But remember that 'having longer first' is meaningless for a
> non-deterministic finite automaton that does a single pass through the
> string to be searched.

It is possible to identify all submatches deterministically in linear time without backtracking ? I a made an algorithm for that.

A selection among different submatches then requires additional rules.


From unicode at unicode.org  Sat Oct 12 17:03:17 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sat, 12 Oct 2019 23:03:17 +0100
Subject: Pure Regular Expression Engines and Literal Clusters
In-Reply-To: <33ADA35B-D882-4F39-8693-83B0C5F9796B@dijkmat.nl>
References: <20191008152534.2068db6c@JRWUBU2>
 <CAN49p6r2EiCZLoOV1DngbM83d5Vzu7X0dOcjVnXf-CJK4+aR6Q@mail.gmail.com>
 <33ADA35B-D882-4F39-8693-83B0C5F9796B@dijkmat.nl>
Message-ID: <20191012230317.661f36c4@JRWUBU2>

On Fri, 11 Oct 2019 12:39:56 +0200
Elizabeth Mattijsen via Unicode <unicode at unicode.org> wrote:


> Furthermore, Perl 6 uses Normalization Form Grapheme for matching:
>     https://docs.perl6.org/type/Cool#index-entry-Grapheme

This approach does address the issue Mark Davis mentioned about regex
engines working at the wrong level.  Perhaps you can put my mind at
rest about whether it works at all with scripts that subordinate
vowels.

If I wanted to find the occurrences of the Pali word _pacati_ 'to cook'
in Latin script text using form NFG, I could use a Perl regular
expression like /\b(:?a|pa)?p[a?]c(:?\B.)*/.  (At least,

grep -P '\b(:?a|pa)?p[a?]c\p{Ll}*' file.txt

works on text in NFC.  I couldn't work out the command-line expression
to display a list of matches from Perl, and the PCRE \B is broken beyond
ASCII in GNU grep 2.25.)

How would I do such a search in an Indic script using form NFG?

The main issue is that the single character 'c' would have to expand to
a list of all but one of the Pali grapheme clusters whose initial
consonant transliterates to 'c'.  Have you a notation for such a class?

Regards,

Richard.


From unicode at unicode.org  Sat Oct 12 17:37:05 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sat, 12 Oct 2019 23:37:05 +0100
Subject: Pure Regular Expression Engines and Literal Clusters
In-Reply-To: <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com>
References: <20191008152534.2068db6c@JRWUBU2>
 <CAN49p6r2EiCZLoOV1DngbM83d5Vzu7X0dOcjVnXf-CJK4+aR6Q@mail.gmail.com>
 <20191011200158.41a948f4@JRWUBU2>
 <CAN49p6r=D68Tzy1bzaXcvFVpv+RaXQG1GZnozBd4og=XkeRaDQ@mail.gmail.com>
 <20191012020212.6db1634a@JRWUBU2>
 <CAJ2xs_GDJMnqTmK9j3RDeygN=ka1FepOikTewmv-nNDewarnCw@mail.gmail.com>
 <20191012131755.7749a622@JRWUBU2>
 <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com>
Message-ID: <20191012233705.52544fb9@JRWUBU2>

On Sat, 12 Oct 2019 21:36:45 +0200
Hans ?berg via Unicode <unicode at unicode.org> wrote:

> > On 12 Oct 2019, at 14:17, Richard Wordingham via Unicode
> > <unicode at unicode.org> wrote:
> > 
> > But remember that 'having longer first' is meaningless for a
> > non-deterministic finite automaton that does a single pass through
> > the string to be searched.  
> 
> It is possible to identify all submatches deterministically in linear
> time without backtracking ? I a made an algorithm for that.

That's impressive, as the number of possible submatches for a*(a*)a* is
quadratic in the string length.

> A selection among different submatches then requires additional rules.

Regards,

Richard.


From unicode at unicode.org  Sun Oct 13 03:04:34 2019
From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode)
Date: Sun, 13 Oct 2019 10:04:34 +0200
Subject: Pure Regular Expression Engines and Literal Clusters
In-Reply-To: <20191012233705.52544fb9@JRWUBU2>
References: <20191008152534.2068db6c@JRWUBU2>
 <CAN49p6r2EiCZLoOV1DngbM83d5Vzu7X0dOcjVnXf-CJK4+aR6Q@mail.gmail.com>
 <20191011200158.41a948f4@JRWUBU2>
 <CAN49p6r=D68Tzy1bzaXcvFVpv+RaXQG1GZnozBd4og=XkeRaDQ@mail.gmail.com>
 <20191012020212.6db1634a@JRWUBU2>
 <CAJ2xs_GDJMnqTmK9j3RDeygN=ka1FepOikTewmv-nNDewarnCw@mail.gmail.com>
 <20191012131755.7749a622@JRWUBU2>
 <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com>
 <20191012233705.52544fb9@JRWUBU2>
Message-ID: <C1F2BDD2-86E7-4643-BED8-B9298084469C@telia.com>


> On 13 Oct 2019, at 00:37, Richard Wordingham via Unicode <unicode at unicode.org> wrote:
> 
> On Sat, 12 Oct 2019 21:36:45 +0200
> Hans ?berg via Unicode <unicode at unicode.org> wrote:
> 
>>> On 12 Oct 2019, at 14:17, Richard Wordingham via Unicode
>>> <unicode at unicode.org> wrote:
>>> 
>>> But remember that 'having longer first' is meaningless for a
>>> non-deterministic finite automaton that does a single pass through
>>> the string to be searched.  
>> 
>> It is possible to identify all submatches deterministically in linear
>> time without backtracking ? I a made an algorithm for that.
> 
> That's impressive, as the number of possible submatches for a*(a*)a* is
> quadratic in the string length.

That is probably after the possibilities in the matching graph have been expanded, which can even be exponential. As an analogy, think of a polynomial product, I compute the product, not the expansion.


From unicode at unicode.org  Sun Oct 13 08:00:18 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 13 Oct 2019 14:00:18 +0100
Subject: Pure Regular Expression Engines and Literal Clusters
In-Reply-To: <C1F2BDD2-86E7-4643-BED8-B9298084469C@telia.com>
References: <20191008152534.2068db6c@JRWUBU2>
 <CAN49p6r2EiCZLoOV1DngbM83d5Vzu7X0dOcjVnXf-CJK4+aR6Q@mail.gmail.com>
 <20191011200158.41a948f4@JRWUBU2>
 <CAN49p6r=D68Tzy1bzaXcvFVpv+RaXQG1GZnozBd4og=XkeRaDQ@mail.gmail.com>
 <20191012020212.6db1634a@JRWUBU2>
 <CAJ2xs_GDJMnqTmK9j3RDeygN=ka1FepOikTewmv-nNDewarnCw@mail.gmail.com>
 <20191012131755.7749a622@JRWUBU2>
 <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com>
 <20191012233705.52544fb9@JRWUBU2>
 <C1F2BDD2-86E7-4643-BED8-B9298084469C@telia.com>
Message-ID: <20191013140018.5ea512bc@JRWUBU2>

On Sun, 13 Oct 2019 10:04:34 +0200
Hans ?berg via Unicode <unicode at unicode.org> wrote:

> > On 13 Oct 2019, at 00:37, Richard Wordingham via Unicode
> > <unicode at unicode.org> wrote:
> > 
> > On Sat, 12 Oct 2019 21:36:45 +0200
> > Hans ?berg via Unicode <unicode at unicode.org> wrote:
> >   
> >>> On 12 Oct 2019, at 14:17, Richard Wordingham via Unicode
> >>> <unicode at unicode.org> wrote:
> >>> 
> >>> But remember that 'having longer first' is meaningless for a
> >>> non-deterministic finite automaton that does a single pass through
> >>> the string to be searched.    
> >> 
> >> It is possible to identify all submatches deterministically in
> >> linear time without backtracking ? I a made an algorithm for
> >> that.  
> > 
> > That's impressive, as the number of possible submatches for
> > a*(a*)a* is quadratic in the string length.  
> 
> That is probably after the possibilities in the matching graph have
> been expanded, which can even be exponential. As an analogy, think of
> a polynomial product, I compute the product, not the expansion.

I'm now beginning to wonder what you are claiming. One thing one can
do without backtracking is to determine which capture groups capture
something, and which combinations of capturing or not occur.  That's
a straightforward extension of doing the overall 'recognition' in linear
time - at least, linear in length (n) of the searched string.  (I say
straightforward, but it would mess up my state naming algorithm.)  The
time can also depend on the complexity of the regular expression, which
can be bounded by the length (m) of the expression if working with mere
strings, giving time O(mn) if one doesn't undertake the worst case
O(2^m) task of converting the non-deterministic FSM to a deterministic
FSM.

Using m as a complexity measure for traces may be misleading, and I
think plain wrong; for moderate m, the complexity can easily go up as
fast as m^10, and I think higher powers are possible.  Strings
exercising the higher complexities are linguistically implausible.

Regards,

Richard.


From unicode at unicode.org  Sun Oct 13 08:29:04 2019
From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode)
Date: Sun, 13 Oct 2019 15:29:04 +0200
Subject: Pure Regular Expression Engines and Literal Clusters
In-Reply-To: <20191013140018.5ea512bc@JRWUBU2>
References: <20191008152534.2068db6c@JRWUBU2>
 <CAN49p6r2EiCZLoOV1DngbM83d5Vzu7X0dOcjVnXf-CJK4+aR6Q@mail.gmail.com>
 <20191011200158.41a948f4@JRWUBU2>
 <CAN49p6r=D68Tzy1bzaXcvFVpv+RaXQG1GZnozBd4og=XkeRaDQ@mail.gmail.com>
 <20191012020212.6db1634a@JRWUBU2>
 <CAJ2xs_GDJMnqTmK9j3RDeygN=ka1FepOikTewmv-nNDewarnCw@mail.gmail.com>
 <20191012131755.7749a622@JRWUBU2>
 <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com>
 <20191012233705.52544fb9@JRWUBU2>
 <C1F2BDD2-86E7-4643-BED8-B9298084469C@telia.com>
 <20191013140018.5ea512bc@JRWUBU2>
Message-ID: <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com>


> On 13 Oct 2019, at 15:00, Richard Wordingham via Unicode <unicode at unicode.org> wrote:
> 
>>> On Sat, 12 Oct 2019 21:36:45 +0200
>>> Hans ?berg via Unicode <unicode at unicode.org> wrote:
>>> 
>>>>> On 12 Oct 2019, at 14:17, Richard Wordingham via Unicode
>>>>> <unicode at unicode.org> wrote:
>>>>> 
>>>>> But remember that 'having longer first' is meaningless for a
>>>>> non-deterministic finite automaton that does a single pass through
>>>>> the string to be searched.    
>>>> 
>>>> It is possible to identify all submatches deterministically in
>>>> linear time without backtracking ? I a made an algorithm for
>>>> that.  
> 
> I'm now beginning to wonder what you are claiming.

I start with a NFA with no empty transitions and apply the subset DFA construction dynamically for a given string along with some reverse NFA-data that is enough to transverse backwards when a final state arrives. The result is a NFA where all transverses is a match of the string at that position. 


From unicode at unicode.org  Sun Oct 13 14:17:54 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 13 Oct 2019 20:17:54 +0100
Subject: Pure Regular Expression Engines and Literal Clusters
In-Reply-To: <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com>
References: <20191008152534.2068db6c@JRWUBU2>
 <CAN49p6r2EiCZLoOV1DngbM83d5Vzu7X0dOcjVnXf-CJK4+aR6Q@mail.gmail.com>
 <20191011200158.41a948f4@JRWUBU2>
 <CAN49p6r=D68Tzy1bzaXcvFVpv+RaXQG1GZnozBd4og=XkeRaDQ@mail.gmail.com>
 <20191012020212.6db1634a@JRWUBU2>
 <CAJ2xs_GDJMnqTmK9j3RDeygN=ka1FepOikTewmv-nNDewarnCw@mail.gmail.com>
 <20191012131755.7749a622@JRWUBU2>
 <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com>
 <20191012233705.52544fb9@JRWUBU2>
 <C1F2BDD2-86E7-4643-BED8-B9298084469C@telia.com>
 <20191013140018.5ea512bc@JRWUBU2>
 <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com>
Message-ID: <20191013201754.6597fdd0@JRWUBU2>

On Sun, 13 Oct 2019 15:29:04 +0200
Hans ?berg via Unicode <unicode at unicode.org> wrote:

> > On 13 Oct 2019, at 15:00, Richard Wordingham via Unicode
> > I'm now beginning to wonder what you are claiming.  

> I start with a NFA with no empty transitions and apply the subset DFA
> construction dynamically for a given string along with some reverse
> NFA-data that is enough to transverse backwards when a final state
> arrives. The result is a NFA where all transverses is a match of the
> string at that position.

And then the speed comparison depends on how quickly one can extract
the match information required from that data structure.

Incidentally, at least some of the sizes and timings I gave seem to be
wrong even for strings.  They won't work with numeric quantifiers, as
in /[ab]{0,20}[ac]{10,20}[ad]{0,20}e/.

One gets lesser issues in quantifying complexity if one wants "?" to
match \p{Lu} when working in NFD - potentially a different state for
each prefix of the capital letters.  (It's also the case except for
UTF-32 if characters are treated as sequences of code units.)  Perhaps
'upper case letter that Unicode happens to have encoded as a single
character' isn't a concept that regular expressions need to support
concisely. What's needed is to have a set somewhere between
[\p{Lu}&\p{isNFD}] and [\p{Lu}],though perhaps it should be extended to
include "ff" - there are English surnames like "ffrench".

Regards,

Richard.


From unicode at unicode.org  Sun Oct 13 15:14:10 2019
From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode)
Date: Sun, 13 Oct 2019 22:14:10 +0200
Subject: Pure Regular Expression Engines and Literal Clusters
In-Reply-To: <20191013201754.6597fdd0@JRWUBU2>
References: <20191008152534.2068db6c@JRWUBU2>
 <CAN49p6r2EiCZLoOV1DngbM83d5Vzu7X0dOcjVnXf-CJK4+aR6Q@mail.gmail.com>
 <20191011200158.41a948f4@JRWUBU2>
 <CAN49p6r=D68Tzy1bzaXcvFVpv+RaXQG1GZnozBd4og=XkeRaDQ@mail.gmail.com>
 <20191012020212.6db1634a@JRWUBU2>
 <CAJ2xs_GDJMnqTmK9j3RDeygN=ka1FepOikTewmv-nNDewarnCw@mail.gmail.com>
 <20191012131755.7749a622@JRWUBU2>
 <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com>
 <20191012233705.52544fb9@JRWUBU2>
 <C1F2BDD2-86E7-4643-BED8-B9298084469C@telia.com>
 <20191013140018.5ea512bc@JRWUBU2>
 <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com>
 <20191013201754.6597fdd0@JRWUBU2>
Message-ID: <A5305A26-F8B8-43B6-96B8-474858B709A5@telia.com>


> On 13 Oct 2019, at 21:17, Richard Wordingham via Unicode <unicode at unicode.org> wrote:
> 
> On Sun, 13 Oct 2019 15:29:04 +0200
> Hans ?berg via Unicode <unicode at unicode.org> wrote:
> 
>>> On 13 Oct 2019, at 15:00, Richard Wordingham via Unicode
>>> I'm now beginning to wonder what you are claiming.  
> 
>> I start with a NFA with no empty transitions and apply the subset DFA
>> construction dynamically for a given string along with some reverse
>> NFA-data that is enough to transverse backwards when a final state
>> arrives. The result is a NFA where all transverses is a match of the
>> string at that position.
> 
> And then the speed comparison depends on how quickly one can extract
> the match information required from that data structure.

Yes. For example, one should match the saved DFA in constant time, if matched as dynamic sets which is linear in set size, then one can get quadratic time complexity in string size.

Even though one can iterate through each match NFA in linear time, it could have say two choices at each character position each leading to the next, which would give an exponential size relative the string length.

Normally one is not interested in all matches, this is the disambiguation rules that do that.

> Incidentally, at least some of the sizes and timings I gave seem to be
> wrong even for strings.  They won't work with numeric quantifiers, as
> in /[ab]{0,20}[ac]{10,20}[ad]{0,20}e/.

For those, one normally implements a loop iteration. I did not that. I mentioned this method to Tim Shen on the libstdc++ list, so perhaps he might have implemented something.

> One gets lesser issues in quantifying complexity if one wants "?" to
> match \p{Lu} when working in NFD - potentially a different state for
> each prefix of the capital letters.  (It's also the case except for
> UTF-32 if characters are treated as sequences of code units.)  Perhaps
> 'upper case letter that Unicode happens to have encoded as a single
> character' isn't a concept that regular expressions need to support
> concisely. What's needed is to have a set somewhere between
> [\p{Lu}&\p{isNFD}] and [\p{Lu}],though perhaps it should be extended to
> include "ff" - there are English surnames like "ffrench?.

I made some C++ templates that translate Unicode code point character classes into UTF-8/32 regular expressions. So anything that can be reduced to actual regular expressions would work. 


From unicode at unicode.org  Sun Oct 13 16:54:12 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 13 Oct 2019 22:54:12 +0100
Subject: Pure Regular Expression Engines and Literal Clusters
In-Reply-To: <A5305A26-F8B8-43B6-96B8-474858B709A5@telia.com>
References: <20191008152534.2068db6c@JRWUBU2>
 <CAN49p6r2EiCZLoOV1DngbM83d5Vzu7X0dOcjVnXf-CJK4+aR6Q@mail.gmail.com>
 <20191011200158.41a948f4@JRWUBU2>
 <CAN49p6r=D68Tzy1bzaXcvFVpv+RaXQG1GZnozBd4og=XkeRaDQ@mail.gmail.com>
 <20191012020212.6db1634a@JRWUBU2>
 <CAJ2xs_GDJMnqTmK9j3RDeygN=ka1FepOikTewmv-nNDewarnCw@mail.gmail.com>
 <20191012131755.7749a622@JRWUBU2>
 <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com>
 <20191012233705.52544fb9@JRWUBU2>
 <C1F2BDD2-86E7-4643-BED8-B9298084469C@telia.com>
 <20191013140018.5ea512bc@JRWUBU2>
 <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com>
 <20191013201754.6597fdd0@JRWUBU2>
 <A5305A26-F8B8-43B6-96B8-474858B709A5@telia.com>
Message-ID: <20191013225412.4f1772ca@JRWUBU2>

On Sun, 13 Oct 2019 22:14:10 +0200
Hans ?berg via Unicode <unicode at unicode.org> wrote:

> > On 13 Oct 2019, at 21:17, Richard Wordingham via Unicode
> > <unicode at unicode.org> wrote:

> > Incidentally, at least some of the sizes and timings I gave seem to
> > be wrong even for strings.  They won't work with numeric
> > quantifiers, as in /[ab]{0,20}[ac]{10,20}[ad]{0,20}e/.  

> > One gets lesser issues in quantifying complexity if one wants "?" to
> > match \p{Lu} when working in NFD - potentially a different state for
> > each prefix of the capital letters.  (It's also the case except for
> > UTF-32 if characters are treated as sequences of code units.)
> > Perhaps 'upper case letter that Unicode happens to have encoded as
> > a single character' isn't a concept that regular expressions need
> > to support concisely. What's needed is to have a set somewhere
> > between [\p{Lu}&\p{isNFD}] and [\p{Lu}],though perhaps it should be
> > extended to include "ff" - there are English surnames like
> > "ffrench?.  

The point about these examples is that the estimate of one state per
character becomes a severe underestimate.  For example, after
processing 20 a's, the NFA for /[ab]{0,20}[ac]{10,20}[ad]{0,20}e/ can
be in any of about 50 states.  The number of possible states is not
linear in the length of the expression.  While a 'loop iteration' can
keep the size of the compiled regex down, it doesn't prevent the
proliferation of states - just add zeroes to my example. 

> I made some C++ templates that translate Unicode code point character
> classes into UTF-8/32 regular expressions. So anything that can be
> reduced to actual regular expressions would work. 

Besides invalidating complexity metrics, the issue was what \p{Lu}
should match.  For example, with PCRE syntax, GNU grep Version 2.25
\p{Lu} matches U+0100 but not <A, U+0300>.  When I'm respecting
canonical equivalence, I want both to match [:Lu:], and that's what I
do. [:Lu:] can then match a sequence of up to 4 NFD characters.

Regards,

Richard.


From unicode at unicode.org  Sun Oct 13 17:22:36 2019
From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode)
Date: Mon, 14 Oct 2019 00:22:36 +0200
Subject: Pure Regular Expression Engines and Literal Clusters
In-Reply-To: <20191013225412.4f1772ca@JRWUBU2>
References: <20191008152534.2068db6c@JRWUBU2>
 <CAN49p6r2EiCZLoOV1DngbM83d5Vzu7X0dOcjVnXf-CJK4+aR6Q@mail.gmail.com>
 <20191011200158.41a948f4@JRWUBU2>
 <CAN49p6r=D68Tzy1bzaXcvFVpv+RaXQG1GZnozBd4og=XkeRaDQ@mail.gmail.com>
 <20191012020212.6db1634a@JRWUBU2>
 <CAJ2xs_GDJMnqTmK9j3RDeygN=ka1FepOikTewmv-nNDewarnCw@mail.gmail.com>
 <20191012131755.7749a622@JRWUBU2>
 <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com>
 <20191012233705.52544fb9@JRWUBU2>
 <C1F2BDD2-86E7-4643-BED8-B9298084469C@telia.com>
 <20191013140018.5ea512bc@JRWUBU2>
 <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com>
 <20191013201754.6597fdd0@JRWUBU2>
 <A5305A26-F8B8-43B6-96B8-474858B709A5@telia.com>
 <20191013225412.4f1772ca@JRWUBU2>
Message-ID: <7DFB381E-EAE3-4B04-AA50-2E18E764E731@telia.com>


> On 13 Oct 2019, at 23:54, Richard Wordingham via Unicode <unicode at unicode.org> wrote:
> 
> The point about these examples is that the estimate of one state per
> character becomes a severe underestimate.  For example, after
> processing 20 a's, the NFA for /[ab]{0,20}[ac]{10,20}[ad]{0,20}e/ can
> be in any of about 50 states.  The number of possible states is not
> linear in the length of the expression.  While a 'loop iteration' can
> keep the size of the compiled regex down, it doesn't prevent the
> proliferation of states - just add zeroes to my example. 

Formally only the expansion of such ranges are NFA, and I haven?t seen anyone considering the complexity with them included. So to me, it seems just a hack.

>> I made some C++ templates that translate Unicode code point character
>> classes into UTF-8/32 regular expressions. So anything that can be
>> reduced to actual regular expressions would work. 
> 
> Besides invalidating complexity metrics, the issue was what \p{Lu}
> should match.  For example, with PCRE syntax, GNU grep Version 2.25
> \p{Lu} matches U+0100 but not <A, U+0300>.  When I'm respecting
> canonical equivalence, I want both to match [:Lu:], and that's what I
> do. [:Lu:] can then match a sequence of up to 4 NFD characters.

Hopefully some experts here can tune in, explaining exactly what regular expressions they have in mind.


From unicode at unicode.org  Sun Oct 13 19:10:45 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Mon, 14 Oct 2019 01:10:45 +0100
Subject: Pure Regular Expression Engines and Literal Clusters
In-Reply-To: <7DFB381E-EAE3-4B04-AA50-2E18E764E731@telia.com>
References: <20191008152534.2068db6c@JRWUBU2>
 <CAN49p6r2EiCZLoOV1DngbM83d5Vzu7X0dOcjVnXf-CJK4+aR6Q@mail.gmail.com>
 <20191011200158.41a948f4@JRWUBU2>
 <CAN49p6r=D68Tzy1bzaXcvFVpv+RaXQG1GZnozBd4og=XkeRaDQ@mail.gmail.com>
 <20191012020212.6db1634a@JRWUBU2>
 <CAJ2xs_GDJMnqTmK9j3RDeygN=ka1FepOikTewmv-nNDewarnCw@mail.gmail.com>
 <20191012131755.7749a622@JRWUBU2>
 <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com>
 <20191012233705.52544fb9@JRWUBU2>
 <C1F2BDD2-86E7-4643-BED8-B9298084469C@telia.com>
 <20191013140018.5ea512bc@JRWUBU2>
 <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com>
 <20191013201754.6597fdd0@JRWUBU2>
 <A5305A26-F8B8-43B6-96B8-474858B709A5@telia.com>
 <20191013225412.4f1772ca@JRWUBU2>
 <7DFB381E-EAE3-4B04-AA50-2E18E764E731@telia.com>
Message-ID: <20191014011045.35c851e9@JRWUBU2>

On Mon, 14 Oct 2019 00:22:36 +0200
Hans ?berg via Unicode <unicode at unicode.org> wrote:

> > On 13 Oct 2019, at 23:54, Richard Wordingham via Unicode
> > <unicode at unicode.org> wrote:

>> Besides invalidating complexity metrics, the issue was what \p{Lu}
>> should match.  For example, with PCRE syntax, GNU grep Version 2.25
>> \p{Lu} matches U+0100 but not <A, U+0300>.  When I'm respecting
>> canonical equivalence, I want both to match [:Lu:], and that's what
>> I do. [:Lu:] can then match a sequence of up to 4 NFD characters.  
 
> Hopefully some experts here can tune in, explaining exactly what
> regular expressions they have in mind.

The best indication lies at
https://www.unicode.org/reports/tr18/tr18-13.html#Canonical_Equivalents
(2008), which is the last version before support for canonical
equivalence was dropped as a requirement.

It's not entirely coherent, as the authors don't seem to find an
expression like

\p{L}\p{gcb=extend}*

a natural thing to use, as the second factor is mostly sequences of
non-starters.  At that point, I would say they weren't expecting
\p{Lu} to not match  <A, U+0300>, as they were still expecting  [?] to
match both "?" and "a\u0308".

They hadn't given any thought to [\p{L}&&\p{isNFD}]\p{gcb=extend}*, and
were expecting normalisation (even to NFC) to be a possible cure.  They
had begun to realise that converting expressions to match all or none
of a set of canonical equivalents was hard; the issue of non-contiguous
matches wasn't mentioned.

When I say 'hard', I'm thinking of the problem that concatenation may
require dissolution of the two constituent expressions and involve the
temporary creation of 54-fold (if text is handled as NFD) or 2^54-fold
(no normalisation) sets of extra states.  That's what's driven me to
write my own regular expression engine for traces.

Regards,

Richard.


From unicode at unicode.org  Sun Oct 13 19:13:28 2019
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Sun, 13 Oct 2019 17:13:28 -0700
Subject: Pure Regular Expression Engines and Literal Clusters
In-Reply-To: <20191013225412.4f1772ca@JRWUBU2>
References: <20191008152534.2068db6c@JRWUBU2>
 <CAN49p6r2EiCZLoOV1DngbM83d5Vzu7X0dOcjVnXf-CJK4+aR6Q@mail.gmail.com>
 <20191011200158.41a948f4@JRWUBU2>
 <CAN49p6r=D68Tzy1bzaXcvFVpv+RaXQG1GZnozBd4og=XkeRaDQ@mail.gmail.com>
 <20191012020212.6db1634a@JRWUBU2>
 <CAJ2xs_GDJMnqTmK9j3RDeygN=ka1FepOikTewmv-nNDewarnCw@mail.gmail.com>
 <20191012131755.7749a622@JRWUBU2>
 <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com>
 <20191012233705.52544fb9@JRWUBU2>
 <C1F2BDD2-86E7-4643-BED8-B9298084469C@telia.com>
 <20191013140018.5ea512bc@JRWUBU2>
 <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com>
 <20191013201754.6597fdd0@JRWUBU2>
 <A5305A26-F8B8-43B6-96B8-474858B709A5@telia.com>
 <20191013225412.4f1772ca@JRWUBU2>
Message-ID: <7b42cd5c-d78f-3583-39c1-65ee209eefa5@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20191013/63776172/attachment.html>

From unicode at unicode.org  Sun Oct 13 20:38:58 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Mon, 14 Oct 2019 02:38:58 +0100
Subject: Pure Regular Expression Engines and Literal Clusters
In-Reply-To: <7b42cd5c-d78f-3583-39c1-65ee209eefa5@ix.netcom.com>
References: <20191008152534.2068db6c@JRWUBU2>
 <CAN49p6r2EiCZLoOV1DngbM83d5Vzu7X0dOcjVnXf-CJK4+aR6Q@mail.gmail.com>
 <20191011200158.41a948f4@JRWUBU2>
 <CAN49p6r=D68Tzy1bzaXcvFVpv+RaXQG1GZnozBd4og=XkeRaDQ@mail.gmail.com>
 <20191012020212.6db1634a@JRWUBU2>
 <CAJ2xs_GDJMnqTmK9j3RDeygN=ka1FepOikTewmv-nNDewarnCw@mail.gmail.com>
 <20191012131755.7749a622@JRWUBU2>
 <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com>
 <20191012233705.52544fb9@JRWUBU2>
 <C1F2BDD2-86E7-4643-BED8-B9298084469C@telia.com>
 <20191013140018.5ea512bc@JRWUBU2>
 <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com>
 <20191013201754.6597fdd0@JRWUBU2>
 <A5305A26-F8B8-43B6-96B8-474858B709A5@telia.com>
 <20191013225412.4f1772ca@JRWUBU2>
 <7b42cd5c-d78f-3583-39c1-65ee209eefa5@ix.netcom.com>
Message-ID: <20191014023858.5d2be8ae@JRWUBU2>

On Sun, 13 Oct 2019 17:13:28 -0700
Asmus Freytag via Unicode <unicode at unicode.org> wrote:

> On 10/13/2019 2:54 PM, Richard Wordingham via Unicode wrote:
> Besides invalidating complexity metrics, the issue was what \p{Lu}
> should match.  For example, with PCRE syntax, GNU grep Version 2.25
> \p{Lu} matches U+0100 but not <A, U+0300>.  When I'm respecting
> canonical equivalence, I want both to match [:Lu:], and that's what I
> do. [:Lu:] can then match a sequence of up to 4 NFD characters.
> 
> Formally, wouldn't that be rewriting \p{Lu} to match \p{Lu}\p{Mn}*;
> instead of formally handling NFD, you could extend the syntax to
> handle "inherited" properties across combining sequences.
> 
> Am I missing anything?

Yes.  There is no precomposed LATIN LETTER M WITH CIRCUMFLEX, so [:Lu:]
should not match <U+004D LATIN CAPITAL LETTER M, U+0302 COMBINING
CIRCUMFLEX ACCENT>.  Now, I could invent a string property so
that \p{xLu} that meant (:?\p{Lu}\p{Mn}*).

I don't entirely understand what you said; you may have missed the
distinction between "[:Lu:] can then match" and "[:Lu:] will then
match".  I think only Greek letters expand to 4 characters in NFD.

When I'm respecting canonical equivalence/working with traces, I want
[:insc=vowel_dependent:][:insc=tone_mark:] to match both <U+0E39 THAI
CHARACTER SARA UU, U+0E49 THAI CHARACTER MAI THO> and its canonical
equivalent <U+0E49, U+0E39>.  The canonical closure of that
sequence can be messy even within scripts.  Some pairs commute: others
don't, usually for good reasons.

Regards,

Richard.

From unicode at unicode.org  Sun Oct 13 22:25:25 2019
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Sun, 13 Oct 2019 20:25:25 -0700
Subject: Pure Regular Expression Engines and Literal Clusters
In-Reply-To: <20191014023858.5d2be8ae@JRWUBU2>
References: <20191008152534.2068db6c@JRWUBU2>
 <CAN49p6r2EiCZLoOV1DngbM83d5Vzu7X0dOcjVnXf-CJK4+aR6Q@mail.gmail.com>
 <20191011200158.41a948f4@JRWUBU2>
 <CAN49p6r=D68Tzy1bzaXcvFVpv+RaXQG1GZnozBd4og=XkeRaDQ@mail.gmail.com>
 <20191012020212.6db1634a@JRWUBU2>
 <CAJ2xs_GDJMnqTmK9j3RDeygN=ka1FepOikTewmv-nNDewarnCw@mail.gmail.com>
 <20191012131755.7749a622@JRWUBU2>
 <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com>
 <20191012233705.52544fb9@JRWUBU2>
 <C1F2BDD2-86E7-4643-BED8-B9298084469C@telia.com>
 <20191013140018.5ea512bc@JRWUBU2>
 <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com>
 <20191013201754.6597fdd0@JRWUBU2>
 <A5305A26-F8B8-43B6-96B8-474858B709A5@telia.com>
 <20191013225412.4f1772ca@JRWUBU2>
 <7b42cd5c-d78f-3583-39c1-65ee209eefa5@ix.netcom.com>
 <20191014023858.5d2be8ae@JRWUBU2>
Message-ID: <3dbebbdf-04fa-e9b2-9780-cc40f4e5d15e@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20191013/fdf3717e/attachment.html>

From unicode at unicode.org  Sun Oct 13 23:28:34 2019
From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode)
Date: Sun, 13 Oct 2019 21:28:34 -0700
Subject: Pure Regular Expression Engines and Literal Clusters
In-Reply-To: <3dbebbdf-04fa-e9b2-9780-cc40f4e5d15e@ix.netcom.com>
References: <20191008152534.2068db6c@JRWUBU2>
 <CAN49p6r2EiCZLoOV1DngbM83d5Vzu7X0dOcjVnXf-CJK4+aR6Q@mail.gmail.com>
 <20191011200158.41a948f4@JRWUBU2>
 <CAN49p6r=D68Tzy1bzaXcvFVpv+RaXQG1GZnozBd4og=XkeRaDQ@mail.gmail.com>
 <20191012020212.6db1634a@JRWUBU2>
 <CAJ2xs_GDJMnqTmK9j3RDeygN=ka1FepOikTewmv-nNDewarnCw@mail.gmail.com>
 <20191012131755.7749a622@JRWUBU2>
 <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com>
 <20191012233705.52544fb9@JRWUBU2>
 <C1F2BDD2-86E7-4643-BED8-B9298084469C@telia.com>
 <20191013140018.5ea512bc@JRWUBU2>
 <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com>
 <20191013201754.6597fdd0@JRWUBU2>
 <A5305A26-F8B8-43B6-96B8-474858B709A5@telia.com>
 <20191013225412.4f1772ca@JRWUBU2>
 <7b42cd5c-d78f-3583-39c1-65ee209eefa5@ix.netcom.com>
 <20191014023858.5d2be8ae@JRWUBU2>
 <3dbebbdf-04fa-e9b2-9780-cc40f4e5d15e@ix.netcom.com>
Message-ID: <CAJ2xs_H_oYVVM5CzEnXUBVMt8t3BPTtfpLx2O4ow9hh8R8fFsA@mail.gmail.com>

The problem is that most regex engines are not written to handle some
"interesting" features of canonical equivalence, like discontinuity.
Suppose that X is canonically equivalent to AB.

   - A query /X/ can match the separated A and C in the target string
   "AbC". So if I have code do [replace /X/ in "AbC" by "pq"], how should it
   behave? "pqb", "pbq", "bpq"? If the input was in NFD (for example), should
   the output be rearranged/decomposed so that it is NFD? and so on.
   - A query /A/ can match *part* of the X in the target string "aXb". So
   if I have code to do [replace /A/ in "aXb" by "pq"], what should result:
   "apqBb"?

The syntax and APIs for regex engines are not built to handle these
features. It introduces a enough complications in the code, syntax, and
semantics that no major implementation has seen fit to do it. We used to
have a section in the spec about this, but were convinced that it was
better off handled at a higher level.

Mark


On Sun, Oct 13, 2019 at 8:31 PM Asmus Freytag via Unicode <
unicode at unicode.org> wrote:

> On 10/13/2019 6:38 PM, Richard Wordingham via Unicode wrote:
>
> On Sun, 13 Oct 2019 17:13:28 -0700
> Asmus Freytag via Unicode <unicode at unicode.org> <unicode at unicode.org> wrote:
>
>
> On 10/13/2019 2:54 PM, Richard Wordingham via Unicode wrote:
> Besides invalidating complexity metrics, the issue was what \p{Lu}
> should match.  For example, with PCRE syntax, GNU grep Version 2.25
> \p{Lu} matches U+0100 but not <A, U+0300>.  When I'm respecting
> canonical equivalence, I want both to match [:Lu:], and that's what I
> do. [:Lu:] can then match a sequence of up to 4 NFD characters.
>
> Formally, wouldn't that be rewriting \p{Lu} to match \p{Lu}\p{Mn}*;
> instead of formally handling NFD, you could extend the syntax to
> handle "inherited" properties across combining sequences.
>
> Am I missing anything?
>
> Yes.  There is no precomposed LATIN LETTER M WITH CIRCUMFLEX, so [:Lu:]
> should not match <U+004D LATIN CAPITAL LETTER M, U+0302 COMBINING
> CIRCUMFLEX ACCENT>.
>
> Why does it matter if it is precomposed? Why should it? (For anyone other
> than a character coding maven).
>
>  Now, I could invent a string property so
> that \p{xLu} that meant (:?\p{Lu}\p{Mn}*).
>
> I don't entirely understand what you said; you may have missed the
> distinction between "[:Lu:] can then match" and "[:Lu:] will then
> match".  I think only Greek letters expand to 4 characters in NFD.
>
> When I'm respecting canonical equivalence/working with traces, I want
> [:insc=vowel_dependent:][:insc=tone_mark:] to match both <U+0E39 THAI
> CHARACTER SARA UU, U+0E49 THAI CHARACTER MAI THO> and its canonical
> equivalent <U+0E49, U+0E39>.  The canonical closure of that
> sequence can be messy even within scripts.  Some pairs commute: others
> don't, usually for good reasons.
>
> Some models may be more natural for different scripts. Certainly, in SEA
> or Indic scripts, most combining marks are not best modeled with properties
> as "inherited". But for L/G/C etc. it would be a different matter.
>
> For general recommendations, such as UTS#18, it would be good to move the
> state of the art so that the "primitives" are in line with the way typical
> writing systems behave, so that people can write "linguistically correct"
> regexes.
>
> A./
>
>
> Regards,
>
> Richard.
>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20191013/be87fb3e/attachment.html>

From unicode at unicode.org  Mon Oct 14 02:05:49 2019
From: unicode at unicode.org (Eli Zaretskii via Unicode)
Date: Mon, 14 Oct 2019 10:05:49 +0300
Subject: Pure Regular Expression Engines and Literal Clusters
In-Reply-To: <20191014011045.35c851e9@JRWUBU2> (message from Richard
 Wordingham via Unicode on Mon, 14 Oct 2019 01:10:45 +0100)
References: <20191008152534.2068db6c@JRWUBU2>
 <CAN49p6r2EiCZLoOV1DngbM83d5Vzu7X0dOcjVnXf-CJK4+aR6Q@mail.gmail.com>
 <20191011200158.41a948f4@JRWUBU2>
 <CAN49p6r=D68Tzy1bzaXcvFVpv+RaXQG1GZnozBd4og=XkeRaDQ@mail.gmail.com>
 <20191012020212.6db1634a@JRWUBU2>
 <CAJ2xs_GDJMnqTmK9j3RDeygN=ka1FepOikTewmv-nNDewarnCw@mail.gmail.com>
 <20191012131755.7749a622@JRWUBU2>
 <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com>
 <20191012233705.52544fb9@JRWUBU2>
 <C1F2BDD2-86E7-4643-BED8-B9298084469C@telia.com>
 <20191013140018.5ea512bc@JRWUBU2>
 <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com>
 <20191013201754.6597fdd0@JRWUBU2>
 <A5305A26-F8B8-43B6-96B8-474858B709A5@telia.com>
 <20191013225412.4f1772ca@JRWUBU2>
 <7DFB381E-EAE3-4B04-AA50-2E18E764E731@telia.com>
 <20191014011045.35c851e9@JRWUBU2>
Message-ID: <83mue3kdrm.fsf@gnu.org>

> Date: Mon, 14 Oct 2019 01:10:45 +0100
> From: Richard Wordingham via Unicode <unicode at unicode.org>
> 
> >> Besides invalidating complexity metrics, the issue was what \p{Lu}
> >> should match.  For example, with PCRE syntax, GNU grep Version 2.25
> >> \p{Lu} matches U+0100 but not <A, U+0300>.  When I'm respecting
> >> canonical equivalence, I want both to match [:Lu:], and that's what
> >> I do. [:Lu:] can then match a sequence of up to 4 NFD characters.  
>  
> > Hopefully some experts here can tune in, explaining exactly what
> > regular expressions they have in mind.
> 
> The best indication lies at
> https://www.unicode.org/reports/tr18/tr18-13.html#Canonical_Equivalents
> (2008), which is the last version before support for canonical
> equivalence was dropped as a requirement.
> 
> It's not entirely coherent, as the authors don't seem to find an
> expression like
> 
> \p{L}\p{gcb=extend}*
> 
> a natural thing to use, as the second factor is mostly sequences of
> non-starters.  At that point, I would say they weren't expecting
> \p{Lu} to not match  <A, U+0300>, as they were still expecting  [?] to
> match both "?" and "a\u0308".
> 
> They hadn't given any thought to [\p{L}&&\p{isNFD}]\p{gcb=extend}*, and
> were expecting normalisation (even to NFC) to be a possible cure.  They
> had begun to realise that converting expressions to match all or none
> of a set of canonical equivalents was hard; the issue of non-contiguous
> matches wasn't mentioned.

I think these are two separate issues: whether search should normalize
(a.k.a. performs character folding) should be a user option.  You are
talking only about canonical equivalence, but there's also
compatibility decomposition, so, for example, searching for "1" should
perhaps match ? and ?.

From unicode at unicode.org  Mon Oct 14 02:18:54 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Mon, 14 Oct 2019 08:18:54 +0100
Subject: Pure Regular Expression Engines and Literal Clusters
In-Reply-To: <3dbebbdf-04fa-e9b2-9780-cc40f4e5d15e@ix.netcom.com>
References: <20191008152534.2068db6c@JRWUBU2>
 <CAN49p6r2EiCZLoOV1DngbM83d5Vzu7X0dOcjVnXf-CJK4+aR6Q@mail.gmail.com>
 <20191011200158.41a948f4@JRWUBU2>
 <CAN49p6r=D68Tzy1bzaXcvFVpv+RaXQG1GZnozBd4og=XkeRaDQ@mail.gmail.com>
 <20191012020212.6db1634a@JRWUBU2>
 <CAJ2xs_GDJMnqTmK9j3RDeygN=ka1FepOikTewmv-nNDewarnCw@mail.gmail.com>
 <20191012131755.7749a622@JRWUBU2>
 <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com>
 <20191012233705.52544fb9@JRWUBU2>
 <C1F2BDD2-86E7-4643-BED8-B9298084469C@telia.com>
 <20191013140018.5ea512bc@JRWUBU2>
 <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com>
 <20191013201754.6597fdd0@JRWUBU2>
 <A5305A26-F8B8-43B6-96B8-474858B709A5@telia.com>
 <20191013225412.4f1772ca@JRWUBU2>
 <7b42cd5c-d78f-3583-39c1-65ee209eefa5@ix.netcom.com>
 <20191014023858.5d2be8ae@JRWUBU2>
 <3dbebbdf-04fa-e9b2-9780-cc40f4e5d15e@ix.netcom.com>
Message-ID: <20191014081854.020a0f2d@JRWUBU2>

On Sun, 13 Oct 2019 20:25:25 -0700
Asmus Freytag via Unicode <unicode at unicode.org> wrote:

> On 10/13/2019 6:38 PM, Richard Wordingham via Unicode wrote:
> On Sun, 13 Oct 2019 17:13:28 -0700

>> Yes.  There is no precomposed LATIN LETTER M WITH CIRCUMFLEX, so
>> [:Lu:] should not match <U+004D LATIN CAPITAL LETTER M, U+0302
>> COMBINING CIRCUMFLEX ACCENT>. 

> Why does it matter if it is precomposed? Why should it? (For anyone
> other than a character coding maven).

Because general_category is a property of characters, not strings.  It
matters to anyone who intends to conform to a standard.

>> Now, I could invent a string
>> property so that \p{xLu} that meant (:?\p{Lu}\p{Mn}*).

No, I shouldn't!  \m{xLu} is infinite, which would not be allowed for
a Unicode set.  I'd have to resort to a wordy definition for it to be a
property.

Richard.

From unicode at unicode.org  Mon Oct 14 02:46:07 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Mon, 14 Oct 2019 08:46:07 +0100
Subject: Pure Regular Expression Engines and Literal Clusters
In-Reply-To: <CAJ2xs_H_oYVVM5CzEnXUBVMt8t3BPTtfpLx2O4ow9hh8R8fFsA@mail.gmail.com>
References: <20191008152534.2068db6c@JRWUBU2>
 <CAN49p6r2EiCZLoOV1DngbM83d5Vzu7X0dOcjVnXf-CJK4+aR6Q@mail.gmail.com>
 <20191011200158.41a948f4@JRWUBU2>
 <CAN49p6r=D68Tzy1bzaXcvFVpv+RaXQG1GZnozBd4og=XkeRaDQ@mail.gmail.com>
 <20191012020212.6db1634a@JRWUBU2>
 <CAJ2xs_GDJMnqTmK9j3RDeygN=ka1FepOikTewmv-nNDewarnCw@mail.gmail.com>
 <20191012131755.7749a622@JRWUBU2>
 <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com>
 <20191012233705.52544fb9@JRWUBU2>
 <C1F2BDD2-86E7-4643-BED8-B9298084469C@telia.com>
 <20191013140018.5ea512bc@JRWUBU2>
 <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com>
 <20191013201754.6597fdd0@JRWUBU2>
 <A5305A26-F8B8-43B6-96B8-474858B709A5@telia.com>
 <20191013225412.4f1772ca@JRWUBU2>
 <7b42cd5c-d78f-3583-39c1-65ee209eefa5@ix.netcom.com>
 <20191014023858.5d2be8ae@JRWUBU2>
 <3dbebbdf-04fa-e9b2-9780-cc40f4e5d15e@ix.netcom.com>
 <CAJ2xs_H_oYVVM5CzEnXUBVMt8t3BPTtfpLx2O4ow9hh8R8fFsA@mail.gmail.com>
Message-ID: <20191014084607.6d133fd6@JRWUBU2>

On Sun, 13 Oct 2019 21:28:34 -0700
Mark Davis ?? via Unicode <unicode at unicode.org> wrote:

> The problem is that most regex engines are not written to handle some
> "interesting" features of canonical equivalence, like discontinuity.
> Suppose that X is canonically equivalent to AB.
> 
>    - A query /X/ can match the separated A and C in the target string
>    "AbC". So if I have code do [replace /X/ in "AbC" by "pq"], how
> should it behave? "pqb", "pbq", "bpq"?

If A contains a non-starter, pqbC.
If C contains a non-starter, Abpq.
Otherwise, if the results are canonically inequivalent, it should
raise an exception for attempting a process that is either ill-defined
or not Unicode-compliant. 

> If the input was in NFD (for
> example), should the output be rearranged/decomposed so that it is
> NFD? and so on.

That is not a new issue.  It exists already.

>    - A query /A/ can match *part* of the X in the target string
> "aXb". So if I have code to do [replace /A/ in "aXb" by "pq"], what
> should result: "apqBb"?

Yes, unless raising an exception is appropriate (see above).

> The syntax and APIs for regex engines are not built to handle these
> features. It introduces a enough complications in the code, syntax,
> and semantics that no major implementation has seen fit to do it. We
> used to have a section in the spec about this, but were convinced
> that it was better off handled at a higher level.

What higher level?  If anything, I would say that the handler is at a
lower level (character fragments and the like).

The potential requirement should be restored, but not subsumed in
Levels 1 to 3.  It is a sufficiently different level of endeavour.

Richard.


From unicode at unicode.org  Mon Oct 14 08:08:01 2019
From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode)
Date: Mon, 14 Oct 2019 15:08:01 +0200
Subject: Pure Regular Expression Engines and Literal Clusters
In-Reply-To: <20191014011045.35c851e9@JRWUBU2>
References: <20191008152534.2068db6c@JRWUBU2>
 <CAN49p6r2EiCZLoOV1DngbM83d5Vzu7X0dOcjVnXf-CJK4+aR6Q@mail.gmail.com>
 <20191011200158.41a948f4@JRWUBU2>
 <CAN49p6r=D68Tzy1bzaXcvFVpv+RaXQG1GZnozBd4og=XkeRaDQ@mail.gmail.com>
 <20191012020212.6db1634a@JRWUBU2>
 <CAJ2xs_GDJMnqTmK9j3RDeygN=ka1FepOikTewmv-nNDewarnCw@mail.gmail.com>
 <20191012131755.7749a622@JRWUBU2>
 <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com>
 <20191012233705.52544fb9@JRWUBU2>
 <C1F2BDD2-86E7-4643-BED8-B9298084469C@telia.com>
 <20191013140018.5ea512bc@JRWUBU2>
 <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com>
 <20191013201754.6597fdd0@JRWUBU2>
 <A5305A26-F8B8-43B6-96B8-474858B709A5@telia.com>
 <20191013225412.4f1772ca@JRWUBU2>
 <7DFB381E-EAE3-4B04-AA50-2E18E764E731@telia.com>
 <20191014011045.35c851e9@JRWUBU2>
Message-ID: <6EFD1B4D-BF9F-4586-8E9C-878B59B61FC4@telia.com>


> On 14 Oct 2019, at 02:10, Richard Wordingham via Unicode <unicode at unicode.org> wrote:
> 
> On Mon, 14 Oct 2019 00:22:36 +0200
> Hans ?berg via Unicode <unicode at unicode.org> wrote:
> 
>>> On 13 Oct 2019, at 23:54, Richard Wordingham via Unicode
>>> <unicode at unicode.org> wrote:
> 
>>> Besides invalidating complexity metrics, the issue was what \p{Lu}
>>> should match.  For example, with PCRE syntax, GNU grep Version 2.25
>>> \p{Lu} matches U+0100 but not <A, U+0300>.  When I'm respecting
>>> canonical equivalence, I want both to match [:Lu:], and that's what
>>> I do. [:Lu:] can then match a sequence of up to 4 NFD characters.  
> 
>> Hopefully some experts here can tune in, explaining exactly what
>> regular expressions they have in mind.
> 
> The best indication lies at
> https://www.unicode.org/reports/tr18/tr18-13.html#Canonical_Equivalents

The certificate has expired, one day ago, risking to steal personal and financial information says the browser, refusing to load it. So one has to load the totally insecure HTTP page for risk of creating a mayhem on the computer. :-)

> (2008), which is the last version before support for canonical
> equivalence was dropped as a requirement.

As said there, one might add all the equivalents if one can find them. Alternatively, one could normalize the regex and the string, keeping track of the translation boundaries on the string so that it can be translated back to a match on the original string if called for.


From unicode at unicode.org  Mon Oct 14 13:29:39 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Mon, 14 Oct 2019 19:29:39 +0100
Subject: Pure Regular Expression Engines and Literal Clusters
In-Reply-To: <83mue3kdrm.fsf@gnu.org>
References: <20191008152534.2068db6c@JRWUBU2>
 <CAN49p6r2EiCZLoOV1DngbM83d5Vzu7X0dOcjVnXf-CJK4+aR6Q@mail.gmail.com>
 <20191011200158.41a948f4@JRWUBU2>
 <CAN49p6r=D68Tzy1bzaXcvFVpv+RaXQG1GZnozBd4og=XkeRaDQ@mail.gmail.com>
 <20191012020212.6db1634a@JRWUBU2>
 <CAJ2xs_GDJMnqTmK9j3RDeygN=ka1FepOikTewmv-nNDewarnCw@mail.gmail.com>
 <20191012131755.7749a622@JRWUBU2>
 <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com>
 <20191012233705.52544fb9@JRWUBU2>
 <C1F2BDD2-86E7-4643-BED8-B9298084469C@telia.com>
 <20191013140018.5ea512bc@JRWUBU2>
 <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com>
 <20191013201754.6597fdd0@JRWUBU2>
 <A5305A26-F8B8-43B6-96B8-474858B709A5@telia.com>
 <20191013225412.4f1772ca@JRWUBU2>
 <7DFB381E-EAE3-4B04-AA50-2E18E764E731@telia.com>
 <20191014011045.35c851e9@JRWUBU2> <83mue3kdrm.fsf@gnu.org>
Message-ID: <20191014192939.34ea39ce@JRWUBU2>

On Mon, 14 Oct 2019 10:05:49 +0300
Eli Zaretskii via Unicode <unicode at unicode.org> wrote:

> > Date: Mon, 14 Oct 2019 01:10:45 +0100
> > From: Richard Wordingham via Unicode <unicode at unicode.org>

> > They hadn't given any thought to [\p{L}&&\p{isNFD}]\p{gcb=extend}*,
> > and were expecting normalisation (even to NFC) to be a possible
> > cure.  They had begun to realise that converting expressions to
> > match all or none of a set of canonical equivalents was hard; the
> > issue of non-contiguous matches wasn't mentioned.  

> I think these are two separate issues: whether search should normalize
> (a.k.a. performs character folding) should be a user option.  You are
> talking only about canonical equivalence, but there's also
> compatibility decomposition, so, for example, searching for "1" should
> perhaps match ? and ?.

HERETIC!

The official position is that text that is canonically
equivalent is the same.  There are problem areas where traditional
modes of expression require that canonically equivalent text be treated
differently.  For these, it is useful to have tools that treat them
differently.  However, the normal presumption should be that
canonically equivalent text is the same.

The party line seems to be that most searching should actually be done
using a 'collation', which brings with it different levels of
'folding'.  In multilingual use, a collation used for searching should
be quite different to one used for sorting.

Now, there is a case for being able to switch off normalisation and
canonical equivalence generally, e.g. when dealing with ISO 10646 text
instead of Unicode text.  This of course still leaves the question of
what character classes defined by Unicode properties then mean.

If one converts the regular expression so that what it matches is
closed under canonical equivalence, then visibly normalising the
searched text becomes irrelevant.  For working with Unicode traces, I
actually do both.  I convert the text to NFD but report matches in terms
of the original code point sequence; working this way simplifies the
conversion of the regular expression, which I do as part of its
compilation.  For traces, it seems only natural to treat precomposed
characters as syntactic sugar for the NFD decompositions.  (They
have no place in the formal theory of traces.)  However, I go further
and convert the decomposed text to NFD. (Recall that conversion to NFD
can change the stored order of combining marks.)

One of the simplifications I get is that straight runs of text in the
regular expression then match in the middle just by converting that run
and the searched strings.

For the concatenation of expressions A and B, once I am looking at the
possible interleaving of two traces, I am dealing with NFA states of
the form states(A) ? {1..254} ? states(B), so that for an element (a,
n, b), a corresponds to starts of words with a match in A, b
corresponds to starts of _words_ with a match in B, and n is the ccc
of the last character used to advance to b.  The element n blocks
non-starters that can't belong to a word matching A.  If I didn't
(internally) convert the searched text to NFD, the element n would have
to be a set of blocked canonical combining classes, changing the number
of possible values from 54 to 2^54 - 1.

While aficionados of regular languages may object that converting the
searched text to NFD is cheating, there is a theorem that if I have a
finite automaton that recognises a family of NFD strings, there is
another finite automaton that will recognise all their canonical
equivalents.

Richard.


From unicode at unicode.org  Mon Oct 14 13:41:19 2019
From: unicode at unicode.org (Eli Zaretskii via Unicode)
Date: Mon, 14 Oct 2019 21:41:19 +0300
Subject: Pure Regular Expression Engines and Literal Clusters
In-Reply-To: <20191014192939.34ea39ce@JRWUBU2> (message from Richard
 Wordingham via Unicode on Mon, 14 Oct 2019 19:29:39 +0100)
References: <20191008152534.2068db6c@JRWUBU2>
 <CAN49p6r2EiCZLoOV1DngbM83d5Vzu7X0dOcjVnXf-CJK4+aR6Q@mail.gmail.com>
 <20191011200158.41a948f4@JRWUBU2>
 <CAN49p6r=D68Tzy1bzaXcvFVpv+RaXQG1GZnozBd4og=XkeRaDQ@mail.gmail.com>
 <20191012020212.6db1634a@JRWUBU2>
 <CAJ2xs_GDJMnqTmK9j3RDeygN=ka1FepOikTewmv-nNDewarnCw@mail.gmail.com>
 <20191012131755.7749a622@JRWUBU2>
 <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com>
 <20191012233705.52544fb9@JRWUBU2>
 <C1F2BDD2-86E7-4643-BED8-B9298084469C@telia.com>
 <20191013140018.5ea512bc@JRWUBU2>
 <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com>
 <20191013201754.6597fdd0@JRWUBU2>
 <A5305A26-F8B8-43B6-96B8-474858B709A5@telia.com>
 <20191013225412.4f1772ca@JRWUBU2>
 <7DFB381E-EAE3-4B04-AA50-2E18E764E731@telia.com>
 <20191014011045.35c851e9@JRWUBU2> <83mue3kdrm.fsf@gnu.org>
 <20191014192939.34ea39ce@JRWUBU2>
Message-ID: <83bluji300.fsf@gnu.org>

> Date: Mon, 14 Oct 2019 19:29:39 +0100
> From: Richard Wordingham via Unicode <unicode at unicode.org>
> 
> On Mon, 14 Oct 2019 10:05:49 +0300
> Eli Zaretskii via Unicode <unicode at unicode.org> wrote:
> 
> > I think these are two separate issues: whether search should normalize
> > (a.k.a. performs character folding) should be a user option.  You are
> > talking only about canonical equivalence, but there's also
> > compatibility decomposition, so, for example, searching for "1" should
> > perhaps match ? and ?.
> 
> HERETIC!
> 
> The official position is that text that is canonically
> equivalent is the same.  There are problem areas where traditional
> modes of expression require that canonically equivalent text be treated
> differently.  For these, it is useful to have tools that treat them
> differently.  However, the normal presumption should be that
> canonically equivalent text is the same.

I'm well aware of the official position.  However, when we attempted
to implement it unconditionally in Emacs, some people objected, and
brought up good reasons.  You can, of course, elect to disregard this
experience, and instead learn it from your own.

> The party line seems to be that most searching should actually be done
> using a 'collation', which brings with it different levels of
> 'folding'.  In multilingual use, a collation used for searching should
> be quite different to one used for sorting.

Alas, collation is locale- and language-dependent.  And, if you are
going to use your search in a multilingual application (Emacs is such
an application), you will have hard time even knowing which tailoring
to apply for each potential match, because you will need to support
the use case of working with text that mixes languages.

Leaving the conundrum to the user to resolve seems to be a good
compromise, and might actually teach us something that is useful for
future modifications of the "party line".

From unicode at unicode.org  Mon Oct 14 18:23:59 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Tue, 15 Oct 2019 00:23:59 +0100
Subject: Annoyances from Implementation of Canonical Equivalence (was: Pure
 Regular Expression Engines and Literal Clusters)
In-Reply-To: <83bluji300.fsf@gnu.org>
References: <20191008152534.2068db6c@JRWUBU2>
 <CAN49p6r2EiCZLoOV1DngbM83d5Vzu7X0dOcjVnXf-CJK4+aR6Q@mail.gmail.com>
 <20191011200158.41a948f4@JRWUBU2>
 <CAN49p6r=D68Tzy1bzaXcvFVpv+RaXQG1GZnozBd4og=XkeRaDQ@mail.gmail.com>
 <20191012020212.6db1634a@JRWUBU2>
 <CAJ2xs_GDJMnqTmK9j3RDeygN=ka1FepOikTewmv-nNDewarnCw@mail.gmail.com>
 <20191012131755.7749a622@JRWUBU2>
 <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com>
 <20191012233705.52544fb9@JRWUBU2>
 <C1F2BDD2-86E7-4643-BED8-B9298084469C@telia.com>
 <20191013140018.5ea512bc@JRWUBU2>
 <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com>
 <20191013201754.6597fdd0@JRWUBU2>
 <A5305A26-F8B8-43B6-96B8-474858B709A5@telia.com>
 <20191013225412.4f1772ca@JRWUBU2>
 <7DFB381E-EAE3-4B04-AA50-2E18E764E731@telia.com>
 <20191014011045.35c851e9@JRWUBU2> <83mue3kdrm.fsf@gnu.org>
 <20191014192939.34ea39ce@JRWUBU2> <83bluji300.fsf@gnu.org>
Message-ID: <20191015002359.700a5df0@JRWUBU2>

On Mon, 14 Oct 2019 21:41:19 +0300
Eli Zaretskii via Unicode <unicode at unicode.org> wrote:

> > Date: Mon, 14 Oct 2019 19:29:39 +0100
> > From: Richard Wordingham via Unicode <unicode at unicode.org>

> > The official position is that text that is canonically
> > equivalent is the same.  There are problem areas where traditional
> > modes of expression require that canonically equivalent text be
> > treated differently.  For these, it is useful to have tools that
> > treat them differently.  However, the normal presumption should be
> > that canonically equivalent text is the same.  

> I'm well aware of the official position.  However, when we attempted
> to implement it unconditionally in Emacs, some people objected, and
> brought up good reasons.  You can, of course, elect to disregard this
> experience, and instead learn it from your own.

Is there a good record of these complaints anywhere?  It is annoying
when a text entry function does not keep the text as one enters it, but
it would be interesting to know what the other complaints were.  (It
would occasionally be useful to have an easily issued command like
'delete preceding NFD codepoint'.)  I did mention above that
occasionally one needs to know what codepoints were used and in what
order.

Richard.

From unicode at unicode.org  Tue Oct 15 01:43:23 2019
From: unicode at unicode.org (Eli Zaretskii via Unicode)
Date: Tue, 15 Oct 2019 09:43:23 +0300
Subject: Annoyances from Implementation of Canonical Equivalence (was: Pure
 Regular Expression Engines and Literal Clusters)
In-Reply-To: <20191015002359.700a5df0@JRWUBU2> (message from Richard
 Wordingham via Unicode on Tue, 15 Oct 2019 00:23:59 +0100)
References: <20191008152534.2068db6c@JRWUBU2>
 <CAN49p6r2EiCZLoOV1DngbM83d5Vzu7X0dOcjVnXf-CJK4+aR6Q@mail.gmail.com>
 <20191011200158.41a948f4@JRWUBU2>
 <CAN49p6r=D68Tzy1bzaXcvFVpv+RaXQG1GZnozBd4og=XkeRaDQ@mail.gmail.com>
 <20191012020212.6db1634a@JRWUBU2>
 <CAJ2xs_GDJMnqTmK9j3RDeygN=ka1FepOikTewmv-nNDewarnCw@mail.gmail.com>
 <20191012131755.7749a622@JRWUBU2>
 <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com>
 <20191012233705.52544fb9@JRWUBU2>
 <C1F2BDD2-86E7-4643-BED8-B9298084469C@telia.com>
 <20191013140018.5ea512bc@JRWUBU2>
 <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com>
 <20191013201754.6597fdd0@JRWUBU2>
 <A5305A26-F8B8-43B6-96B8-474858B709A5@telia.com>
 <20191013225412.4f1772ca@JRWUBU2>
 <7DFB381E-EAE3-4B04-AA50-2E18E764E731@telia.com>
 <20191014011045.35c851e9@JRWUBU2> <83mue3kdrm.fsf@gnu.org>
 <20191014192939.34ea39ce@JRWUBU2> <83bluji300.fsf@gnu.org>
 <20191015002359.700a5df0@JRWUBU2>
Message-ID: <83tv8ah5kk.fsf@gnu.org>

> Date: Tue, 15 Oct 2019 00:23:59 +0100
> From: Richard Wordingham via Unicode <unicode at unicode.org>
> 
> > I'm well aware of the official position.  However, when we attempted
> > to implement it unconditionally in Emacs, some people objected, and
> > brought up good reasons.  You can, of course, elect to disregard this
> > experience, and instead learn it from your own.
> 
> Is there a good record of these complaints anywhere?

You could look up these discussions:

  https://lists.gnu.org/archive/html/emacs-devel/2016-02/msg00189.html
  https://lists.gnu.org/archive/html/emacs-devel/2016-02/msg00506.html

> (It would occasionally be useful to have an easily issued command
> like 'delete preceding NFD codepoint'.)

I agree.  Emacs commands that delete characters backward (usually
invoked by the Backspace key) do that automatically, if the text
before cursor was produced by composing several codepoints.

> I did mention above that occasionally one needs to know what
> codepoints were used and in what order.

Sure.  There's an Emacs command (C-u C-x =) which shows that
information for the text at a given position.

From unicode at unicode.org  Tue Oct 15 14:52:15 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Tue, 15 Oct 2019 20:52:15 +0100
Subject: Annoyances from Implementation of Canonical Equivalence (was:
 Pure Regular Expression Engines and Literal Clusters)
In-Reply-To: <83tv8ah5kk.fsf@gnu.org>
References: <20191008152534.2068db6c@JRWUBU2>
 <CAN49p6r2EiCZLoOV1DngbM83d5Vzu7X0dOcjVnXf-CJK4+aR6Q@mail.gmail.com>
 <20191011200158.41a948f4@JRWUBU2>
 <CAN49p6r=D68Tzy1bzaXcvFVpv+RaXQG1GZnozBd4og=XkeRaDQ@mail.gmail.com>
 <20191012020212.6db1634a@JRWUBU2>
 <CAJ2xs_GDJMnqTmK9j3RDeygN=ka1FepOikTewmv-nNDewarnCw@mail.gmail.com>
 <20191012131755.7749a622@JRWUBU2>
 <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com>
 <20191012233705.52544fb9@JRWUBU2>
 <C1F2BDD2-86E7-4643-BED8-B9298084469C@telia.com>
 <20191013140018.5ea512bc@JRWUBU2>
 <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com>
 <20191013201754.6597fdd0@JRWUBU2>
 <A5305A26-F8B8-43B6-96B8-474858B709A5@telia.com>
 <20191013225412.4f1772ca@JRWUBU2>
 <7DFB381E-EAE3-4B04-AA50-2E18E764E731@telia.com>
 <20191014011045.35c851e9@JRWUBU2> <83mue3kdrm.fsf@gnu.org>
 <20191014192939.34ea39ce@JRWUBU2> <83bluji300.fsf@gnu.org>
 <20191015002359.700a5df0@JRWUBU2> <83tv8ah5kk.fsf@gnu.org>
Message-ID: <20191015205215.773ac298@JRWUBU2>

On Tue, 15 Oct 2019 09:43:23 +0300
Eli Zaretskii via Unicode <unicode at unicode.org> wrote:

> > Date: Tue, 15 Oct 2019 00:23:59 +0100
> > From: Richard Wordingham via Unicode <unicode at unicode.org>
> >   
> > > I'm well aware of the official position.  However, when we
> > > attempted to implement it unconditionally in Emacs, some people
> > > objected, and brought up good reasons.  You can, of course, elect
> > > to disregard this experience, and instead learn it from your
> > > own.  
> > 
> > Is there a good record of these complaints anywhere?  
> 
> You could look up these discussions:
> 
>   https://lists.gnu.org/archive/html/emacs-devel/2016-02/msg00189.html
>   https://lists.gnu.org/archive/html/emacs-devel/2016-02/msg00506.html

These are complaints about primary-level searches, not canonical
equivalence.

> > (It would occasionally be useful to have an easily issued command
> > like 'delete preceding NFD codepoint'.)  
> 
> I agree.  Emacs commands that delete characters backward (usually
> invoked by the Backspace key) do that automatically, if the text
> before cursor was produced by composing several codepoints.

That's pretty standard, though it looks as though GTK has chosen to
reject the principle that backwards deletion deletes the last character
entered.

> Sure.  There's an Emacs command (C-u C-x =) which shows that
> information for the text at a given position.

Or commands what-cursor-position and describe-char if an emulator
gets in the way.  Having forward-char-intrusive would make it perfect.

Richard,

From unicode at unicode.org  Wed Oct 16 01:33:38 2019
From: unicode at unicode.org (Eli Zaretskii via Unicode)
Date: Wed, 16 Oct 2019 09:33:38 +0300
Subject: Annoyances from Implementation of Canonical Equivalence (was:
 Pure Regular Expression Engines and Literal Clusters)
In-Reply-To: <20191015205215.773ac298@JRWUBU2> (message from Richard
 Wordingham via Unicode on Tue, 15 Oct 2019 20:52:15 +0100)
References: <20191008152534.2068db6c@JRWUBU2>
 <CAN49p6r2EiCZLoOV1DngbM83d5Vzu7X0dOcjVnXf-CJK4+aR6Q@mail.gmail.com>
 <20191011200158.41a948f4@JRWUBU2>
 <CAN49p6r=D68Tzy1bzaXcvFVpv+RaXQG1GZnozBd4og=XkeRaDQ@mail.gmail.com>
 <20191012020212.6db1634a@JRWUBU2>
 <CAJ2xs_GDJMnqTmK9j3RDeygN=ka1FepOikTewmv-nNDewarnCw@mail.gmail.com>
 <20191012131755.7749a622@JRWUBU2>
 <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com>
 <20191012233705.52544fb9@JRWUBU2>
 <C1F2BDD2-86E7-4643-BED8-B9298084469C@telia.com>
 <20191013140018.5ea512bc@JRWUBU2>
 <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com>
 <20191013201754.6597fdd0@JRWUBU2>
 <A5305A26-F8B8-43B6-96B8-474858B709A5@telia.com>
 <20191013225412.4f1772ca@JRWUBU2>
 <7DFB381E-EAE3-4B04-AA50-2E18E764E731@telia.com>
 <20191014011045.35c851e9@JRWUBU2> <83mue3kdrm.fsf@gnu.org>
 <20191014192939.34ea39ce@JRWUBU2> <83bluji300.fsf@gnu.org>
 <20191015002359.700a5df0@JRWUBU2> <83tv8ah5kk.fsf@gnu.org>
 <20191015205215.773ac298@JRWUBU2>
Message-ID: <83imopfbct.fsf@gnu.org>

> Date: Tue, 15 Oct 2019 20:52:15 +0100
> From: Richard Wordingham via Unicode <unicode at unicode.org>
> 
> > > > I'm well aware of the official position.  However, when we
> > > > attempted to implement it unconditionally in Emacs, some people
> > > > objected, and brought up good reasons.  You can, of course, elect
> > > > to disregard this experience, and instead learn it from your
> > > > own.  
> > > 
> > > Is there a good record of these complaints anywhere?  
> > 
> > You could look up these discussions:
> > 
> >   https://lists.gnu.org/archive/html/emacs-devel/2016-02/msg00189.html
> >   https://lists.gnu.org/archive/html/emacs-devel/2016-02/msg00506.html
> 
> These are complaints about primary-level searches, not canonical
> equivalence.

Not sure what you call primary-level searches, but if you deduced the
complaints were only about searches for base characters, then that's
not so.  They are long discussions with many sub-threads, so it might
be hard to find the specific details you are looking for.

However, the conclusion was very firm, and since we made the folding
optional 3 years ago, we had no complaints.

From unicode at unicode.org  Wed Oct 16 20:26:35 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Thu, 17 Oct 2019 02:26:35 +0100
Subject: Annoyances from Implementation of Canonical Equivalence
In-Reply-To: <83imopfbct.fsf@gnu.org>
References: <20191008152534.2068db6c@JRWUBU2> <20191011200158.41a948f4@JRWUBU2>
 <CAN49p6r=D68Tzy1bzaXcvFVpv+RaXQG1GZnozBd4og=XkeRaDQ@mail.gmail.com>
 <20191012020212.6db1634a@JRWUBU2>
 <CAJ2xs_GDJMnqTmK9j3RDeygN=ka1FepOikTewmv-nNDewarnCw@mail.gmail.com>
 <20191012131755.7749a622@JRWUBU2>
 <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com>
 <20191012233705.52544fb9@JRWUBU2>
 <C1F2BDD2-86E7-4643-BED8-B9298084469C@telia.com>
 <20191013140018.5ea512bc@JRWUBU2>
 <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com>
 <20191013201754.6597fdd0@JRWUBU2>
 <A5305A26-F8B8-43B6-96B8-474858B709A5@telia.com>
 <20191013225412.4f1772ca@JRWUBU2>
 <7DFB381E-EAE3-4B04-AA50-2E18E764E731@telia.com>
 <20191014011045.35c851e9@JRWUBU2> <83mue3kdrm.fsf@gnu.org>
 <20191014192939.34ea39ce@JRWUBU2> <83bluji300.fsf@gnu.org>
 <20191015002359.700a5df0@JRWUBU2> <83tv8ah5kk.fsf@gnu.org>
 <20191015205215.773ac298@JRWUBU2> <83imopfbct.fsf@gnu.org>
Message-ID: <20191017022635.301df2b7@JRWUBU2>

On Wed, 16 Oct 2019 09:33:38 +0300
Eli Zaretskii via Unicode <unicode at unicode.org> wrote:

> > These are complaints about primary-level searches, not canonical
> > equivalence.  
> 
> Not sure what you call primary-level searches, but if you deduced the
> complaints were only about searches for base characters, then that's
> not so.  They are long discussions with many sub-threads, so it might
> be hard to find the specific details you are looking for.

The nearest I've found to complaints about including canonical
equivalences are:

(a) an observation that very occasionally one would need to switch
canonical equivalence off.  In such cases, one is not concerned with
the text as such, but rather with how Unicode non-compliant processes
will handle it.  Compliant processes are often built out of
non-compliant processes.

(b) just possibly

"What we have seen is that the behavior that comes from that Unicode
data does not please the users very much.  Users seem to have many
different ideas of what folding is useful, and disagree with each
other greatly." -
https://lists.gnu.org/archive/html/emacs-devel/2016-02/msg01359.html

I can't tell what (b) was talking about; it may well have been about
folding or asymmetric search, as opposed to supporting canonical
equivalence.

(c) A search for 'n' finding '?'.

When it comes to canonical equivalence, one answer to (c) is that as
soon as one adds the next letter letter, e.g. 'na', the search will no
longer match '?'.  (This doesn't apply to diacritic-ignoring folding.)
That argument doesn't work with the Polish letter '?' though, as it can
be word-final.

In programming, one might be able to prevent the issue
by using 'n\b{g}', but that is a requirement of RL2.2, which doesn't
seem to be high on the list of implementer's priorities, especially as
it depends on properties outwith the UCD, defined in a non-ASCII file
to boot.  A better supported solution is probably 'n\P{Mn}'.

In many cases, the answer might be a search by collation graphemes, but
that has other issues besides language sensitivity.

Richard.


From unicode at unicode.org  Thu Oct 17 02:42:19 2019
From: unicode at unicode.org (Eli Zaretskii via Unicode)
Date: Thu, 17 Oct 2019 10:42:19 +0300
Subject: Annoyances from Implementation of Canonical Equivalence
In-Reply-To: <20191017022635.301df2b7@JRWUBU2> (message from Richard
 Wordingham on Thu, 17 Oct 2019 02:26:35 +0100)
References: <20191008152534.2068db6c@JRWUBU2> <20191011200158.41a948f4@JRWUBU2>
 <CAN49p6r=D68Tzy1bzaXcvFVpv+RaXQG1GZnozBd4og=XkeRaDQ@mail.gmail.com>
 <20191012020212.6db1634a@JRWUBU2>
 <CAJ2xs_GDJMnqTmK9j3RDeygN=ka1FepOikTewmv-nNDewarnCw@mail.gmail.com>
 <20191012131755.7749a622@JRWUBU2>
 <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com>
 <20191012233705.52544fb9@JRWUBU2>
 <C1F2BDD2-86E7-4643-BED8-B9298084469C@telia.com>
 <20191013140018.5ea512bc@JRWUBU2>
 <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com>
 <20191013201754.6597fdd0@JRWUBU2>
 <A5305A26-F8B8-43B6-96B8-474858B709A5@telia.com>
 <20191013225412.4f1772ca@JRWUBU2>
 <7DFB381E-EAE3-4B04-AA50-2E18E764E731@telia.com>
 <20191014011045.35c851e9@JRWUBU2> <83mue3kdrm.fsf@gnu.org>
 <20191014192939.34ea39ce@JRWUBU2> <83bluji300.fsf@gnu.org>
 <20191015002359.700a5df0@JRWUBU2> <83tv8ah5kk.fsf@gnu.org>
 <20191015205215.773ac298@JRWUBU2>
 <83imopfbct.fsf@gnu.org> <20191017022635.301df2b7@JRWUBU2>
Message-ID: <838spjddic.fsf@gnu.org>

> Date: Thu, 17 Oct 2019 02:26:35 +0100
> From: Richard Wordingham <richard.wordingham at ntlworld.com>
> Cc: Eli Zaretskii <eliz at gnu.org>
> 
> (c) A search for 'n' finding '?'.
> 
> When it comes to canonical equivalence, one answer to (c) is that as
> soon as one adds the next letter letter, e.g. 'na', the search will no
> longer match '?'.

Sounds arbitrary to me.  How do we know that all the users will want
that?

> (This doesn't apply to diacritic-ignoring folding.)

But the issue _was_ diacritic-ignoring folding.

> That argument doesn't work with the Polish letter '?' though, as it can
> be word-final.

It actually doesn't work in general, and one factor is indeed
different languages.  The problem with ? was raised by
Spanish-speaking users, and only they were very much against folding
in this case.  Users of other languages didn't consider that a
problem, and many considered it a welcome feature.

> In many cases, the answer might be a search by collation graphemes, but
> that has other issues besides language sensitivity.

It is also unworkable, because search has to work in contexts where
the text is not displayed at all, and graphemes only exist at display
time.

From unicode at unicode.org  Thu Oct 17 15:58:50 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Thu, 17 Oct 2019 21:58:50 +0100
Subject: Annoyances from Implementation of Canonical Equivalence
In-Reply-To: <838spjddic.fsf@gnu.org>
References: <20191008152534.2068db6c@JRWUBU2>
 <CAN49p6r=D68Tzy1bzaXcvFVpv+RaXQG1GZnozBd4og=XkeRaDQ@mail.gmail.com>
 <20191012020212.6db1634a@JRWUBU2>
 <CAJ2xs_GDJMnqTmK9j3RDeygN=ka1FepOikTewmv-nNDewarnCw@mail.gmail.com>
 <20191012131755.7749a622@JRWUBU2>
 <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com>
 <20191012233705.52544fb9@JRWUBU2>
 <C1F2BDD2-86E7-4643-BED8-B9298084469C@telia.com>
 <20191013140018.5ea512bc@JRWUBU2>
 <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com>
 <20191013201754.6597fdd0@JRWUBU2>
 <A5305A26-F8B8-43B6-96B8-474858B709A5@telia.com>
 <20191013225412.4f1772ca@JRWUBU2>
 <7DFB381E-EAE3-4B04-AA50-2E18E764E731@telia.com>
 <20191014011045.35c851e9@JRWUBU2> <83mue3kdrm.fsf@gnu.org>
 <20191014192939.34ea39ce@JRWUBU2> <83bluji300.fsf@gnu.org>
 <20191015002359.700a5df0@JRWUBU2> <83tv8ah5kk.fsf@gnu.org>
 <20191015205215.773ac298@JRWUBU2> <83imopfbct.fsf@gnu.org>
 <20191017022635.301df2b7@JRWUBU2> <838spjddic.fsf@gnu.org>
Message-ID: <20191017215850.106b0475@JRWUBU2>

On Thu, 17 Oct 2019 10:42:19 +0300
Eli Zaretskii via Unicode <unicode at unicode.org> wrote:

> > Date: Thu, 17 Oct 2019 02:26:35 +0100
> > From: Richard Wordingham <richard.wordingham at ntlworld.com>
> > Cc: Eli Zaretskii <eliz at gnu.org>
> > 
> > (c) A search for 'n' finding '?'.
> > 
> > When it comes to canonical equivalence, one answer to (c) is that as
> > soon as one adds the next letter letter, e.g. 'na', the search will
> > no longer match '?'.  
> 
> Sounds arbitrary to me.  How do we know that all the users will want
> that?

If the change from codepoint by codepoint matching is just canonical
equivalence, then there is no way that the ?n? of ?na? will be matched
by the ?n? within ???.

> > (This doesn't apply to diacritic-ignoring folding.)  
> But the issue _was_ diacritic-ignoring folding.

Then we don't seem to have any evidence of user discontent arising from
supporting canonical equivalence.

> > That argument doesn't work with the Polish letter '?' though, as it
> > can be word-final.  

> It actually doesn't work in general, and one factor is indeed
> different languages.  The problem with ? was raised by
> Spanish-speaking users, and only they were very much against folding
> in this case.

I'm not talking about folding.  I'm talking about canonical
equivalence, which largely but not solely consists of treating
precomposed characters as the same as their *canonical* decompositions. 

> > In many cases, the answer might be a search by collation graphemes,
> > but that has other issues besides language sensitivity.  

> It is also unworkable, because search has to work in contexts where
> the text is not displayed at all, and graphemes only exist at display
> time.

The definition of a grapheme cluster is given in Section 9.9 of UTS#10,
which is currently at Version 12.1.0.  It is only connected to display
at a deep level, so display time is irrelevant.  Formally, it depends
on a collation, though the sorting aspect is irrelevant and is removed
for many 'search' collations in the CLDR.

So, if one were using a Spanish collation, on typing 'n' into the
incremental search string (and having it committed), the search wouldn't
consider a match with '?'. Then, on further typing the combining tilde,
it would reject the matches it had found and choose those matches with
'?', whether one codepoint or two.  Would that behaviour cause serious
grief for incremental search?  As I use an XSAMPA-based input
implemented in quail that attempts to generate text in form NFC, I would
type 'n~' to get the Spanish character, and so would never get an
intermediate state where the incremental search was searching for 'n'.
(At least, not in Emacs 25.3.1.)

Richard.


From unicode at unicode.org  Thu Oct 17 17:11:55 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Thu, 17 Oct 2019 23:11:55 +0100
Subject: Collation Grapheme Clusters and Canonical Equivalence
Message-ID: <20191017231155.0f447b26@JRWUBU2>

There seems to be a Unicode non-compliance (C6) issue in the definition
of collation grapheme clusters (defined in UTS#10 Section 9.9).  Using
the DUCET collation, the canonically equivalent strings ??? <U+0E23 THAI
CHARACTER RO RUA, U+0E39 THAI CHARACTER SARA UU, U+0E49 THAI CHARACTER
MAI THO> and ??? <U+0E23, U+0E49, U+0E39> decompose into collation
grapheme clusters in two different ways.  The first decomposes into
<U+0E23> and <U+0E39, U+0E49> and the second decomposes into <U+0E23,
U+0E49> and <U+0E39>.

Thus UTS#18 RL3.2 'Tailored Grapheme Clusters', namely "To meet this
requirement, an implementation shall provide for collation grapheme
clusters matches based on a locale's collation order", requires
canonically equivalent sequences to be interpreted differently.

Is this a known issue?

Should I report it against UTS#10 or UTS#18?

Is the phrase 'collation order' intended to preclude the use of search
collations?  Search collations allow one to find a collation grapheme
cluster starting with U+0E15 THAI CHARACTER TO TAO in its exemplifying
word ???? <U+0E40 THAI CHARACTER SARA E, U+0E15, U+0E48 THAI CHARACTER
MAI EK, U+0E32 THAI CHARACTER SARA AA>.  DUCET splits it into <U+0E40,
U+0E15, U+0E48>, <U+0E32>, but most (all?) CLDR search collations split
it into <U+0E40>, <U+0E15, U+0E48>, <U+0E32>, matching the division
into grapheme clusters.

If we accept that in the Latin script Vietnamese tone marks have
primary weights (this only shows up with strings more than one
syllable long), I can produce more egregious examples based on the
various sequences canonically equivalent to U+1EAD LATIN SMALL LETTER A
WITH CIRCUMFLEX AND DOT BELOW or to U+1EDB LATIN SMALL LETTER O WITH
HORN AND ACUTE.

The root of the problem is the desire to match only contiguous
substrings.  This does not play nicely with canonical equivalence.

Richard.


From unicode at unicode.org  Fri Oct 18 01:45:14 2019
From: unicode at unicode.org (Eli Zaretskii via Unicode)
Date: Fri, 18 Oct 2019 09:45:14 +0300
Subject: Annoyances from Implementation of Canonical Equivalence
In-Reply-To: <20191017215850.106b0475@JRWUBU2> (message from Richard
 Wordingham via Unicode on Thu, 17 Oct 2019 21:58:50 +0100)
References: <20191008152534.2068db6c@JRWUBU2>
 <CAN49p6r=D68Tzy1bzaXcvFVpv+RaXQG1GZnozBd4og=XkeRaDQ@mail.gmail.com>
 <20191012020212.6db1634a@JRWUBU2>
 <CAJ2xs_GDJMnqTmK9j3RDeygN=ka1FepOikTewmv-nNDewarnCw@mail.gmail.com>
 <20191012131755.7749a622@JRWUBU2>
 <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com>
 <20191012233705.52544fb9@JRWUBU2>
 <C1F2BDD2-86E7-4643-BED8-B9298084469C@telia.com>
 <20191013140018.5ea512bc@JRWUBU2>
 <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com>
 <20191013201754.6597fdd0@JRWUBU2>
 <A5305A26-F8B8-43B6-96B8-474858B709A5@telia.com>
 <20191013225412.4f1772ca@JRWUBU2>
 <7DFB381E-EAE3-4B04-AA50-2E18E764E731@telia.com>
 <20191014011045.35c851e9@JRWUBU2> <83mue3kdrm.fsf@gnu.org>
 <20191014192939.34ea39ce@JRWUBU2> <83bluji300.fsf@gnu.org>
 <20191015002359.700a5df0@JRWUBU2> <83tv8ah5kk.fsf@gnu.org>
 <20191015205215.773ac298@JRWUBU2> <83imopfbct.fsf@gnu.org>
 <20191017022635.301df2b7@JRWUBU2> <838spjddic.fsf@gnu.org>
 <20191017215850.106b0475@JRWUBU2>
Message-ID: <83y2xi8scl.fsf@gnu.org>

> Date: Thu, 17 Oct 2019 21:58:50 +0100
> From: Richard Wordingham via Unicode <unicode at unicode.org>
> 
> > Sounds arbitrary to me.  How do we know that all the users will want
> > that?
> 
> If the change from codepoint by codepoint matching is just canonical
> equivalence, then there is no way that the ?n? of ?na? will be matched
> by the ?n? within ???.

"Just canonical equivalence" is also quite arbitrary, for the user's
POV.  At least IME.

> > > (This doesn't apply to diacritic-ignoring folding.)  
> > But the issue _was_ diacritic-ignoring folding.
> 
> Then we don't seem to have any evidence of user discontent arising from
> supporting canonical equivalence.

Again, these are very closely related from user's POV.  Most users
don't understand the difference, in fact.  They are not Unicode
experts.

So maybe I was replying on a very different level, in which case
apologies for taking your time.

From unicode at unicode.org  Fri Oct 18 06:21:20 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Fri, 18 Oct 2019 12:21:20 +0100
Subject: Collation Grapheme Clusters and Canonical Equivalence
In-Reply-To: <20191017231155.0f447b26@JRWUBU2>
References: <20191017231155.0f447b26@JRWUBU2>
Message-ID: <20191018122120.0e1f3f05@JRWUBU2>

On Thu, 17 Oct 2019 23:11:55 +0100
Richard Wordingham via Unicode <unicode at unicode.org> wrote:

> There seems to be a Unicode non-compliance (C6) issue in the
> definition of collation grapheme clusters (defined in UTS#10 Section
> 9.9).  Using the DUCET collation, the canonically equivalent strings
> ??? <U+0E23 THAI CHARACTER RO RUA, U+0E39 THAI CHARACTER SARA UU,
> U+0E49 THAI CHARACTER MAI THO> and ??? <U+0E23, U+0E49, U+0E39>
> decompose into collation grapheme clusters in two different ways.
> The first decomposes into <U+0E23> and <U+0E39, U+0E49> and the
> second decomposes into <U+0E23, U+0E49> and <U+0E39>.  

Correction:

One has to take the collating elements in NFD order, so the tone mark
(secondary weight) and the vowel (primary weight) also form a cluster,
so the division into clusters is <U+0E23>, <U+0E49, U+0E39>.  This
split respects canonical equivalence.

Replacement:

Now, one form of typo one may see in Thai is where the
vowel is typed twice.  Thai fonts often lack mark-to-mark positioning
for sequences that should not occur, so the two copies of the vowel may
be overlaid.  Proof-reading will not spot the mistake if the font or
layout engine does not assist.

Thus we can get <U+0E23, U+0E39, U+0E39, U+0E49> (417,000 raw Google
hits, the first 10 all good).  That splits into *three* collation
grapheme clusters - <U+0E23>, <U+0E39> and <U+0E39, U+0E49>.  Its
canonical equivalence <U+0E23, U+0E49, U+0E39, U+0E39> splits into two
grapheme clusters, for to form a sequence of collating elements
without skipping starting at the U+0E49, one must take all three
characters.  Overall, we end up with *two* collation grapheme clusters,
<U+0E23> and <U+0E49, U+0E39, U+0E39>.

> Thus UTS#18 RL3.2 'Tailored Grapheme Clusters', namely "To meet this
> requirement, an implementation shall provide for collation grapheme
> clusters matches based on a locale's collation order", requires
> canonically equivalent sequences to be interpreted differently.

Richard.


From unicode at unicode.org  Fri Oct 18 07:44:31 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Fri, 18 Oct 2019 13:44:31 +0100
Subject: Annoyances from Implementation of Canonical Equivalence
In-Reply-To: <83y2xi8scl.fsf@gnu.org>
References: <20191008152534.2068db6c@JRWUBU2> <20191012020212.6db1634a@JRWUBU2>
 <CAJ2xs_GDJMnqTmK9j3RDeygN=ka1FepOikTewmv-nNDewarnCw@mail.gmail.com>
 <20191012131755.7749a622@JRWUBU2>
 <3EA16101-7FD3-4BE8-82F7-DB32E36BBFAE@telia.com>
 <20191012233705.52544fb9@JRWUBU2>
 <C1F2BDD2-86E7-4643-BED8-B9298084469C@telia.com>
 <20191013140018.5ea512bc@JRWUBU2>
 <3893F0F9-3199-46A6-BC62-0404708CD71A@telia.com>
 <20191013201754.6597fdd0@JRWUBU2>
 <A5305A26-F8B8-43B6-96B8-474858B709A5@telia.com>
 <20191013225412.4f1772ca@JRWUBU2>
 <7DFB381E-EAE3-4B04-AA50-2E18E764E731@telia.com>
 <20191014011045.35c851e9@JRWUBU2> <83mue3kdrm.fsf@gnu.org>
 <20191014192939.34ea39ce@JRWUBU2> <83bluji300.fsf@gnu.org>
 <20191015002359.700a5df0@JRWUBU2> <83tv8ah5kk.fsf@gnu.org>
 <20191015205215.773ac298@JRWUBU2> <83imopfbct.fsf@gnu.org>
 <20191017022635.301df2b7@JRWUBU2> <838spjddic.fsf@gnu.org>
 <20191017215850.106b0475@JRWUBU2> <83y2xi8scl.fsf@gnu.org>
Message-ID: <20191018134431.13ff0238@JRWUBU2>

On Fri, 18 Oct 2019 09:45:14 +0300
Eli Zaretskii via Unicode <unicode at unicode.org> wrote:

> > Date: Thu, 17 Oct 2019 21:58:50 +0100
> > From: Richard Wordingham via Unicode <unicode at unicode.org>
> >   
> > > Sounds arbitrary to me.  How do we know that all the users will
> > > want that?  
> > 
> > If the change from codepoint by codepoint matching is just canonical
> > equivalence, then there is no way that the ?n? of ?na? will be
> > matched by the ?n? within ???.  
> 
> "Just canonical equivalence" is also quite arbitrary, for the user's
> POV.  At least IME.

Here's a similar issue.  If I do an incremental search in Welsh text,
entering bac (on the way to entering bach) will find words like "bach"
and  "bachgen" even though their third letter is 'ch', not 'c'.

'Canonical equivalence' is 'DTRT', unless you're working with systems
too lazy or too primitive to DTRT.  It involves treating sequences of
character sequences declared to be identical in signification
identically.

The only pleasant justification for treating canonical sequences
inequivalently that I can think of is to treat the difference as a way
of recording how the text was typed.  Quite a few editing systems erase
that information, and I doubt people care how someone else typed the
text.

Richard.


From unicode at unicode.org  Mon Oct 21 04:21:03 2019
From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode)
Date: Mon, 21 Oct 2019 11:21:03 +0200
Subject: Coding for Emoji: how to modify programs to work with emoji
Message-ID: <CAJ2xs_G+zvXZNCaQ3fhTv0V+yK-Wyqgtq8+Yd_yr9gmzSiQr8Q@mail.gmail.com>

FYI, here is my presentation from the IUC43: http://bit.ly/iuc43davis

Mark
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20191021/5c6ef754/attachment.html>

From unicode at unicode.org  Tue Oct 22 02:37:22 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Tue, 22 Oct 2019 08:37:22 +0100
Subject: Coding for Emoji: how to modify programs to work with emoji
In-Reply-To: <CAJ2xs_G+zvXZNCaQ3fhTv0V+yK-Wyqgtq8+Yd_yr9gmzSiQr8Q@mail.gmail.com>
References: <CAJ2xs_G+zvXZNCaQ3fhTv0V+yK-Wyqgtq8+Yd_yr9gmzSiQr8Q@mail.gmail.com>
Message-ID: <20191022083722.6b456367@JRWUBU2>

On Mon, 21 Oct 2019 11:21:03 +0200
Mark Davis ?? via Unicode <unicode at unicode.org> wrote:

> FYI, here is my presentation from the IUC43: http://bit.ly/iuc43davis

When it comes to the second sentence of the text of Slide 7 'Grapheme
Clusters', my overwhelming reaction is one of extreme anger.  Slide 8
does nothing to lessen the offence.  The problem is that it gives the
impression that in general it is acceptable for backspace to delete the
whole grapheme cluster.

Richard.


From unicode at unicode.org  Tue Oct 22 04:04:01 2019
From: unicode at unicode.org (=?utf-8?Q?Daniel_B=C3=BCnzli?= via Unicode)
Date: Tue, 22 Oct 2019 11:04:01 +0200
Subject: Grapheme clusters and backspace (was Re: Coding for Emoji: how
 to modify programs to work with emoji)
In-Reply-To: <20191022083722.6b456367@JRWUBU2>
References: <CAJ2xs_G+zvXZNCaQ3fhTv0V+yK-Wyqgtq8+Yd_yr9gmzSiQr8Q@mail.gmail.com>
 <20191022083722.6b456367@JRWUBU2>
Message-ID: <etPan.5daec601.78454f36.14e@erratique.ch>

On 22 October 2019 at 09:37:22, Richard Wordingham via Unicode (unicode at unicode.org) wrote:

> When it comes to the second sentence of the text of Slide 7 'Grapheme
> Clusters', my overwhelming reaction is one of extreme anger. Slide 8
> does nothing to lessen the offence. The problem is that it gives the
> impression that in general it is acceptable for backspace to delete the
> whole grapheme cluster.

Let's turn extreme anger into knowledge.?

I'm not very knowledgable in ligature heavy scripts (I suspect that's what you refer to) and what you describe is the first thing I went with for a readline editor data structure.?

Would maybe care to expand when exactly you think it's not acceptable and what kind of tools or standard I can find the Unicode toolbox to implement an acceptable behaviour for backspace on general Unicode text.?

Best,?

Daniel


From unicode at unicode.org  Tue Oct 22 05:18:06 2019
From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode)
Date: Tue, 22 Oct 2019 12:18:06 +0200
Subject: Coding for Emoji: how to modify programs to work with emoji
In-Reply-To: <20191022083722.6b456367@JRWUBU2>
References: <CAJ2xs_G+zvXZNCaQ3fhTv0V+yK-Wyqgtq8+Yd_yr9gmzSiQr8Q@mail.gmail.com>
 <20191022083722.6b456367@JRWUBU2>
Message-ID: <CAJ2xs_EKjQOEOgbnr3kzxEAHREEnagxRK1Pumyn1b5RKBwXp7A@mail.gmail.com>

That sentence is specific to Emoji sequences. I added a note make it clear
that the behavior of backspace for combining marks may be language or
script dependent.

BTW, the speaker notes were added quickly; feedback on them is welcome.

Mark


On Tue, Oct 22, 2019 at 9:41 AM Richard Wordingham via Unicode <
unicode at unicode.org> wrote:

> On Mon, 21 Oct 2019 11:21:03 +0200
> Mark Davis ?? via Unicode <unicode at unicode.org> wrote:
>
> > FYI, here is my presentation from the IUC43: http://bit.ly/iuc43davis
>
> When it comes to the second sentence of the text of Slide 7 'Grapheme
> Clusters', my overwhelming reaction is one of extreme anger.  Slide 8
> does nothing to lessen the offence.  The problem is that it gives the
> impression that in general it is acceptable for backspace to delete the
> whole grapheme cluster.
>
> Richard.
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20191022/95761c75/attachment.html>

From unicode at unicode.org  Tue Oct 22 15:44:10 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Tue, 22 Oct 2019 21:44:10 +0100
Subject: Grapheme clusters and backspace (was Re: Coding for Emoji: how
 to modify programs to work with emoji)
In-Reply-To: <etPan.5daec601.78454f36.14e@erratique.ch>
References: <CAJ2xs_G+zvXZNCaQ3fhTv0V+yK-Wyqgtq8+Yd_yr9gmzSiQr8Q@mail.gmail.com>
 <20191022083722.6b456367@JRWUBU2>
 <etPan.5daec601.78454f36.14e@erratique.ch>
Message-ID: <20191022214410.6020c96b@JRWUBU2>

On Tue, 22 Oct 2019 11:04:01 +0200
Daniel B?nzli via Unicode <unicode at unicode.org> wrote:

> On 22 October 2019 at 09:37:22, Richard Wordingham via Unicode
> (unicode at unicode.org) wrote:
> 
> > When it comes to the second sentence of the text of Slide 7
> > 'Grapheme Clusters', my overwhelming reaction is one of extreme
> > anger. Slide 8 does nothing to lessen the offence. The problem is
> > that it gives the impression that in general it is acceptable for
> > backspace to delete the whole grapheme cluster.  
> 
> Let's turn extreme anger into knowledge.?
> 
> I'm not very knowledgable in ligature heavy scripts (I suspect that's
> what you refer to) and what you describe is the first thing I went
> with for a readline editor data structure.?

Not necessarily ligature-heavy, but heavy in combining characters.
Examples at the light end include IPA and pointed Hebrew.  The Thai
script is another fairly well-known one but Siamese itself doesn't use
more than two marks on a consonant.  (The vowel marks before and after
don't count - they work like letters.)

> Would maybe care to expand when exactly you think it's not acceptable
> and what kind of tools or standard I can find the Unicode toolbox to
> implement an acceptable behaviour for backspace on general Unicode
> text.?

The compromise that has generally been reached is that 'delete' deletes
a grapheme cluster and 'backspace' deletes a scalar value.  (There are
good editors like Emacs that delete only a single character.)  The
rationale for this is that backspace undoes the effect of a
keystroke. For a perfect match, the keyboard would need to handle the
backspace - and everyone editing the text would have to use compatible
keyboards!  That's not a very plausible scenario for a Wikipedia
article.

Now, deleting the last character is not very Unicode compliant; there
is a family of keyboard designs in development that by default deletes
the last character in NFC form if it is precomposed and otherwise the
last character in NFD forms.  UTS#35 Issue 36 Part 7 Section 5.21
allows for more elaborate behaviours.  I would contend that deleting
the last character is the best simple approximation.  However, it's not
impossible for a dead key implementation to decide that dead acute plus
'e' should be emitted as two characters, even though its more usual for
it to be emitted as a single character.

Now, there are cases where one may be unlikely to type a single
character.  I can imagine a variation sequence or being implemented as
a 'ligature', i.e. a single stroke (or IME selection action) yielding
the entry of a base character plus variation selector.  Emoji may be
another, though I must say I would probably enter a regional indicator
pair as two characters, and expect to be able to delete just the last
if I made an error, contra Davis 2019.

While stacker + consonant might be expected to be a unit, the original
designs envisaged them being a sequence.  Additionally, I would expect
an edit to change the subscripted consonant rather than remove it.  In
this case, delete last character and delete grapheme cluster agree for
the language-independent rules.

Richard.


From unicode at unicode.org  Tue Oct 22 16:27:27 2019
From: unicode at unicode.org (=?utf-8?Q?Daniel_B=C3=BCnzli?= via Unicode)
Date: Tue, 22 Oct 2019 23:27:27 +0200
Subject: Grapheme clusters and backspace (was Re: Coding for Emoji:
 how to modify programs to work with emoji)
In-Reply-To: <20191022214410.6020c96b@JRWUBU2>
References: <CAJ2xs_G+zvXZNCaQ3fhTv0V+yK-Wyqgtq8+Yd_yr9gmzSiQr8Q@mail.gmail.com>
 <20191022083722.6b456367@JRWUBU2>
 <etPan.5daec601.78454f36.14e@erratique.ch>
 <20191022214410.6020c96b@JRWUBU2>
Message-ID: <etPan.5daf743f.1c98485.f8c4@erratique.ch>

Thanks for you answer.

> The compromise that has generally been reached is that 'delete' deletes
> a grapheme cluster and 'backspace' deletes a scalar value. (There are
> good editors like Emacs that delete only a single character.)

Just to make things clear. When you say character in your message, you consistently mean scalar value right ?

Best,?

Daniel


From unicode at unicode.org  Tue Oct 22 17:32:31 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Tue, 22 Oct 2019 23:32:31 +0100
Subject: Grapheme clusters and backspace (was Re: Coding for Emoji: how
 to modify programs to work with emoji)
In-Reply-To: <etPan.5daf743f.1c98485.f8c4@erratique.ch>
References: <CAJ2xs_G+zvXZNCaQ3fhTv0V+yK-Wyqgtq8+Yd_yr9gmzSiQr8Q@mail.gmail.com>
 <20191022083722.6b456367@JRWUBU2>
 <etPan.5daec601.78454f36.14e@erratique.ch>
 <20191022214410.6020c96b@JRWUBU2>
 <etPan.5daf743f.1c98485.f8c4@erratique.ch>
Message-ID: <20191022233231.441d2af1@JRWUBU2>

On Tue, 22 Oct 2019 23:27:27 +0200
Daniel B?nzli via Unicode <unicode at unicode.org> wrote:

> Thanks for you answer.
> 
> > The compromise that has generally been reached is that 'delete'
> > deletes a grapheme cluster and 'backspace' deletes a scalar value.
> > (There are good editors like Emacs that delete only a single
> > character.)  
> 
> Just to make things clear. When you say character in your message,
> you consistently mean scalar value right ?

Yes.

I find it hard to imagine that having to type them doesn't endow then
with some sort of reality in the users' minds, though some, such as
invisible stackers, are probably envisaged as control characters.

One does come across some odd entry methods, such as typing an Indic
akshara using the Latin script and then entering it as a whole.  That
is no more conducive to seeing the constituents as characters than is
typing wab- to get the hieroglyph ??.

Richard. 


From unicode at unicode.org  Tue Oct 22 18:15:57 2019
From: unicode at unicode.org (=?utf-8?B?TWFydGluIEouIETDvHJzdA==?= via Unicode)
Date: Tue, 22 Oct 2019 23:15:57 +0000
Subject: Grapheme clusters and backspace (was Re: Coding for Emoji: how to
 modify programs to work with emoji)
In-Reply-To: <20191022233231.441d2af1@JRWUBU2>
References: <CAJ2xs_G+zvXZNCaQ3fhTv0V+yK-Wyqgtq8+Yd_yr9gmzSiQr8Q@mail.gmail.com>
 <20191022083722.6b456367@JRWUBU2> <etPan.5daec601.78454f36.14e@erratique.ch>
 <20191022214410.6020c96b@JRWUBU2> <etPan.5daf743f.1c98485.f8c4@erratique.ch>
 <20191022233231.441d2af1@JRWUBU2>
Message-ID: <3e07968d-33a5-9864-570c-c365022883ce@it.aoyama.ac.jp>

Hello Richard, others,

On 2019/10/23 07:32, Richard Wordingham via Unicode wrote:
> On Tue, 22 Oct 2019 23:27:27 +0200
> Daniel B?nzli via Unicode <unicode at unicode.org> wrote:

>> Just to make things clear. When you say character in your message,
>> you consistently mean scalar value right ?
> 
> Yes.
> 
> I find it hard to imagine that having to type them doesn't endow then
> with some sort of reality in the users' minds, though some, such as
> invisible stackers, are probably envisaged as control characters.

I think this to some extent is a question of "reality in the users' 
minds". But to a very large extent, this is an issue of muscle memory. 
If a user works with a keyboard/input method that deletes a whole 
combination, their muscles will get used to that the same way they will 
get used to the other case.

Users are perfectly capable of talking about characters and in the same 
sentence use that word once for something like individual codepoints and 
later for a whole combination.

> One does come across some odd entry methods, such as typing an Indic
> akshara using the Latin script and then entering it as a whole.  That
> is no more conducive to seeing the constituents as characters than is
> typing wab- to get the hieroglyph ??.

The input of Japanese Kana is usually done from a Latin keyboard. As an 
example, to input the syllable "ka" (?), one presses the keys for 'k' 
and 'a'. In all the IMEs I have used, a backspace deletes the whole
"?", not only the 'a'. One has to get used to it (I still occasionally 
want to press two backspaces when realizing I made a typo), but one gets 
used to it.

There are also cases such as "kya" ? "??", where the three Latin 
keyboard presses cannot be allocated 2-1 or 1-2 to the two resulting 
Hiragana. In a sophisticated implementation, a backspace could go from
"??" to "ky", but that would only work immediately after input.

Of course, for Japanese input, Latin ? Kana is only the first layer, the 
second layer is Kana ? Kanji.

Regards,   Martin.


From unicode at unicode.org  Tue Oct 22 21:31:09 2019
From: unicode at unicode.org (Ben Morphett via Unicode)
Date: Wed, 23 Oct 2019 02:31:09 +0000
Subject: Unicode Digest, Vol 70, Issue 17
In-Reply-To: <mailman.4.1571763601.23563.unicode@unicode.org>
References: <mailman.4.1571763601.23563.unicode@unicode.org>
Message-ID: <CH2PR20MB296549C2CDB34FA2AF8436A7DB6B0@CH2PR20MB2965.namprd20.prod.outlook.com>

It totally depends on the editor.  In Notepad++, when I backspace over "Man Teacher: Dark Skin Tone", I get "Man Teacher: Dark Skin Tone" => ""Man: Dark Skin Tone" => gone.

In the Outlook e-mail editor, I get 

???????
?????
??

--
Cheers,
Ben Morphett

-----Original Message-----
From: Richard Wordingham via Unicode <unicode at unicode.org> 
Sent: Tuesday, 22 October 2019 6:37 PM
To: unicode at unicode.org
Subject: Re: Coding for Emoji: how to modify programs to work with emoji

On Mon, 21 Oct 2019 11:21:03 +0200
Mark Davis ?? via Unicode <unicode at unicode.org> wrote:

> FYI, here is my presentation from the IUC43: http://bit.ly/iuc43davis

When it comes to the second sentence of the text of Slide 7 'Grapheme Clusters', my overwhelming reaction is one of extreme anger.  Slide 8 does nothing to lessen the offence.  The problem is that it gives the impression that in general it is acceptable for backspace to delete the whole grapheme cluster.

Richard.


From unicode at unicode.org  Wed Oct 23 03:02:44 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Wed, 23 Oct 2019 09:02:44 +0100
Subject: Grapheme clusters and backspace (was Re: Coding for Emoji: how
 to modify programs to work with emoji)
In-Reply-To: <3e07968d-33a5-9864-570c-c365022883ce@it.aoyama.ac.jp>
References: <CAJ2xs_G+zvXZNCaQ3fhTv0V+yK-Wyqgtq8+Yd_yr9gmzSiQr8Q@mail.gmail.com>
 <20191022083722.6b456367@JRWUBU2>
 <etPan.5daec601.78454f36.14e@erratique.ch>
 <20191022214410.6020c96b@JRWUBU2>
 <etPan.5daf743f.1c98485.f8c4@erratique.ch>
 <20191022233231.441d2af1@JRWUBU2>
 <3e07968d-33a5-9864-570c-c365022883ce@it.aoyama.ac.jp>
Message-ID: <20191023090244.05b22cf4@JRWUBU2>

On Tue, 22 Oct 2019 23:15:57 +0000
Martin J. D?rst via Unicode <unicode at unicode.org> wrote:

> I think this to some extent is a question of "reality in the users' 
> minds". But to a very large extent, this is an issue of muscle
> memory. If a user works with a keyboard/input method that deletes a
> whole combination, their muscles will get used to that the same way
> they will get used to the other case.

The issue is one of being able to edit the cluster.  Large clusters
call out for editing rather than replacement.

Richard.


From unicode at unicode.org  Wed Oct 23 11:39:04 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Wed, 23 Oct 2019 17:39:04 +0100
Subject: Grapheme clusters & backspace (was: Unicode Digest, Vol 70,
 Issue 17)
In-Reply-To: <CH2PR20MB296549C2CDB34FA2AF8436A7DB6B0@CH2PR20MB2965.namprd20.prod.outlook.com>
References: <mailman.4.1571763601.23563.unicode@unicode.org>
 <CH2PR20MB296549C2CDB34FA2AF8436A7DB6B0@CH2PR20MB2965.namprd20.prod.outlook.com>
Message-ID: <20191023173904.4be31e5e@JRWUBU2>

On Wed, 23 Oct 2019 02:31:09 +0000
Ben Morphett via Unicode <unicode at unicode.org> wrote:

> It totally depends on the editor.  In Notepad++, when I backspace
> over "Man Teacher: Dark Skin Tone", I get "Man Teacher: Dark Skin
> Tone" => ""Man: Dark Skin Tone" => gone.

In MS Word 2016 on Windows 10, I get an intermediate stage of ?Man:
Dark Skin ZWJ?, which is comparable to my suggestion that only the
consonant be deleted from a sequence of Indic stacker + consonant, even
though it be very similar to a unitary consonant sign.  The main
difference in the Indic pair is that there is a (misplaced) grapheme
cluster boundary in the former.

Mark Davis has proclaimed that all these emoji behaviours are WRONG.    

What is wrong is that the ZWJ may go missing with copy and paste, as I
found between Word and plain Notepad.

Richard.


From unicode at unicode.org  Tue Oct 29 17:36:21 2019
From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode)
Date: Tue, 29 Oct 2019 22:36:21 +0000 (GMT)
Subject: New Public Review on QID emoji
Message-ID: <f565b94.1c71.16e19a9b24c.Webtop.221@btinternet.com>

Hello everyone

I have recently learned that there is a new Public Review Issue on QID 
emoji.

https://www.unicode.org/review/pri408/

Also the closure date for PRI 405 has been given an extension.

http://www.unicode.org/review/pri405/

https://www.unicode.org/review/

William Overington

Tuesday 29 October 2019


From unicode at unicode.org  Wed Oct 30 12:41:16 2019
From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode)
Date: Wed, 30 Oct 2019 17:41:16 +0000 (GMT)
Subject: New Public Review on QID emoji
Message-ID: <5eff0ea4.a4d.16e1dc1e66b.Webtop.52@btinternet.com>

Hello everyone

I have been reading about QID emoji and what is proposed.

At present I have a question to which I cannot find the answer.

Is the QID emoji format, if approved by the Unicode Technical Committee 
going to be sent to the ISO/IEC 10646 committee for consideration by 
that committee?

As the QID emoji format is in a Unicode Technical Standard and does not 
include the encoding of any new _atomic_ characters, I am concerned that 
the answer to the above question may well be along the lines of "No" 
maybe with some reasoning as to why not.

Yet will a QID emoji essentially be _de facto_ a character even if not 
_de jure_ a character?

For a QID emoji will not just be "markup using existing characters from 
the ISO/IEC 10646 standard that is synchronized with Unicode", such as 
would be a markup that anyone could devise for use in his or her 
research and experimentation or indeed some public use, it will be a 
Unicode Inc. endorsed "whatever" that is very closely linked to The 
Unicode Standard even if not deemed to be part of it.

As I understand the situation, in some countries people take no (formal) 
notice as such of The Unicode Standard but rely solely on ISO/IEC 10646. 
Often this may well present no practical problems in information 
technology and its applications because the two standards are 
synchronized each with the other.

Yet if QID emoji are implemented by Unicode Inc. without also being 
implemented by ISO/IEC 10646 then that could lead to future problems, 
notwithstanding any _de jure_ situation that QID emoji are not 
characters, because they will be much more than Private Use characters 
yet less than characters that are in ISO/IEC 10646.

I am in favour of the encoding of the QID emoji mechanism and its 
practical application. However I wonder about what are the consequences 
for interoperability and communication if QID emoji become used - maybe 
quite widely - and yet the tag sequences are not discernable in meaning 
from ISO/IEC 10646 or any related ISO/IEC documents.

William Overington

Wednesday 30 October 2019


From unicode at unicode.org  Wed Oct 30 14:18:44 2019
From: unicode at unicode.org (Ken Whistler via Unicode)
Date: Wed, 30 Oct 2019 12:18:44 -0700
Subject: New Public Review on QID emoji
In-Reply-To: <5eff0ea4.a4d.16e1dc1e66b.Webtop.52@btinternet.com>
References: <5eff0ea4.a4d.16e1dc1e66b.Webtop.52@btinternet.com>
Message-ID: <3d02402e-6ab0-2417-8c23-c958f3ae9092@sonic.net>


On 10/30/2019 10:41 AM, wjgo_10009 at btinternet.com via Unicode wrote:
>
> At present I have a question to which I cannot find the answer.
>
> Is the QID emoji format, if approved by the Unicode Technical 
> Committee going to be sent to the ISO/IEC 10646 committee for 
> consideration by that committee?
No.
>
> As the QID emoji format is in a Unicode Technical Standard and does 
> not include the encoding of any new _atomic_ characters, I am 
> concerned that the answer to the above question may well be along the 
> lines of "No" maybe with some reasoning as to why not.
As you surmised.
>
> Yet will a QID emoji essentially be _de facto_ a character even if not 
> _de jure_ a character?
That distinction is effectively meaningless. There are any number of 
entities that end users perceive as "characters", which are not 
represented by a single code point in the Unicode Standard (or 10646) -- 
and this has been the case now for decades.
>
>
> Yet if QID emoji are implemented by Unicode Inc. without also being 
> implemented by ISO/IEC 10646 then that could lead to future problems, 
> notwithstanding any _de jure_ situation that QID emoji are not 
> characters, because they will be much more than Private Use characters 
> yet less than characters that are in ISO/IEC 10646.

What you are missing is that *many* emoji are already represented by 
sequences of characters. See emoji modifier sequences, emoji flag 
sequences, emoji ZWJ sequences. *None* of those are specified in 10646, 
have not been for years now, and never will be. And yet, there is no de 
jure standardization crisis here, or any interoperability issue for 
emoji arising from that situation.

>
> I am in favour of the encoding of the QID emoji mechanism and its 
> practical application. However I wonder about what are the 
> consequences for interoperability and communication if QID emoji 
> become used - maybe quite widely - and yet the tag sequences are not 
> discernable in meaning from ISO/IEC 10646 or any related ISO/IEC 
> documents.

There may well be interoperability concerns specifically for the QID 
emoji mechanism, but that would be an issue pertaining to the 
architecture of that mechanism specifically. It isn't anything to do 
with the relationship between the Unicode Standard (and UTS #51) and 
ISO/IEC 10646.

--Ken