From cldr-users at unicode.org Fri Nov 15 20:17:29 2019
From: cldr-users at unicode.org (Kip Cole via CLDR-Users)
Date: Sat, 16 Nov 2019 10:17:29 +0800
Subject: Transform rule syntax clarifications
References: <0A126112-AFBE-428A-A6EF-DAEC87FEDB25@gmail.com>
Message-ID: <3D859B79-8910-414B-A9CF-E48833673A43@gmail.com>

I'm implementing the transform rules and would appreciate a few confirmations or corrections:

? ($glower $ddot?) $rough ? H | ? $1 ;
    The "$ddot?" is interpreted as "optional $ddot" in the usual regex meaning?

$accent_minus = [[$accent]-[$iotasub$macron]];
    The "[[..]-[..]]" is regex character set negation?

$notAbove = [[:^ccc=0:] & [:^ccc=230:]];
    The "[[..]&[..]]" is regex character set intersection?

| $1 $iotasub ? ($evowel $macron $accentMinus *) i ;
    That the "*" here is "zero or more times $accentMinus" in the usual regex meaning? And "$1" is the capture result in the usual regex meaning too?

t ($notAbove+) ? ; # ARABIC LETTER TEH MARBUTA
    The "+" is the usual regex meaning of "one or more times"?

Many thanks,
-Kip

From cldr-users at unicode.org Sat Nov 16 15:18:00 2019
From: cldr-users at unicode.org (Cameron Dutro via CLDR-Users)
Date: Sat, 16 Nov 2019 13:18:00 -0800
Subject: Transform rule syntax clarifications
In-Reply-To: <3D859B79-8910-414B-A9CF-E48833673A43@gmail.com>
References: <0A126112-AFBE-428A-A6EF-DAEC87FEDB25@gmail.com> <3D859B79-8910-414B-A9CF-E48833673A43@gmail.com>
Message-ID:

Hey Kip,

I'm certainly not an expert, but I did write the current transforms implementation in TwitterCLDR, so I think I can be of some help here.

The transform rules syntax is very similar to regular expressions, so your intuition that "$ddot?" is interpreted as "optional $ddot" is correct. You are also correct as to the meaning of the asterisk; it should work the same as it does in the regex world.

The other bits of syntax you've mentioned are from the Unicode Set specification, which you can find in UTS #35. Unicode Sets are like regex character classes, but as you've noticed, there are a couple of special operations they support that regexes don't. Specifically, the "-" operator is the symmetric difference between the two operands (UTS 35 says "asymmetric difference," but I don't think that's a thing - I can't find any definition of it online). The "&" operator is the set intersection of the two operands, and no operator is their union.

Hope that helps!

-Cameron

On Fri, Nov 15, 2019 at 6:19 PM Kip Cole via CLDR-Users <cldr-users at unicode.org> wrote:

> I'm implementing the transform rules and would appreciate a few
> confirmations or corrections:
>
> ? ($glower $ddot?) $rough ? H | ? $1 ;
> The "$ddot?" is interpreted as "optional $ddot"
> in the usual regex meaning?
>
> $accent_minus = [[$accent]-[$iotasub$macron]];
> The "[[..]-[..]]" is regex character set negation?
>
> $notAbove = [[:^ccc=0:] & [:^ccc=230:]];
> The "[[..]&[..]]" is regex character set intersection?
>
> | $1 $iotasub ? ($evowel $macron $accentMinus *) i ;
> That the "*" here is "zero or more times $accentMinus" in the
> usual regex meaning? And "$1" is the capture result in the usual regex
> meaning too?
>
> t ($notAbove+) ? ; # ARABIC LETTER TEH MARBUTA
> The "+" is the usual regex meaning of "one or more times"?
>
> Many thanks, -Kip
>
> _______________________________________________
> CLDR-Users mailing list
> CLDR-Users at unicode.org
> http://unicode.org/mailman/listinfo/cldr-users
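
The "usual regex meaning" of the operators confirmed above can be illustrated with Python's `re` module. This is only a rough sketch of how `?`, `*`, `+` and capture groups behave in an ordinary regex engine; it is not a CLDR transform implementation, and the stand-ins chosen for `$glower`, `$ddot` and `$accentMinus` are hypothetical, not the definitions from the CLDR transform files.

```python
import re

# Hypothetical stand-ins for the rule variables; the real definitions
# live in the CLDR transform files and are not reproduced here.
GLOWER = "[a-z]"            # stand-in for $glower
DDOT = "\u0308"             # COMBINING DIAERESIS, stand-in for $ddot
ACCENT = "[\u0300\u0301]"   # grave/acute, stand-in for $accentMinus

# "?" makes the preceding element optional; parentheses capture (like $1).
optional = re.compile(f"({GLOWER}{DDOT}?)")
# "*" means zero or more repetitions, "+" means one or more.
zero_or_more = re.compile(f"a({ACCENT}*)i")
one_or_more = re.compile(f"t({ACCENT}+)")


def first_group(pattern, text):
    """Return capture group 1 of the first match, or None if nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None
```

With these stand-ins, `first_group(zero_or_more, "ai")` yields an empty capture because `*` accepts zero accents, whereas `first_group(one_or_more, "t")` finds no match because `+` requires at least one.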

From cldr-users at unicode.org Sat Nov 16 15:18:43 2019
From: cldr-users at unicode.org (Cameron Dutro via CLDR-Users)
Date: Sat, 16 Nov 2019 13:18:43 -0800
Subject: Transform rule syntax clarifications
In-Reply-To:
References: <0A126112-AFBE-428A-A6EF-DAEC87FEDB25@gmail.com> <3D859B79-8910-414B-A9CF-E48833673A43@gmail.com>
Message-ID:

By the way, what language are you writing your implementation in? Is it open-source? Would love to take a look if possible :)

-Cameron

From cldr-users at unicode.org Sat Nov 16 19:48:56 2019
From: cldr-users at unicode.org (Kip Cole via CLDR-Users)
Date: Sun, 17 Nov 2019 09:48:56 +0800
Subject: Transform rule syntax clarifications
In-Reply-To:
References:
Message-ID: <1AFBE4B9-00DF-4842-BF7F-CB7E77AD1ADD@gmail.com>

Cameron, it's for the Elixir language. Numbers, Date/time, Lists and Units are mostly done. Messages is partly done. Transforms is the current work. All at https://github.com/elixir-cldr

Thanks for the confirmation and pointers for transform rules. And for your work on TwitterCLDR, it's definitely been an inspiration (I was more a Ruby person for quite a while but now mostly working in Elixir).

> On 17 Nov 2019, at 5:19 am, via CLDR-Users wrote:
>
> From: Cameron Dutro via CLDR-Users
> Subject: Re: Transform rule syntax clarifications
> Date: 17 November 2019 at 5:18:43 am GMT+8
> To: Kip Cole
> Cc: cldr-users
> Reply-To: Cameron Dutro
>
> By the way, what language are you writing your implementation in? Is it open-source? Would love to take a look if possible :)
>
> -Cameron
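
The `$notAbove = [[:^ccc=0:] & [:^ccc=230:]]` definition quoted earlier in this thread intersects two negated property sets, so it should contain exactly the characters whose canonical combining class is neither 0 nor 230. A minimal sketch of that membership test, assuming Python's standard `unicodedata.combining()` as the source of the combining class (the helper name `in_not_above` is made up for illustration and is not part of any CLDR library):

```python
import unicodedata


def in_not_above(ch: str) -> bool:
    """Membership test mirroring [[:^ccc=0:] & [:^ccc=230:]]."""
    ccc = unicodedata.combining(ch)  # canonical combining class, 0 if none
    not_ccc_0 = ccc != 0             # [:^ccc=0:]
    not_ccc_230 = ccc != 230         # [:^ccc=230:]
    return not_ccc_0 and not_ccc_230  # "&" is set intersection
```

For instance, U+0323 COMBINING DOT BELOW (ccc 220) is in the set, while the base letter "a" (ccc 0) and U+0301 COMBINING ACUTE ACCENT (ccc 230) are not.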

From cldr-users at unicode.org Sat Nov 16 20:37:24 2019
From: cldr-users at unicode.org (Richard Wordingham via CLDR-Users)
Date: Sun, 17 Nov 2019 02:37:24 +0000
Subject: Transform rule syntax clarifications
In-Reply-To:
References: <0A126112-AFBE-428A-A6EF-DAEC87FEDB25@gmail.com> <3D859B79-8910-414B-A9CF-E48833673A43@gmail.com>
Message-ID: <20191117023724.1f8bdf5a@JRWUBU2>

On Sat, 16 Nov 2019 13:18:00 -0800 Cameron Dutro via CLDR-Users wrote:

> The other bits of syntax you've mentioned are from the Unicode Set
> specification, which you can find in UTS #35. Unicode Sets are
> like regex character classes, but as you've noticed, there are a
> couple of special operations they support that regexes don't.
> Specifically, the "-" operator is the symmetric difference
> between the two operands (UTS 35 says "asymmetric difference,"
> but I don't think that's a thing - I can't find any definition
> of it online).

It very much is a thing! In this particular case,

$accent_minus = [[$accent]-[$iotasub$macron]];

is probably the same as the symmetric difference, because from the names I think everything in the second set is in the first set, but this doesn't always apply. [abcd] - [abef] is [cd], not the symmetric difference [cdef].

Richard.

From cldr-users at unicode.org Sun Nov 17 09:18:43 2019
From: cldr-users at unicode.org (Mark Davis ☕️ via CLDR-Users)
Date: Sun, 17 Nov 2019 15:18:43 +0000
Subject: Transform rule syntax clarifications
In-Reply-To: <20191117023724.1f8bdf5a@JRWUBU2>
References: <0A126112-AFBE-428A-A6EF-DAEC87FEDB25@gmail.com> <3D859B79-8910-414B-A9CF-E48833673A43@gmail.com> <20191117023724.1f8bdf5a@JRWUBU2>
Message-ID:

{phone}

On Sun, Nov 17, 2019, 02:38 Richard Wordingham via CLDR-Users <cldr-users at unicode.org> wrote:

> On Sat, 16 Nov 2019 13:18:00 -0800
> Cameron Dutro via CLDR-Users wrote:
>
> > The other bits of syntax you've mentioned are from the Unicode Set
> > specification, which you can find in UTS #35. Unicode Sets are
> > like regex character classes, but as you've noticed, there are a
> > couple of special operations they support that regexes don't.

Some regexes do: https://regular-expressions.mobi/charclasssubtract.html?wlr=1

> > Specifically, the "-" operator is the symmetric difference
> > between the two operands (UTS 35 says "asymmetric difference,"
> > but I don't think that's a thing - I can't find any definition
> > of it online).
>
> It very much is a thing! In this particular case,
>
> $accent_minus = [[$accent]-[$iotasub$macron]];
>
> is probably the same as the symmetric difference, because from
> the names I think everything in the second set is in the first set, but
> this doesn't always apply. [abcd] - [abef] is [cd], not the symmetric
> difference [cdef].

Also Unicodeset alternatives don't support backup.

> Richard.

From cldr-users at unicode.org Sun Nov 17 22:41:47 2019
From: cldr-users at unicode.org (Martin J. Dürst via CLDR-Users)
Date: Mon, 18 Nov 2019 04:41:47 +0000
Subject: Transform rule syntax clarifications
In-Reply-To: <20191117023724.1f8bdf5a@JRWUBU2>
References: <0A126112-AFBE-428A-A6EF-DAEC87FEDB25@gmail.com> <3D859B79-8910-414B-A9CF-E48833673A43@gmail.com> <20191117023724.1f8bdf5a@JRWUBU2>
Message-ID: <1cc46a2e-e7e9-6956-b59b-13414ef5eb28@it.aoyama.ac.jp>

On 2019/11/17 11:37, Richard Wordingham via CLDR-Users wrote:
> On Sat, 16 Nov 2019 13:18:00 -0800
> Cameron Dutro via CLDR-Users wrote:
>
>> The other bits of syntax you've mentioned are from the Unicode Set
>> specification, which you can find in UTS #35. Unicode Sets are
>> like regex character classes, but as you've noticed, there are a
>> couple of special operations they support that regexes don't.
>> Specifically, the "-" operator is the symmetric difference
>> between the two operands (UTS 35 says "asymmetric difference,"
>> but I don't think that's a thing - I can't find any definition
>> of it online).
>
> It very much is a thing!

Well, yes, except that it's usually just called "set difference" without an explicit adjective. (I'd strongly suggest that UTS 35 put the word 'asymmetric' in parentheses.) Also, one wouldn't use the symbol '-' for symmetric difference.

Regards, Martin.

> In this particular case,
>
> $accent_minus = [[$accent]-[$iotasub$macron]];
>
> is probably the same as the symmetric difference, because from
> the names I think everything in the second set is in the first set, but
> this doesn't always apply. [abcd] - [abef] is [cd], not the symmetric
> difference [cdef].
>
> Richard.

From cldr-users at unicode.org Mon Nov 18 02:22:26 2019
From: cldr-users at unicode.org (Mark Davis ☕️ via CLDR-Users)
Date: Mon, 18 Nov 2019 08:22:26 +0000
Subject: Transform rule syntax clarifications
In-Reply-To: <1cc46a2e-e7e9-6956-b59b-13414ef5eb28@it.aoyama.ac.jp>
References: <0A126112-AFBE-428A-A6EF-DAEC87FEDB25@gmail.com> <3D859B79-8910-414B-A9CF-E48833673A43@gmail.com> <20191117023724.1f8bdf5a@JRWUBU2> <1cc46a2e-e7e9-6956-b59b-13414ef5eb28@it.aoyama.ac.jp>
Message-ID:

It should use A?B when talking about the math operation. Could you file a ticket?

{phone}

From cldr-users at unicode.org Mon Nov 18 03:03:18 2019
From: cldr-users at unicode.org (Martin J. Dürst via CLDR-Users)
Date: Mon, 18 Nov 2019 09:03:18 +0000
Subject: Transform rule syntax clarifications
In-Reply-To:
References: <0A126112-AFBE-428A-A6EF-DAEC87FEDB25@gmail.com> <3D859B79-8910-414B-A9CF-E48833673A43@gmail.com> <20191117023724.1f8bdf5a@JRWUBU2> <1cc46a2e-e7e9-6956-b59b-13414ef5eb28@it.aoyama.ac.jp>
Message-ID:

On 2019/11/18 17:22, Mark Davis ☕️ wrote:
> It should use A?B when talking about the math operation. Could you file a
> ticket?

Done. Regards, Martin.

From cldr-users at unicode.org Mon Nov 18 04:41:44 2019
From: cldr-users at unicode.org (Mark Davis ☕️ via CLDR-Users)
Date: Mon, 18 Nov 2019 10:41:44 +0000
Subject: Transform rule syntax clarifications
In-Reply-To:
References: <0A126112-AFBE-428A-A6EF-DAEC87FEDB25@gmail.com> <3D859B79-8910-414B-A9CF-E48833673A43@gmail.com> <20191117023724.1f8bdf5a@JRWUBU2> <1cc46a2e-e7e9-6956-b59b-13414ef5eb28@it.aoyama.ac.jp>
Message-ID:

I was about to point people to that, and request people to add additional comments on clarifications of the syntax, but I couldn't find your ticket at https://unicode-org.atlassian.net/issues/?filter=10033. Did you file it for CLDR?

Mark

From cldr-users at unicode.org Mon Nov 18 04:50:15 2019
From: cldr-users at unicode.org (Martin J. Dürst via CLDR-Users)
Date: Mon, 18 Nov 2019 10:50:15 +0000
Subject: Transform rule syntax clarifications
In-Reply-To:
References: <0A126112-AFBE-428A-A6EF-DAEC87FEDB25@gmail.com> <3D859B79-8910-414B-A9CF-E48833673A43@gmail.com> <20191117023724.1f8bdf5a@JRWUBU2> <1cc46a2e-e7e9-6956-b59b-13414ef5eb28@it.aoyama.ac.jp>
Message-ID:

Hello Mark,

On 2019/11/18 19:41, Mark Davis ☕️ wrote:
> I was about to point people to that, and request people to add additional
> comments on clarifications of the syntax, but I couldn't find your ticket
> at https://unicode-org.atlassian.net/issues/?filter=10033. Did you file it
> for CLDR?

I filed it at https://www.unicode.org/reporting.html, because I thought that that was where to file for UTS #35. Feel free to transfer it from there to wherever you think it should go.

Regards, Martin.

From cldr-users at unicode.org Mon Nov 18 06:24:38 2019
From: cldr-users at unicode.org (Mark Davis ☕️ via CLDR-Users)
Date: Mon, 18 Nov 2019 12:24:38 +0000
Subject: Transform rule syntax clarifications
In-Reply-To:
References: <0A126112-AFBE-428A-A6EF-DAEC87FEDB25@gmail.com> <3D859B79-8910-414B-A9CF-E48833673A43@gmail.com> <20191117023724.1f8bdf5a@JRWUBU2> <1cc46a2e-e7e9-6956-b59b-13414ef5eb28@it.aoyama.ac.jp>
Message-ID:

Thanks! For cldr/ldml/ICU we have real big tracking systems. I'll move it.

{phone}

From cldr-users at unicode.org Mon Nov 18 23:52:35 2019
From: cldr-users at unicode.org (Cameron Dutro via CLDR-Users)
Date: Mon, 18 Nov 2019 21:52:35 -0800
Subject: Transform rule syntax clarifications
In-Reply-To: <20191117023724.1f8bdf5a@JRWUBU2>
References: <0A126112-AFBE-428A-A6EF-DAEC87FEDB25@gmail.com> <3D859B79-8910-414B-A9CF-E48833673A43@gmail.com> <20191117023724.1f8bdf5a@JRWUBU2>
Message-ID:

Ah ok, that explains why it's difficult to Google for this. The asymmetric difference is simply the removal of every instance of the elements of one set from another, but because sets only contain unique elements by default, the asymmetric difference is really just the set difference. Does that sound about right?

Glad the wording will be adjusted in the docs :)

-Cameron

From cldr-users at unicode.org Tue Nov 19 07:14:02 2019
From: cldr-users at unicode.org (Mark Davis ☕️ via CLDR-Users)
Date: Tue, 19 Nov 2019 13:14:02 +0000
Subject: Transform rule syntax clarifications
In-Reply-To:
References: <0A126112-AFBE-428A-A6EF-DAEC87FEDB25@gmail.com> <3D859B79-8910-414B-A9CF-E48833673A43@gmail.com> <20191117023724.1f8bdf5a@JRWUBU2>
Message-ID:

That's not it. Eg

A = {c d}
B = {d e}

Diff (asym) = {c}

Diff (sym) = {c d}

{phone}

From cldr-users at unicode.org Tue Nov 19 12:53:33 2019
From: cldr-users at unicode.org (Asmus Freytag via CLDR-Users)
Date: Tue, 19 Nov 2019 10:53:33 -0800
Subject: Transform rule syntax clarifications
In-Reply-To:
References: <0A126112-AFBE-428A-A6EF-DAEC87FEDB25@gmail.com> <3D859B79-8910-414B-A9CF-E48833673A43@gmail.com> <20191117023724.1f8bdf5a@JRWUBU2>
Message-ID: <0aa90b0d-2986-5ff9-388f-a9df0a1ffd81@ix.netcom.com>

An HTML attachment was scrubbed...

From cldr-users at unicode.org Tue Nov 19 13:01:38 2019
From: cldr-users at unicode.org (Richard Wordingham via CLDR-Users)
Date: Tue, 19 Nov 2019 19:01:38 +0000
Subject: Transform rule syntax clarifications
In-Reply-To:
References: <0A126112-AFBE-428A-A6EF-DAEC87FEDB25@gmail.com> <3D859B79-8910-414B-A9CF-E48833673A43@gmail.com> <20191117023724.1f8bdf5a@JRWUBU2>
Message-ID: <20191119190138.512d8576@JRWUBU2>

On Tue, 19 Nov 2019 13:14:02 +0000 Mark Davis ☕️ via CLDR-Users wrote:

> That's not it. Eg
>
> A = {c d}
> B = {d e}
>
> Diff (asym) = {c}
>
> Diff (sym) = {c d}

I trust you mean

Diff (sym) = {c e}

ICU Unicode set 'symmetric' difference is probably more complicated.

Richard.

From cldr-users at unicode.org Tue Nov 19 14:42:25 2019
From: cldr-users at unicode.org (Mark Davis ☕️ via CLDR-Users)
Date: Tue, 19 Nov 2019 20:42:25 +0000
Subject: Transform rule syntax clarifications
In-Reply-To: <20191119190138.512d8576@JRWUBU2>
References: <0A126112-AFBE-428A-A6EF-DAEC87FEDB25@gmail.com> <3D859B79-8910-414B-A9CF-E48833673A43@gmail.com> <20191117023724.1f8bdf5a@JRWUBU2> <20191119190138.512d8576@JRWUBU2>
Message-ID:

Yes, typo. And thanks also to Asmus for the clarification (got lazy since am on phone).

{phone}

From cldr-users at unicode.org Tue Nov 19 23:25:08 2019
From: cldr-users at unicode.org (Cameron Dutro via CLDR-Users)
Date: Tue, 19 Nov 2019 21:25:08 -0800
Subject: Transform rule syntax clarifications
In-Reply-To:
References: <0A126112-AFBE-428A-A6EF-DAEC87FEDB25@gmail.com> <3D859B79-8910-414B-A9CF-E48833673A43@gmail.com> <20191117023724.1f8bdf5a@JRWUBU2> <20191119190138.512d8576@JRWUBU2>
Message-ID:

The difference of two sets is dependent on order just as it is for subtraction. Mark's example is identical to the "regular" difference:

{c, d} asym diff {d, e} = {c}

just as

{c, d} - {d, e} = {c}

What am I missing here?

-Cameron
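
The distinction discussed in this thread can be checked with ordinary Python sets, used here purely as an analogy for single-character Unicode sets. As Martin noted, the "asymmetric" difference is the usual set difference; it differs from the symmetric difference only when the second operand has elements the first lacks. The helper `fmt` is just an illustrative name, and nothing here models sets that contain strings or any ICU-specific behaviour.

```python
def fmt(chars):
    """Render a set of characters as a sorted string, e.g. {'d', 'c'} -> 'cd'."""
    return "".join(sorted(chars))


a = set("abcd")
b = set("abef")

difference = a - b            # elements of a that are not in b
symmetric_difference = a ^ b  # elements in exactly one of a and b
intersection = a & b          # elements in both
union = a | b                 # elements in either

# When the second set is a subset of the first, the two differences agree,
# which is the situation Richard guessed for $accent minus {$iotasub, $macron}.
superset = set("abcd")
subset = set("ab")
agree_on_subset = (superset - subset) == (superset ^ subset)
```

Here `fmt(difference)` is "cd" and `fmt(symmetric_difference)` is "cdef", matching Richard's example, and the same operators applied to {c, d} and {d, e} give {c} and {c, e} respectively.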

From cldr-users at unicode.org Wed Nov 20 15:05:59 2019
From: cldr-users at unicode.org (Diego Plentz via CLDR-Users)
Date: Wed, 20 Nov 2019 16:05:59 -0500
Subject: CLDR release process
Message-ID:

Hi,

I'd like to understand the release process for a new version of CLDR a little better, especially with regard to the code that is already merged in the GitHub repo: after a PR is approved and merged, does that information get validated again, or can I assume it is going to be in the next version?

I'm asking because here at Shopify we're considering using the master branch directly to update our internal tools and always get the most up-to-date data.

Kind regards

From cldr-users at unicode.org Sat Nov 30 01:04:41 2019
From: cldr-users at unicode.org (Cameron Dutro via CLDR-Users)
Date: Fri, 29 Nov 2019 23:04:41 -0800
Subject: Break Rules
Message-ID:

Hey everyone,

I'm looking at ICU's text segmentation implementation and have noticed that the break rules used internally in ICU are not the same as the rules included in the CLDR. They appear to contain slightly different syntax in some cases while others are just straight up different (or missing entirely). ICU also appears to contain "title" break rules which are not present in the CLDR data set.

Some questions:

1. Why does ICU use a different set of break rules than what's specified in the CLDR?
2. Can the title break rules be contributed back to CLDR?
3. It doesn't look like ICU passes all the Unicode line break tests. Why?

Thanks!

-Cameron

From cldr-users at unicode.org Sat Nov 30 10:51:46 2019
From: cldr-users at unicode.org (Markus Scherer via CLDR-Users)
Date: Sat, 30 Nov 2019 08:51:46 -0800
Subject: Break Rules
In-Reply-To:
References:
Message-ID:

I think I can answer for the "title" rules.

Many years ago, the Unicode Standard titlecasing algorithm specified its own way of finding text positions for where to apply titlecasing. That got replaced by referring to the word segmentation spec.

I believe that ICU's BreakIterator.getTitleInstance() implements the old rules. We have deprecated it in favor of getWordInstance().

So I would argue against propagating the "title" rules.

markus