From cameron at lumoslabs.com  Tue Nov  8 12:43:14 2016
From: cameron at lumoslabs.com (Cameron Dutro)
Date: Tue, 8 Nov 2016 10:43:14 -0800
Subject: Inconsistent RBNF Data?
Message-ID:

Hey everyone,

I'm running into a strange inconsistency between ICU's output and the data available in CLDR when formatting numbers using RBNF rules.

One specific example is the spellout-cardinal-feminine rule set in Spanish. In CLDR v30 and v29, the rule for 101 is "ciento", which is incorrect for the feminine case. ICU, however, formats feminine spellouts correctly by using "cienta".

Where in the world is ICU getting its data? Why does it appear as if ICU isn't actually using the currently available CLDR data?

Thanks for your help,

-Cameron

From srl at icu-project.org  Tue Nov  8 13:27:11 2016
From: srl at icu-project.org (Steven R. Loomis)
Date: Tue, 08 Nov 2016 11:27:11 -0800
Subject: Inconsistent RBNF Data?
Message-ID: <34DC01B8-B8BD-4D30-8905-1D56BB644021@icu-project.org>

It can be helpful to give some ICU source code, and which version is being used.

But probably relevant is http://unicode.org/cldr/trac/changeset/9025 - perhaps you are comparing an ICU older than this commit?

-s

On 11/8/16, 10:43 AM, "CLDR-Users on behalf of Cameron Dutro" wrote:
> One specific example is the spellout-cardinal-feminine rule set in
> Spanish. In CLDR v30 and v29, the rule for 101 is "ciento", which is
> incorrect for the feminine case.
[...]

From cameron at lumoslabs.com  Tue Nov  8 14:20:13 2016
From: cameron at lumoslabs.com (Cameron Dutro)
Date: Tue, 8 Nov 2016 12:20:13 -0800
Subject: Inconsistent RBNF Data?
In-Reply-To: <34DC01B8-B8BD-4D30-8905-1D56BB644021@icu-project.org>
References: <34DC01B8-B8BD-4D30-8905-1D56BB644021@icu-project.org>
Message-ID:

Ah right, I forgot to mention the version of ICU. I'm using v57.1, which I thought was the version that corresponds to CLDR v29.

The source code is actually Ruby code (running on JRuby). You can see the code in question here.

Steven, it looks like that changeset was submitted 3 years ago but isn't reflected in v29 or v30 of CLDR (though it appears to have made it into ICU somehow).

Thanks for your help!

-Cameron

On Tue, Nov 8, 2016 at 11:27 AM, Steven R. Loomis wrote:
> But probably relevant is http://unicode.org/cldr/trac/changeset/9025 -
> perhaps you are comparing an ICU older than this commit?
[...]
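For reference, the spellout Cameron describes can be reproduced with a few lines of ICU4J. This is a minimal sketch; the class and rule-set names are real ICU4J/CLDR identifiers, but the printed word ("ciento..." vs. "cienta...") depends on which ICU/CLDR version is actually on the classpath:

    import com.ibm.icu.text.RuleBasedNumberFormat;
    import com.ibm.icu.util.ULocale;

    public class SpelloutDemo {
        public static void main(String[] args) {
            // Build a spellout formatter for Spanish from the bundled CLDR data.
            RuleBasedNumberFormat rbnf = new RuleBasedNumberFormat(
                    new ULocale("es"), RuleBasedNumberFormat.SPELLOUT);
            // Select the public rule set under discussion.
            rbnf.setDefaultRuleSet("%spellout-cardinal-feminine");
            System.out.println(rbnf.format(101));
        }
    }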
From kent.karlsson14 at telia.com  Tue Nov  8 14:56:48 2016
From: kent.karlsson14 at telia.com (Kent Karlsson)
Date: Tue, 08 Nov 2016 21:56:48 +0100
Subject: Inconsistent RBNF Data?
In-Reply-To: <34DC01B8-B8BD-4D30-8905-1D56BB644021@icu-project.org>
Message-ID:

My question is whether the corresponding patch should be applied for Portuguese, which currently uses "cento" also for the feminine case.

BUT NOTE THAT: the current version for Spanish has that patch reversed, according to CLDR Ticket #6461.

/Kent K

On 2016-11-08 20:27, "Steven R. Loomis" wrote:
> It can be helpful to give some ICU source code, and which version is
> being used.
>
> But probably relevant is http://unicode.org/cldr/trac/changeset/9025 -
> perhaps you are comparing an ICU older than this commit?
[...]
From srl at icu-project.org  Tue Nov  8 15:18:15 2016
From: srl at icu-project.org (Steven R. Loomis)
Date: Tue, 08 Nov 2016 13:18:15 -0800
Subject: Inconsistent RBNF Data?
Message-ID: <7593D67E-0CC0-42FE-B355-8A1307F241D8@icu-project.org>

It's not a reversed patch; it just claims that it should be "ciento" for 101 and NOT "cienta".

-s

On 11/8/16, 12:56 PM, "CLDR-Users on behalf of Kent Karlsson" wrote:
> BUT NOTE THAT: the current version for Spanish has that patch reversed,
> according to CLDR Ticket #6461.
[...]

From cameron at lumoslabs.com  Tue Nov  8 17:04:31 2016
From: cameron at lumoslabs.com (Cameron Dutro)
Date: Tue, 8 Nov 2016 15:04:31 -0800
Subject: Inconsistent RBNF Data?
In-Reply-To: <7593D67E-0CC0-42FE-B355-8A1307F241D8@icu-project.org>
References: <34DC01B8-B8BD-4D30-8905-1D56BB644021@icu-project.org> <7593D67E-0CC0-42FE-B355-8A1307F241D8@icu-project.org>
Message-ID:

Huh, ok, I didn't realize that "ciento" is correct. The question still remains: why does ICU generate "cienta" instead?

-Cameron

On Tue, Nov 8, 2016 at 1:18 PM, Steven R. Loomis wrote:
> It's not a reversed patch; it just claims that it should be "ciento"
> for 101 and NOT "cienta".
[...]
From srl at icu-project.org  Tue Nov  8 17:16:56 2016
From: srl at icu-project.org (Steven R. Loomis)
Date: Tue, 08 Nov 2016 15:16:56 -0800
Subject: Inconsistent RBNF Data?
Message-ID: <3F9EF0E8-5689-48BC-A1E6-80446A363F16@icu-project.org>

So, can you reproduce the issue with ICU4C or ICU4J of a certain version?

There's an API to request the CLDR version. In ICU4C you can use the "icuinfo" app or ulocdata_getCLDRVersion(); in ICU4J you can do "java -jar icu4j.jar" or LocaleData.getCLDRVersion().

On 11/8/16, 12:20 PM, "CLDR-Users on behalf of Cameron Dutro" wrote:
> Ah right, I forgot to mention the version of ICU. I'm using v57.1,
> which I thought was the version that corresponds to CLDR v29.
[...]
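In ICU4J the version check Steven mentions is a one-liner. A minimal sketch; LocaleData.getCLDRVersion() and VersionInfo are real ICU4J APIs, and the version shown in the comment is the pairing stated in this thread:

    import com.ibm.icu.util.LocaleData;
    import com.ibm.icu.util.VersionInfo;

    public class CldrVersionCheck {
        public static void main(String[] args) {
            // Reports the CLDR version the loaded ICU4J data was built from,
            // e.g. 29 for ICU 57.1 if the right jar is actually loaded.
            VersionInfo cldr = LocaleData.getCLDRVersion();
            System.out.println("CLDR version: " + cldr);
        }
    }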
From cameron at lumoslabs.com  Wed Nov  9 11:11:27 2016
From: cameron at lumoslabs.com (Cameron Dutro)
Date: Wed, 9 Nov 2016 09:11:27 -0800
Subject: Inconsistent RBNF Data?
In-Reply-To: <3F9EF0E8-5689-48BC-A1E6-80446A363F16@icu-project.org>
Message-ID:

Hey Steven et al.,

This turned out to be my fault. I had two different versions of ICU on my classpath, one recent and one quite old. I *thought* the newer one was loaded, but the older one took precedence because it occurred earlier in the classpath, and it caused my script to generate invalid test cases.

As you suggested, I wrote a small bit of Java code to try to reproduce the problem, which to my surprise produced the correct result.

Apologies for dragging everyone into this! Thank you all for your help :)

-Cameron

On Tue, Nov 8, 2016 at 3:16 PM, Steven R. Loomis wrote:
> So, can you reproduce the issue with ICU4C or ICU4J of a certain
> version?
[...]
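A quick way to catch this kind of classpath mixup is to ask the JVM where the ICU classes were actually loaded from. A hedged sketch; getProtectionDomain/getCodeSource are standard JDK APIs, VersionInfo.ICU_VERSION is a real ICU4J field, and the jar names in the comment are hypothetical:

    import com.ibm.icu.text.RuleBasedNumberFormat;
    import com.ibm.icu.util.VersionInfo;

    public class WhichIcu {
        public static void main(String[] args) {
            // Prints the jar that RuleBasedNumberFormat was loaded from,
            // e.g. .../icu4j-57.1.jar vs. some stale .../icu4j-4.x.jar.
            System.out.println(RuleBasedNumberFormat.class
                    .getProtectionDomain().getCodeSource().getLocation());
            // And the ICU release the loaded classes report themselves as.
            System.out.println(VersionInfo.ICU_VERSION);
        }
    }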
From srl at icu-project.org  Wed Nov  9 11:43:25 2016
From: srl at icu-project.org (Steven R. Loomis)
Date: Wed, 09 Nov 2016 09:43:25 -0800
Subject: Inconsistent RBNF Data?
Message-ID: <16E73C89-04BB-48AE-B45E-0D966573B11E@icu-project.org>

Cameron,

Thanks for following up! Cc George?

Steven

On 11/9/16, 9:11 AM, "CLDR-Users on behalf of Cameron Dutro" wrote:
> This turned out to be my fault. I had two different versions of ICU on
> my classpath, one recent and one quite old.
[...]

From kent.karlsson14 at telia.com  Wed Nov  9 15:23:29 2016
From: kent.karlsson14 at telia.com (Kent Karlsson)
Date: Wed, 09 Nov 2016 22:23:29 +0100
Subject: Inconsistent RBNF Data?
In-Reply-To: <7593D67E-0CC0-42FE-B355-8A1307F241D8@icu-project.org>
Message-ID:

Right, I misread.

I note that both Spanish and Portuguese have "ciento" as corrections to the "-feminine" cases. That seems to be mostly correct (even if slightly counterintuitive for those of us who have not grown up with Spanish)... Though it is not entirely universal:

* http://libraryofthewolves.blogspot.se/2012/06/cienta-o-ciento-como-es-correcto.html (though in this regard it only refers to Spanish as spoken in Dominicana...)
* http://www.monografias.com/trabajos89/barrick-gold-cotui/barrick-gold-cotui.shtml (Dominicana again; written usage, not just a reference)
* http://documents.mx/documents/lizania-55a93191902ba.html (anthology, Spanish author)

(Note also that "sienta"/"siento" is sometimes misspelled as "cienta"/"ciento", in case you do a search.)

The RBNF for es-DO maybe should keep the "cienta" for feminine...

/Kent K

On 2016-11-08 22:18, "Steven R. Loomis" wrote:
> It's not a reversed patch; it just claims that it should be "ciento"
> for 101 and NOT "cienta".
[...]
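If a Dominican variant were wanted, it would hang off locale fallback. A hedged sketch: the APIs below are real ICU4J calls, but the claim that es_DO currently falls back to the parent "es" RBNF rules is an assumption about the data (CLDR appears to carry no es-DO-specific RBNF file):

    import com.ibm.icu.text.RuleBasedNumberFormat;
    import com.ibm.icu.util.ULocale;

    public class EsDoSpellout {
        public static void main(String[] args) {
            // Asking for es_DO presumably yields the parent "es" rules today;
            // a "cienta" variant would require a dedicated es_DO rule file.
            RuleBasedNumberFormat rbnf = new RuleBasedNumberFormat(
                    new ULocale("es_DO"), RuleBasedNumberFormat.SPELLOUT);
            rbnf.setDefaultRuleSet("%spellout-cardinal-feminine");
            System.out.println(rbnf.format(101));
        }
    }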
Loomis" : > It?s not a reversed patch, it just claims that it should be ciento for 101 > and NOT cienta. > > -s > > El 11/8/16 12:56 PM, "CLDR-Users en nombre de Kent Karlsson" > > escribi?: > >> Re: Inconsistent RBNF Data? >> >> My question is if the corresponding patch should be applied for Portuguese, >> which currently use "cento" also for the feminine case. >> >> BUT NOTE THAT: The current version for Spanish has that patch reversed, >> according to CLDR Ticket #6461 . >> >> /Kent K >> >> Den 2016-11-08 20:27, skrev "Steven R. Loomis" : >> >>> It can be helpful give some ICU source code, and which version is being >>> used. >>> >>> But probably relevant is http://unicode.org/cldr/trac/changeset/9025 ? >>> perhaps you are comparing an ICU older than this commit? >>> >>> -s >>> >>> El 11/8/16 10:43 AM, "CLDR-Users en nombre de Cameron Dutro" >>> >>> escribi?: >>> >>>> Hey everyone, >>>> >>>> I'm running into a strange inconsistency between ICU's output and the data >>>> available in CLDR when formatting numbers using RBNF rules. >>>> >>>> One specific example is the spellout-cardinal-feminine rule set in Spanish. >>>> In CLDR v30 >>>> >>> l#L128> and v29 >>>> >>> 28> , the rule for 101 is "ciento" which is incorrect for the feminine >>>> case. ICU however formats feminine spellouts correctly by using "cienta." >>>> >>>> Where in the world is ICU getting its data? Why does it appear as if ICU >>>> isn't actually using the currently available CLDR data? >>>> >>>> Thanks for your help, >>>> >>>> -Cameron >>>> _______________________________________________ CLDR-Users mailing list >>>> CLDR-Users at unicode.org http://unicode.org/mailman/listinfo/cldr-users >>> >>> >>> _______________________________________________ >>> CLDR-Users mailing list >>> CLDR-Users at unicode.org >>> http://unicode.org/mailman/listinfo/cldr-users >> _______________________________________________ CLDR-Users mailing list >> CLDR-Users at unicode.org http://unicode.org/mailman/listinfo/cldr-users > -------------- next part -------------- An HTML attachment was scrubbed... URL: From fios at foramnagaidhlig.net Thu Nov 10 02:42:44 2016 From: fios at foramnagaidhlig.net (=?UTF-8?Q?F=c3=b2ram_na_G=c3=a0idhlig?=) Date: Thu, 10 Nov 2016 08:42:44 +0000 Subject: Inconsistent RBNF Data? In-Reply-To: References: Message-ID: Sgr?obh Kent Karlsson na leanas 09/11/2016 aig 21:23: > Right, I misread. > > I note that both Spanish and Portuguese has "ciento" as corrections to > the "-feminine" cases. That's not quite true for Portuguese - the forms are a bit different. I had a quick look at the RBNF though and it looks correct there. http://unicode.org/cldr/trac/browser/tags/release-30-d05/common/rbnf/pt.xml#L78 From rxaviers at gmail.com Thu Nov 10 04:30:22 2016 From: rxaviers at gmail.com (Rafael Xavier) Date: Thu, 10 Nov 2016 08:30:22 -0200 Subject: Inconsistent RBNF Data? In-Reply-To: References: Message-ID: > > That's not quite true for Portuguese - the forms are a bit different. I > had a quick look at the RBNF though and it looks correct there. > > http://unicode.org/cldr/trac/browser/tags/release-30-d05/ > common/rbnf/pt.xml#L78 > +1 confirming existing RBNF for pt is correct (as a native Portuguese speaker and [1]). 1: http://veja.abril.com.br/blog/sobre-palavras/consultorio/duzentas-mil-pessoas-ou-duzentos-mil-pessoas/ On Thu, Nov 10, 2016 at 6:42 AM, F?ram na G?idhlig wrote: > Sgr?obh Kent Karlsson na leanas 09/11/2016 aig 21:23: > > Right, I misread. 
From mats.gbproject at gmail.com  Thu Nov 10 16:54:21 2016
From: mats.gbproject at gmail.com (Mats Blakstad)
Date: Thu, 10 Nov 2016 23:54:21 +0100
Subject: Dataset for all ISO 639 codes sorted by country/territory?
In-Reply-To: <20161110105658.665a7a7059d7ee80bb4d670165c8327d.a8ff034ef1.wbe@email03.godaddy.com>
Message-ID:

I'm continuing here the discussion I started on unicode at unicode.org:
http://unicode.org/pipermail/unicode/2016-September/003964.html
Sorry for posting to the wrong email list!

On 10 November 2016 at 20:34, Shawn Steele wrote:

> I didn't really say anything because this is kinda a hopeless task, but
> it seems like some realities are being overlooked. I'm as curious about
> cataloguing everything as the next OCD guy, but a general solution
> doesn't seem practical.

Maybe in addition to the number of speakers we could give each language different values for the different territories, like official/unofficial, lingua franca/home language, recognized/not recognized, etc. (a possible data model is sketched below). Maybe we could manage to work out some more objective categories? Then the dataset could cover more different needs, and users could extract the list they want: for example, a list of only the official languages in the world sorted by country/territory, or a list of all non-recognized languages in different countries.

> * There are a *lot* of languages.

Yes :) We would not get them all at the start, but we could begin adding data for all the languages little by little. I personally have many contacts who I think would be interested in helping to add information.

> * Many countries have speakers of several languages.
> * In the US it's "obvious" that a list of languages for the US should
>   include "English".

For sure! The number of speakers, and the fact that it is the primary language used, speaks for it. Besides, isn't "US English" considered a variant of English?

> * Spanish in the US is less obvious, however it is often considered
>   important.

That's an interesting issue. Wasn't Spanish the primary language in the southern US while it was part of Mexico? And aren't there a lot of Spanish newspapers/media in the US?

> * However, that's a slippery slope, as there are many other languages
>   with large groups of speakers in the US. If such a list includes
>   Spanish, should it not include some of the others? San Francisco
>   requires documents in 4 languages but provides telephone help for 200
>   languages. Where's the line?
> * Some languages happen in many places. There are a disproportionate #
>   of Englishes in CLDR; however, Chinese is also spoken in many of the
>   countries that have English available in CLDR, yet CLDR doesn't
>   provide data for those.

Could you elaborate a little bit on this?

> * Some language/region combinations could encounter geopolitical issues.
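A rough sketch of the per-territory categorization proposed above. None of these names exist in CLDR; they are hypothetical, purely to illustrate the categories being discussed:

    // Hypothetical data model for the proposal above; illustrative only.
    enum OfficialStatus { OFFICIAL, UNOFFICIAL, NOT_RECOGNIZED }
    enum UsageRole { LINGUA_FRANCA, HOME_LANGUAGE }

    record LanguageInTerritory(
            String languageCode,   // ISO 639, e.g. "tr"
            String territoryCode,  // ISO 3166-1, e.g. "DE"
            long speakers,         // documented number of speakers
            int spokenSince,       // e.g. 1607 for English in the US
            OfficialStatus status,
            UsageRole role) {}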
> Like "it's not legal for that language to be spoken in XX" (but it
>   happens). Or "that language isn't YY country's language, it's ours!!!"

We could add the documented number of speakers and tag it as "not recognized".

> * The requirement "where the language has been spoken traditionally" is
>   really, really subjective. "Traditionally" the US is an
>   English-speaking country. However, "traditionally", there are hundreds
>   of languages that have been spoken in the US. What could be more
>   "traditional" than the Native American languages? Yet those often have
>   low numbers of speakers in the modern world; many are even dying
>   languages. There are also a number of "traditional" languages spoken
>   by the original settlers, which differ from the set of languages
>   spoken by modern immigrants. So your data is going to be very skewed
>   depending on the data collector's definition of "traditional".

I agree "traditional" is not a good way to collect the data. Native American languages should of course be mapped to territories despite having few speakers. The point is to map all languages. We could also map languages with years: English has been spoken in what is today the USA since 1607; Urdu has been spoken in what is today Norway since the 1970s.

> Ethnologue has done a decent job of identifying languages and the number
> of speakers in various areas, but it would be very difficult to draw a
> line that selected "English and Spanish in the US" and was consistent
> with similar real-life impacts across the other languages. Do you pick
> the top n languages for each country? Languages with > x million
> speakers (that would be very different in small and big countries)?
> Languages with > y% of the speakers in the different countries?

If Ethnologue has done it, I guess it should also be possible for CLDR? However, they operate with a category "Immigrant Languages", and I'm not sure what that means. As an example, Turkish, the second most spoken language of Germany, is marked as an "Immigrant Language"; I'm not sure how they make that distinction.

> And then you end up with each application having to figure out its own
> bar. Applications will have different market considerations and other
> reasons to target different regions/languages. That would skew any list
> for their purposes.

Okay, but at least it could be possible to add the number of speakers for the other "6,300 lesser-known living languages" - or why do we cut the list to 675 languages?

From kent.karlsson14 at telia.com  Thu Nov 10 19:10:14 2016
From: kent.karlsson14 at telia.com (Kent Karlsson)
Date: Fri, 11 Nov 2016 02:10:14 +0100
Subject: Inconsistent RBNF Data?
Message-ID:

The RBNF source can be hard to follow (especially for the more complicated cases; even Italian is quite complex).

Though I used my own program for testing nearly a decade ago when I worked on this, there is now a public web page (not made by me, but by the person who took over maintaining the RBNF rules) for testing RBNFs:

http://st.unicode.org/cldr-apps/numbers.jsp

This is easier to follow than the rules themselves and can be used to find errors and test fixes to the RBNF rules. Note that the rules there are in ICU format, not in the XML format found in CLDR. You can edit the rules, and the numbers to be used for testing.

/Kent K

On 2016-11-10 11:30, "Rafael Xavier" wrote:
> +1, confirming the existing RBNF for pt is correct (as a native
> Portuguese speaker, and per [1]).
[...]
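The same edit-and-test loop works offline with ICU4J, which accepts an ICU-format rule description string directly. A minimal sketch; the constructor is a real ICU4J API, and the tiny rule set here is made up purely for illustration (paste edited rules from CLDR in its place):

    import com.ibm.icu.text.RuleBasedNumberFormat;
    import com.ibm.icu.util.ULocale;

    public class RbnfRuleTest {
        public static void main(String[] args) {
            // A deliberately tiny, made-up spellout rule set in ICU format.
            String rules =
                "%simple:\n" +
                "    0: zero; one; two; three; four; five; six; seven; eight; nine;\n" +
                "    10: ten;\n" +
                "    11: =#,##0=;\n";
            RuleBasedNumberFormat rbnf =
                    new RuleBasedNumberFormat(rules, new ULocale("en"));
            System.out.println(rbnf.format(7));  // -> "seven"
        }
    }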
From cameron at lumoslabs.com  Thu Nov 10 21:26:30 2016
From: cameron at lumoslabs.com (Cameron Dutro)
Date: Fri, 11 Nov 2016 03:26:30 +0000
Subject: Inconsistent RBNF Data?
Message-ID:

That's a fantastic resource, thanks Kent!

-Cameron

On Thu, Nov 10, 2016 at 5:11 PM, Kent Karlsson wrote:
> there is now a public web page (not made by me, but by the person who
> took over maintaining the RBNF rules) for testing RBNFs:
>
> http://st.unicode.org/cldr-apps/numbers.jsp
[...]
From verdy_p at wanadoo.fr  Fri Nov 11 01:44:17 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 11 Nov 2016 08:44:17 +0100
Subject: Inconsistent RBNF Data?
Message-ID:

Well, for French ordinals, this still uses the old rules, with spaces instead of hyphens.

Fractions in French are spelled (like in English) as: cardinal (numerator) + space + ordinal (denominator), where the ordinal (denominator) takes the singular or plural form according to the value of the leading cardinal (numerator) and French plural rules.

With the old spelling rules for ordinals, which still use spaces (like cardinals), it is difficult to guess the value meant in:

- "(les) deux cent sept millièmes"

But this ambiguity is solved cleanly in all cases by always using hyphens instead of spaces within ordinals. So a fraction will contain its last space just before the full ordinal of the denominator:

- "(le) deux-cent-sept-millième" = (the) 1/207,000 = the 207,000th (singular)
- "(les) deux-cent-sept-millièmes" = (the) 1/207,000 = the 207,000th (plural)
- "(les) deux cent-sept-millièmes" = (the) 2/107,000 (plural only)
- "(les) deux cent sept-millièmes" = (the) 200/7,000 (plural only)
- "(les) deux-cent-sept millièmes" = (the) 207/1,000 (plural only)

----
Note: The cardinal (for the numerator) traditionally keeps using spaces instead of hyphens, except between tens and units, as in:

- "(tens)-et-un" (10n+1), "(tens)-deux" (10n+2), ... "(tens)-neuf" (10n+9)
- "soixante-dix" (70) = "septante" (in Belgian French and Swiss French)
- "soixante-et-onze" (71) = "septante-et-un" (in Belgian French and Swiss French)
- "soixante-douze" (72) = "septante-deux" (in Belgian French and Swiss French)
- "soixante-treize" (73) = "septante-trois" (in Belgian French and Swiss French)
- "soixante-quatorze" (74) = "septante-quatre" (in Belgian French and Swiss French)
- "soixante-quinze" (75) = "septante-cinq" (in Belgian French and Swiss French)
- "soixante-seize" (76) = "septante-six" (in Belgian French and Swiss French)
- "soixante dix-sept" (77) = "septante-sept" (in Belgian French and Swiss French)
- "soixante dix-huit" (78) = "septante-huit" (in Belgian French and Swiss French)
- "soixante dix-neuf" (79) = "septante-neuf" (in Belgian French and Swiss French)
- "quatre-vingt" (80) = "octante" (in Belgian French and Swiss French)
- "quatre-vingt-un" (81) = "octante-et-un" (in Belgian French and Swiss French)
- "quatre-vingt-deux" (82) = "octante-deux" (in Belgian French and Swiss French)
- "quatre-vingt-trois" (83) = "octante-trois" (in Belgian French and Swiss French)
- "quatre-vingt-quatre" (84) = "octante-quatre" (in Belgian French and Swiss French)
- "quatre-vingt-cinq" (85) = "octante-cinq" (in Belgian French and Swiss French)
- "quatre-vingt-six" (86) = "octante-six" (in Belgian French and Swiss French)
- "quatre-vingt-sept" (87) = "octante-sept" (in Belgian French and Swiss French)
- "quatre-vingt-huit" (88) = "octante-huit" (in Belgian French and Swiss French)
- "quatre-vingt-neuf" (89) = "octante-neuf" (in Belgian French and Swiss French)

But even in this list of cardinals, spaces are also possible everywhere instead of hyphens between tens and units; the hyphen is strongly recommended only in "quatre-vingt". It is only the most common usage to use hyphens between tens and units in cardinals.

In legal documents, cardinals are written using hyphens everywhere instead of spaces.

In all cases (traditional, most common, or legal), the space remains mandatory (a hyphen is strictly forbidden there) only between the numerator cardinal and the denominator ordinal of a fraction.

On 2016-11-11 02:10 GMT+01:00, Kent Karlsson wrote:
> there is now a public web page (not made by me, but by the person who
> took over maintaining the RBNF rules) for testing RBNFs:
>
> http://st.unicode.org/cldr-apps/numbers.jsp
[...]
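To see what the current French rules actually produce, one can enumerate the public rule sets instead of guessing their names. A minimal sketch using real ICU4J APIs; the exact spellings printed (spaces vs. hyphens) depend on the ICU/CLDR version:

    import com.ibm.icu.text.RuleBasedNumberFormat;
    import com.ibm.icu.util.ULocale;

    public class FrenchSpellouts {
        public static void main(String[] args) {
            RuleBasedNumberFormat rbnf = new RuleBasedNumberFormat(
                    ULocale.FRENCH, RuleBasedNumberFormat.SPELLOUT);
            // Print 207000 under every public rule set, including the
            // ordinal ones discussed above, to inspect the separators.
            for (String name : rbnf.getRuleSetNames()) {
                rbnf.setDefaultRuleSet(name);
                System.out.println(name + ": " + rbnf.format(207000));
            }
        }
    }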
From verdy_p at wanadoo.fr  Fri Nov 11 03:53:58 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 11 Nov 2016 10:53:58 +0100
Subject: Inconsistent RBNF Data?
Message-ID:

Basically, the needed changes for French ordinals are these:

[...]
%%spellout-ordinal:
[...]
- 200: <%spellout-cardinal-masculine< cent>%%cents-o>;
+ 200: <%spellout-cardinal-masculine<-cent>%%cents-o>;
1000: mill>%%mille-o>;
- 2000: <%%spellout-leading< mill>%%mille-o>;
+ 2000: <%%spellout-leading<-mill>%%mille-o>;
- 1000000: <%%spellout-leading< million>%%cents-o>;
+ 1000000: <%%spellout-leading<-million>%%cents-o>;
- 1000000000: <%%spellout-leading< milliard>%%cents-o>;
+ 1000000000: <%%spellout-leading<-milliard>%%cents-o>;
- 1000000000000: <%%spellout-leading< billion>%%cents-o>;
+ 1000000000000: <%%spellout-leading<-billion>%%cents-o>;
- 1000000000000000: <%%spellout-leading< billiard>%%cents-o>;
+ 1000000000000000: <%%spellout-leading<-billiard>%%cents-o>;
1000000000000000000: =#,##0=;
[...]

On 2016-11-11 08:44, Philippe Verdy wrote:
> Well, for French ordinals, this still uses the old rules, with spaces
> instead of hyphens.
[...]
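One way to test a diff like this locally: dump the stock rules, apply the edits, and rebuild the formatter. A sketch; toString() on a RuleBasedNumberFormat really does return the rule description, but the replace() call below is only a stand-in for applying the full diff above:

    import com.ibm.icu.text.RuleBasedNumberFormat;
    import com.ibm.icu.util.ULocale;

    public class PatchedFrenchRules {
        public static void main(String[] args) {
            RuleBasedNumberFormat stock = new RuleBasedNumberFormat(
                    ULocale.FRENCH, RuleBasedNumberFormat.SPELLOUT);
            // toString() yields the full rule description in ICU format.
            String rules = stock.toString();
            // Naive stand-in for the diff: swap one spaced literal for its
            // hyphenated form; a real test would apply every +/- line.
            String patched = rules.replace("< cent>", "<-cent>");
            RuleBasedNumberFormat fixed =
                    new RuleBasedNumberFormat(patched, ULocale.FRENCH);
            System.out.println(fixed.format(207000));
        }
    }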
From verdy_p at wanadoo.fr  Fri Nov 11 03:59:11 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 11 Nov 2016 10:59:11 +0100
Subject: Inconsistent RBNF Data?
Message-ID:

I forgot this additional subrule, to replace spaces with hyphens in ordinals:

[...]
%%cents-o:
0: ième;
1: -=%%et-unieme=;
- 2: ' =%%spellout-ordinal=;
+ 2: -=%%spellout-ordinal=;
11: -et-onzième;
- 12: ' =%%spellout-ordinal=;
+ 12: -=%%spellout-ordinal=;
[...]

On 2016-11-11 10:53, Philippe Verdy wrote:
> Basically, the needed changes for French ordinals are these:
[...]
From verdy_p at wanadoo.fr  Fri Nov 11 04:03:12 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 11 Nov 2016 11:03:12 +0100
Subject: Inconsistent RBNF Data?
Message-ID:

As well, this additional subrule:

[...]
%%mille-o:
0: ième;
1: e-=%%et-unieme=;
- 2: e =%%spellout-ordinal=;
+ 2: e-=%%spellout-ordinal=;
11: e-et-onzième;
- 12: e =%%spellout-ordinal=;
+ 12: e-=%%spellout-ordinal=;
[...]

On 2016-11-11 10:59, Philippe Verdy wrote:
> I forgot this additional subrule, to replace spaces with hyphens in
> ordinals:
[...]
>>> So a fraction will contain its last space just before the full ordinal for the denominator:
>>>
>>> - "(le) deux-cent-sept-millième" = (the) 1 / 207,000 = the 207,000th (singular)
>>> - "(les) deux-cent-sept-millièmes" = (the) 1 / 207,000 = the 207,000th (plural)
>>> - "(les) deux cent-sept-millièmes" = (the) 2 / 107,000 (plural only)
>>> - "(les) deux cent sept-millièmes" = (the) 200 / 7,000 (plural only)
>>> - "(les) deux-cent-sept millièmes" = (the) 207 / 1,000 (plural only)
>>> [...]
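Anyone wanting to try rule edits like the ones above locally can feed a rule description string straight to ICU4J's RuleBasedNumberFormat. The sketch below is illustrative only: %with-spaces and %with-hyphens are invented toy rule sets covering just 0-69 (the "et-un" forms for 21, 31, ... are omitted for brevity), not the CLDR French data; they only show how the space-vs-hyphen choice inside a rule body changes the output, which is the substance of the diffs above.

// Minimal ICU4J sketch; assumes icu4j on the classpath.
import com.ibm.icu.text.RuleBasedNumberFormat;

public class RbnfHyphenSketch {
    // Two invented toy rule sets, NOT the CLDR French rules.
    private static final String RULES =
          "%with-spaces:\n"
        + "    zéro; un; deux; trois; quatre; cinq; six; sept; huit; neuf;\n"
        + "    dix; onze; douze; treize; quatorze; quinze; seize;\n"
        + "    dix-sept; dix-huit; dix-neuf;\n"
        + "    20: vingt[ >>]; 30: trente[ >>]; 40: quarante[ >>];\n"
        + "    50: cinquante[ >>]; 60: soixante[ >>];\n"
        + "%with-hyphens:\n"
        + "    zéro; un; deux; trois; quatre; cinq; six; sept; huit; neuf;\n"
        + "    dix; onze; douze; treize; quatorze; quinze; seize;\n"
        + "    dix-sept; dix-huit; dix-neuf;\n"
        + "    20: vingt[->>]; 30: trente[->>]; 40: quarante[->>];\n"
        + "    50: cinquante[->>]; 60: soixante[->>];\n";

    public static void main(String[] args) {
        RuleBasedNumberFormat rbnf = new RuleBasedNumberFormat(RULES);
        System.out.println(rbnf.format(66, "%with-spaces"));   // soixante six
        System.out.println(rbnf.format(66, "%with-hyphens"));  // soixante-six
    }
}

The same constructor takes a full rule description, so a copy of the fr.xml rules with the proposed +/- lines applied can be tested the same way before filing a ticket.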
From kent.karlsson14 at telia.com Fri Nov 11 06:45:21 2016
From: kent.karlsson14 at telia.com (Kent Karlsson)
Date: Fri, 11 Nov 2016 13:45:21 +0100
Subject: Inconsistent RBNF Data?
In-Reply-To: Message-ID:

When I submitted the (initial) rules, French had "all hyphens". That was removed later, by request from some French translator(s?). I protested, but they are still not restored. I suggest you file a CLDR ticket to renew the issue for the CLDR committee.

/Kent K

On 2016-11-11 08:44, "Philippe Verdy" wrote:

> Well, for French ordinals this still uses the old rules with spaces instead of hyphens. [...]
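Independently of the test page, the question of which CLDR data a given ICU build actually embeds can be answered by dumping the rule sets ICU itself ships. A minimal ICU4J sketch (icu4j on the classpath is assumed); the rule set names and outputs depend on the embedded CLDR version, which is exactly what makes the dump useful when comparing ICU releases:

import com.ibm.icu.text.RuleBasedNumberFormat;
import java.util.Locale;

public class DumpFrenchSpellout {
    public static void main(String[] args) {
        // Load the spellout rules this ICU build ships for French.
        RuleBasedNumberFormat rbnf =
            new RuleBasedNumberFormat(Locale.FRENCH, RuleBasedNumberFormat.SPELLOUT);
        // Format one sample value with every public rule set.
        for (String name : rbnf.getRuleSetNames()) {
            System.out.println(name + " -> " + rbnf.format(207000, name));
        }
        // rbnf.toString() returns the compiled rule source, which can be
        // diffed against common/rbnf/fr.xml in a CLDR release.
    }
}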
From hugh_paterson at sil.org Fri Nov 11 00:03:55 2016
From: hugh_paterson at sil.org (Hugh Paterson)
Date: Thu, 10 Nov 2016 22:03:55 -0800
Subject: Dataset for all ISO639 code sorted by country/territory?
In-Reply-To: References: <20161110105658.665a7a7059d7ee80bb4d670165c8327d.a8ff034ef1.wbe@email03.godaddy.com>
Message-ID:

Regarding determining the scope of usage of languages: have you considered looking at the EGIDS values of languages in the Ethnologue? That column would be very useful to answer your questions in more general terms. I agree that the indexing of institutional support for languages in diaspora is something of a quagmire to untangle.

- Hugh Paterson III

On Thu, Nov 10, 2016 at 2:54 PM, Mats Blakstad wrote:

> I'm continuing the discussion I started on unicode at unicode.org here:
> http://unicode.org/pipermail/unicode/2016-September/003964.html
> Sorry for posting in the wrong email list!
>
> On 10 November 2016 at 20:34, Shawn Steele wrote:
>
>> I didn't really say anything because this is kinda a hopeless task, but it seems like some realities are being overlooked. I'm as curious about cataloguing everything as the next OCD guy, but a general solution doesn't seem practical.
> Maybe in addition to number of speakers we could give each language different values for the different territories, like official / unofficial, lingua franca / home language, recognized / not recognized, etc. Maybe we could manage to work out some more objective categories? Then the dataset could cover more different needs; those who want to use it could extract the list they want. For example, they could make a list of only the official languages in the world sorted by country/territory, or maybe a list of all non-recognized languages in different countries.
>
>> * There are a *lot* of languages
>
> Yes :) We would not get everything at the start, but if we could start adding data for all the languages, it can be done little by little. For myself, I have many contacts that I think could be interested in helping to add information.
>
>> * Many countries have speakers of several languages.
>> * In the US it's "obvious" that a list of languages for the US should include "English"
>
> For sure! The number of speakers, and that it is the primary language used, speak for it. Besides, is not "US English" considered a variant of English?
>
>> * Spanish in the US is less obvious, however it is often considered important.
>
> It is an interesting issue. Wasn't Spanish the primary language in the southern US while it was a part of Mexico? And is there not a lot of Spanish newspapers/media in the US?
>
>> * However, that's a slippery slope, as there are many other languages with large groups of speakers in the US. If such a list includes Spanish, should it not include some of the others? San Francisco requires documents in 4 languages but provides telephone help for 200 languages. Where's the line?
>> * Some languages happen in many places. There are a disproportionate # of Englishes in CLDR; however, Chinese is also spoken in lots of the countries that have English available in CLDR. Yet CLDR doesn't provide data for those.
>
> Could you elaborate a little bit on this?
>
>> * Some language/region combinations could encounter geopolitical issues. Like "it's not legal for that language to be spoken in XX" (but it happens). Or "that language isn't YY country's language, it's ours!!!"
>
> We could add the documented number of speakers and tag it as "not recognized".
>
>> * The requirement "where the language has been spoken traditionally" is really, really subjective. "Traditionally" the US is an English speaking country. However, "traditionally", there are hundreds of languages that have been spoken in the US. What could be more "traditional" than the native American languages? Yet those often have low numbers of speakers in the modern world; many are even dying languages. There are also a number of "traditional" languages spoken by the original settlers, which differ from the set of languages spoken by modern immigrants. So your data is going to be very skewed depending on the data collector's definition of "traditional".
>
> I agree "traditional" is not a good way to collect the data. Native American languages should of course be mapped with territories despite having few speakers. The point is to map all languages. We could also map languages with years: English has then been spoken in what is the USA today since 1607, and Urdu in what is today Norway since the 1970s.
>
>> Ethnologue has done a decent job of identifying languages and the number of speakers in various areas, but it would be very difficult to draw a line that selected "English and Spanish in the US" and was consistent with similar real-life impacts across the other languages. Do you pick the top n languages for each country? Languages with > x million speakers (that would be very different in small and big countries)? Languages with > y% of the speakers in the different countries?
>
> If Ethnologue has done it, I guess it should also be possible for CLDR? However, they operate with a category "Immigrant Languages", and I'm not sure what that means. For example Turkish, the second most spoken language of Germany, is marked as an "Immigrant Language"; I'm not sure how they make that distinction.
>
>> And then you end up with each application having to figure out its own bar. Applications will have different market considerations and other reasons to target different regions/languages. That would skew any list for their purposes.
>
> Okay, at least it could be possible to add the number of speakers for the other "6,300 lesser-known living languages". Or why do we cut the list at 675 languages?

From hugh_paterson at sil.org Wed Nov 16 11:42:28 2016
From: hugh_paterson at sil.org (Hugh Paterson)
Date: Wed, 16 Nov 2016 09:42:28 -0800
Subject: Dataset for all ISO639 code sorted by country/territory?
In-Reply-To: References: <20161110105658.665a7a7059d7ee80bb4d670165c8327d.a8ff034ef1.wbe@email03.godaddy.com>
Message-ID:

Also, after thinking about this some more: if, as is the stated case with San Francisco, "San Francisco requires documents in 4 languages but provides telephone help for 200 languages. Where's the line?" - how would you propose that Unicode database maintainers de-list institutional support for languages when institutional support ceases? I.e., let's say that San Francisco falls on some hard times financially, cannot afford to operate in 4 languages, and reduces its support to two languages. How is this to be reflected in this proposal?

- Hugh Paterson III

On Thu, Nov 10, 2016 at 2:54 PM, Mats Blakstad wrote:

> I'm continuing the discussion I started on unicode at unicode.org here:
> http://unicode.org/pipermail/unicode/2016-September/003964.html [...]
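Part of the official/unofficial tagging discussed in this thread already exists in CLDR: in supplementalData.xml, the territoryInfo section's languagePopulation elements may carry an officialStatus attribute. A rough sketch of pulling that out, assuming a local CLDR checkout; the file path is an assumption, and the element and attribute names are as found in recent CLDR releases, so verify against your own copy:

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class OfficialLanguages {
    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        // Skip fetching the CLDR DTD referenced by the file's DOCTYPE.
        dbf.setFeature(
            "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        Document doc = dbf.newDocumentBuilder()
            .parse("common/supplemental/supplementalData.xml"); // assumed path
        NodeList territories = doc.getElementsByTagName("territory");
        for (int i = 0; i < territories.getLength(); i++) {
            Element territory = (Element) territories.item(i);
            NodeList langs = territory.getElementsByTagName("languagePopulation");
            for (int j = 0; j < langs.getLength(); j++) {
                Element lang = (Element) langs.item(j);
                String status = lang.getAttribute("officialStatus");
                if (!status.isEmpty()) {
                    System.out.println(territory.getAttribute("type") + "\t"
                        + lang.getAttribute("type") + "\t" + status);
                }
            }
        }
    }
}

Whether a value ever gets removed again when support ceases (Hugh's question) is a data-maintenance policy question that this kind of extraction cannot answer by itself.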
From mats.gbproject at gmail.com Sun Nov 20 11:41:10 2016
From: mats.gbproject at gmail.com (Mats Blakstad)
Date: Sun, 20 Nov 2016 18:41:10 +0100
Subject: Dataset for all ISO639 code sorted by country/territory?
In-Reply-To: References: <20161110105658.665a7a7059d7ee80bb4d670165c8327d.a8ff034ef1.wbe@email03.godaddy.com>
Message-ID:

I think it would be good to be able to add years to the language data, so that if Tagalog stopped being official because it became too expensive for California, we could say it was official until 2016. I think this would also be helpful for language population, as the figures can be collected from different years, and it would be easier to see whether the numbers are really outdated:
http://www.unicode.org/cldr/charts/latest/supplemental/language_territory_information.html

I opened two tickets in CLDR:
http://unicode.org/cldr/trac/ticket/9916
http://unicode.org/cldr/trac/ticket/9915

On 16 November 2016 at 18:42, Hugh Paterson wrote:

> Also, after thinking about this some more: if, as is the stated case with San Francisco, "San Francisco requires documents in 4 languages but provides telephone help for 200 languages. Where's the line?" [...]
From mats.gbproject at gmail.com Sun Nov 20 12:30:25 2016
From: mats.gbproject at gmail.com (Mats Blakstad)
Date: Sun, 20 Nov 2016 19:30:25 +0100
Subject: Dataset for all ISO639 code sorted by country/territory?
In-Reply-To: References: <20161110105658.665a7a7059d7ee80bb4d670165c8327d.a8ff034ef1.wbe@email03.godaddy.com>
Message-ID:

On 20 Nov 2016 7:09 pm, "Shawn Steele" wrote:

> Knowing "official" languages at the city level doesn't seem that interesting to me. How do people/software developers use the data?

I agree that it is not necessary with data at the city level. What I suggested was to provide it for subdivisions. One use case could be to provide translations in a regional language to users from that region (e.g. provide Catalan translations to people from Catalonia).

> Ethnologue shows more Finnish speakers than Creek speakers in the US. Certainly, the languages that are spoken only within a region have a special relationship (but some seem missing?), but how do the other "immigrant" languages like Korean get chosen?

In the tickets I've opened, I've not suggested any definition of immigrant languages. As long as we have a data source, we could add population for a language to a territory.

> More than xx% of the speakers? More than a million speakers? Also, the percentages seem pretty different than Ethnologue; does CLDR have a better source?

I also wonder about the sources!

> Tagalog isn't even listed for US (even in Ethnologue?), so having a date range, particularly for that, seems silly.

That the data is not there today is a poor argument for not providing it.

> But, again, how is this data used?
From doug at ewellic.org Sun Nov 20 12:54:08 2016
From: doug at ewellic.org (Doug Ewell)
Date: Sun, 20 Nov 2016 11:54:08 -0700
Subject: Dataset for all ISO639 code sorted by country/territory?
Message-ID:

Mats, I think you are genuinely underestimating the time and effort that this project would take.
--Doug Ewell | Thornton, CO, US | ewellic.org

From mats.gbproject at gmail.com Sun Nov 20 13:32:21 2016
From: mats.gbproject at gmail.com (Mats Blakstad)
Date: Sun, 20 Nov 2016 20:32:21 +0100
Subject: Dataset for all ISO639 code sorted by country/territory?
In-Reply-To: References: <20161110105658.665a7a7059d7ee80bb4d670165c8327d.a8ff034ef1.wbe@email03.godaddy.com>
Message-ID:

On 20 November 2016 at 19:54, Shawn Steele wrote:

>> That the data is not there today is a poor argument for not providing it.
>
> Ethnologue mentions the US has nearly 291 living languages + 11 extinct languages.

Is this including or excluding "immigrant" languages?

> Should all 291 be listed in the table?

Yes, for sure.

> It seems to me that if CLDR is "merely" a copy of Ethnologue, then as a software developer I may prefer to go straight to the source.

The problem arises when that source is not open source and you need to pay money to access or use the dataset. As a developer, I may prefer to make use of and change the data in any way that suits me, without asking anyone for permission. Besides, making it an open source dataset can also help increase the quality, as more people can help develop and correct the data.

> I'm also not sure what I'm supposed to do knowing that there are 291 living languages in the US. I'm probably not going to localize to all of them.

Maybe you will not, but maybe others want to. Knowing the languages of your target markets is important.

> If I have a language picker, it seems to me that I'd perhaps want a shorter list of more common languages, but also I'd prefer users be able to pick a language not on the list.

Having CLDR provide information on lesser-known languages will of course not force you to have a long list of languages on your website, or prevent you from letting your users choose other languages. In fact you get a much more flexible solution: you can easily extract exactly the list you want, e.g. all languages spoken by more than 100 000 people. CLDR simply provides the data, and you choose yourself exactly how you want to use it.

From mats.gbproject at gmail.com Sun Nov 20 13:35:14 2016
From: mats.gbproject at gmail.com (Mats Blakstad)
Date: Sun, 20 Nov 2016 20:35:14 +0100
Subject: Dataset for all ISO639 code sorted by country/territory?
In-Reply-To: References: Message-ID:

I understand it would take a lot of time to collect the full data, but it also depends on how much engagement you manage to create for the work. On the other side: simply allowing users to start providing the data is the first step in the process, and that would take very little time!

On 20 November 2016 at 19:54, Doug Ewell wrote:

> Mats, I think you are genuinely underestimating the time and effort that this project would take.
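The "extract exactly the list you want" idea comes down to simple arithmetic over CLDR-shaped figures: a territory population plus per-language percentages. A sketch with invented placeholder numbers (none of these are real CLDR data):

import java.util.Map;

public class SpeakerThreshold {
    public static void main(String[] args) {
        long territoryPopulation = 5_000_000L;           // invented
        Map<String, Double> percentByLanguage = Map.of(  // invented codes/values
            "aaa", 62.0, "bbb", 4.5, "ccc", 0.8);
        percentByLanguage.forEach((lang, pct) -> {
            // Absolute speaker estimate from the percentage figure.
            long speakers = Math.round(territoryPopulation * pct / 100.0);
            if (speakers > 100_000) {                    // the cut-off from above
                System.out.println(lang + " ~" + speakers + " speakers");
            }
        });
    }
}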
From mats.gbproject at gmail.com Sun Nov 20 15:02:41 2016
From: mats.gbproject at gmail.com (Mats Blakstad)
Date: Sun, 20 Nov 2016 22:02:41 +0100
Subject: Dataset for all ISO639 code sorted by country/territory?
In-Reply-To: References: Message-ID:

On 20 November 2016 at 21:11, Shawn Steele wrote:

>> I understand it would take a lot of time to collect the full data, but it also depends on how much engagement you manage to create for the work.
>
> Given the CLDR track record, where all languages do not even have locale data

Why does not every language have a locale in CLDR? And should they not have one? Are the locales used not just the same ones used in the IANA subtag registry? What are the criteria for a language to be included in CLDR?

From verdy_p at wanadoo.fr Sun Nov 20 15:53:54 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sun, 20 Nov 2016 22:53:54 +0100
Subject: Dataset for all ISO639 code sorted by country/territory?
In-Reply-To: References: Message-ID:

I think it just requires a minimal dataset: ask for it, submit the data, and it will be made available for vetting; if vetting makes it suitable for publication with the minimal core set of properties, it will be added to the published list. But you need participants, and you need to convince some CLDR members to support the addition and to dedicate some competent resources to validating the surveys. Otherwise a supporting "academy" can also join the CLDR TC, to get more votes than the single vote shared by all non-members (which would not be enough for publication, or the data would only be published separately with a "draft" status, waiting for more supporters).

2016-11-20 22:02 GMT+01:00 Mats Blakstad:

> Why does not every language have a locale in CLDR? And should they not have one? [...]

From mark at macchiato.com Sun Nov 20 16:20:40 2016
From: mark at macchiato.com (Mark Davis ☕️)
Date: Sun, 20 Nov 2016 15:20:40 -0700
Subject: Dataset for all ISO639 code sorted by country/territory?
In-Reply-To: References: Message-ID:

The way we are set up now in CLDR, people can always provide additional information on language population via tickets, such as http://unicode.org/cldr/trac/ticket/9856, and on the status (official, etc.) of each language. There is already a ticket to allow the addition of subdivisions (http://unicode.org/cldr/trac/ticket/9897). We've had probably about 300 such tickets, and there are others slated for the current release.

The process is far from as simple as you state, since we need to have accessible, authoritative references for the data. And often when we look into those sources, we find that the figures stated in the ticket are simply wrong and need to be corrected, or the cited figures are themselves out of date. So any willing parties, such as you, can do the research and supply more data.

As for changes over time: the data is stated in terms of percentages of the country's population.
So if the language growth is roughly the same as the overall country's population growth, then that is reflected in the figures going forward. Of course, where the growth (or decrease) varies from the country's (which can clearly happen over time, or in case of upheavals or population movements), people should file tickets to correct the values.

Mark

BTW, in an ideal world, for each country we'd be able to collect a set of language tuples for people who are functional in each language in the tuple, with a percentage of the population that each applies to, e.g.:

75% {English}
15% {English, Spanish}
7.5% {Spanish}
...

Some countries collect and make available data that is roughly at that level in each census, but most do not. Thus we are not able to provide that kind of data (which would be very useful).

From doug at ewellic.org Sun Nov 20 16:29:13 2016
From: doug at ewellic.org (Doug Ewell)
Date: Sun, 20 Nov 2016 15:29:13 -0700
Subject: Dataset for all ISO639 code sorted by country/territory?
Message-ID: <20161120152913.665a7a7059d7ee80bb4d670165c8327d.e7fd92192a.wbe@email03.godaddy.com>

Mats Blakstad wrote:

> Why does not every language have a locale in CLDR? And should they not
> have one?

Um, because gathering this data takes a lot of time and effort? More than most people and organizations can justify for the 676th or 1,000th or 7,000th most commonly spoken language in the world? If this data were easy to gather and organize, and there were few controversies surrounding the data, I imagine much of this work would have been done already.

Suggesting that this data should be made "open source" -- which means, among other things, that anyone could change the data and the criteria for inclusion and release the changed version without restriction -- does not change the amount of effort required to do this right. There are surprisingly few people with the knowledge and expertise to collect and present this sort of information about a language spoken in a single remote village in Myanmar.

> Are the locales used not just the same ones used in the IANA subtag
> registry?

There is a lot more to locale data than the language tag. Much, much more. That would be like saying that if I know your name, I know everything about you.

> What are the criteria for a language to be included in CLDR?

You should start by reading the main CLDR page (cldr.unicode.org) and the Process page.

--
Doug Ewell | Thornton, CO, US | ewellic.org
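Mark's tuple representation is easy to aggregate mechanically. Collapsing the illustrative tuples above into per-language totals of "percentage of the population functional in this language" might look like this (the figures are from his example, not real census data):

import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class TupleReach {
    public static void main(String[] args) {
        // Each key is a tuple of languages; the value is the share of the
        // population functional in exactly that tuple.
        Map<Set<String>, Double> tuples = Map.of(
            Set.of("English"), 75.0,
            Set.of("English", "Spanish"), 15.0,
            Set.of("Spanish"), 7.5);
        Map<String, Double> reach = new TreeMap<>();
        tuples.forEach((langs, pct) ->
            langs.forEach(l -> reach.merge(l, pct, Double::sum)));
        System.out.println(reach); // {English=90.0, Spanish=22.5}
    }
}

Per-language totals can legitimately sum to more than 100% across a country, since multilingual speakers are counted once per language.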
From richard.wordingham at ntlworld.com Sun Nov 20 18:50:09 2016
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 21 Nov 2016 00:50:09 +0000
Subject: Dataset for all ISO639 code sorted by country/territory?
In-Reply-To: References: Message-ID: <20161121005009.4d250936@JRWUBU2>

On Sun, 20 Nov 2016 22:53:54 +0100 Philippe Verdy wrote:

> I think it just requires a minimal dataset: ask for it, submit the
> data, it will be made available for vetting, and if vetting makes it
> suitable for publication with the minimal core set of properties, it
> will be added to the published list.

The minimal data set can be difficult to collect, and may actually be impossible. There may be technical issues - can one actually specify that today's date is "a.d. XI Kal. Dec. a.u.c. MMDCCLXIX" in Classical Latin?

It would be good to have a proper line-breaker for pi_TH, which is Pali written in the Thai script (as opposed to pi_Khmr_TH and pi_Lana_TH, which are used in old documents) but has spaces between the words, at least where crasis or similar has not occurred.

I once sat down to assemble the minimum data needed for Latin - and found I was stumped. There just isn't much call for computer user interfaces in Latin - but support for document preparation in Latin would be handy.

For some modern languages, some of the concepts may simply not exist - one would use another language for them. That is probably the real case for most language names in most languages. Even in the UK, there is a widespread conception that Pakistani immigrants speak 'Pakistani' in the home.

I would also ask, what is en_TH? Is it the English used in Thailand by Thais, Britons, Australians or Americans? Currency and year number are the primary localisation requirements for the last three groups. Incidentally, most of the native English speakers resident in Thailand are not officially immigrants - they are present on extensions of stay granted by non-immigrant visas. For Britons resident in Thailand, the relevant locale is probably just en-GB-u-rg-thzzzz. That example would probably go for most immigrant groups.

For that matter, how well defined is es_US?

Richard.

From verdy_p at wanadoo.fr Mon Nov 21 13:08:29 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 21 Nov 2016 20:08:29 +0100
Subject: Dataset for all ISO639 code sorted by country/territory?
In-Reply-To: <20161121005009.4d250936@JRWUBU2> References: <20161121005009.4d250936@JRWUBU2>
Message-ID:

2016-11-21 1:50 GMT+01:00 Richard Wordingham:

> The minimal data set can be difficult to collect, and may actually be
> impossible. There may be technical issues - can one actually specify
> that today's date is "a.d. XI Kal. Dec. a.u.c. MMDCCLXIX" in
> Classical Latin?

If you speak about Classical Latin, "a.d. XI Kal. Dec. a.u.c. MMDCCLXIX" is the most accurate (but historic) form. But it has long been out of use in modern Latin (e.g. by the Vatican, which now uses the Gregorian calendar). In fact, when Latin was official in many countries of Europe, Gregorian-calendar Latin was already in use and the Roman Republican calendar had already been abandoned (it predates the effective Christian era in Rome). The Roman Empire was Christianized in the 4th century by Emperor Constantine, when Latin was not only an official administrative language but still a living language.
How long it took in the Middle Ages for the Julian calendar (and, 14 centuries later, the Gregorian calendar, when Latin was no longer a living or administrative language except in the Papal States) to replace the Roman Republican calendar is another question. So the question for the Latin language would be to identify which calendar is official, not how we can provide relevant and accurate calendar translations in the Latin language for the three calendars.

If you consider the "la" locale, it should by default be bound to the current modern epoch, so using the Gregorian calendar. For other historic periods you'd need at least other sublocales: one for the Roman Republic; another for the Roman Empire starting at Julius Caesar, bound to the early Julian calendar; another after Emperor Augustus (who introduced changes in month lengths to create the month of August), bound to the modern Julian calendar; and another for the introduction of the Gregorian calendar. That means 4 distinct locales in Latin. And you'd probably need further distinctions at the linguistic level for the introduction of lowercase letters in the Middle Ages (early Classical Latin was unicameral): 5 distinct locale variants for this single language in the same script!

You could as well extend this to earlier periods where Latin was still not the language of the whole Roman Empire and had various regional "Italic" variants, some of them still exhibiting Classical Greek features. Such variant distinctions still persist today in modern Greek (polytonic or monotonic): monotonic Greek is a very recent introduction and is now the official form for administrative purposes, but many Greek people still love their polytonic features. Yet Classical Greek did not have these distinctions (and early Classical Greek was also unicameral, and had various regional variants, variants in how numbers were written, or simply in the alphabet, which had additional letters now extinct in modern Greek...). Here again, how many variants will we encode in CLDR for Greek?

And in fact, is Classical Greek really the same language? (Classical Chinese, for example, uses another language code, "lzh", distinguished from modern Mandarin; the "zh" code is now no longer a single language but a collection of languages that behaves as a "macrolanguage" only in its written form. For the oral form there is a clear need for distinction, notably for Cantonese, Taiwanese and other Southern Chinese languages, even if they are unified in their written form by a script variant under "zh-Hant", whereas "Standard" Mandarin uses "zh-Hans" and the "zh" language code maps by default to this implied "Hans" form, also used outside China in Singapore, by large minorities in the Indian Ocean, and even by those living in the US!) There is no doubt, however, that the "Hant" script variant is the only one relevant for Classical Chinese ("lzh"), even if it also has multiple important variants which are very difficult to unify with the modern "Traditional" variant.

For now let's remain in scope: CLDR must first address the needs of current modern variants, as they are used today. Many other locales (or sublocales) are possible in data but will never reach CLDR standardization unless there is an active community and an authority still using the historic forms (e.g. for "nearly official" religious or ceremonial usage, which is IMHO a legitimate reason to encode them, as these historic forms are effectively not really extinct).
This remark will apply as well to Biblical Greek, Biblical/Masoretic Hebrew, Biblical Ge'ez (in Ethiopia), Biblical Georgian, or Quranic Arabic, which have significant and important differences from the vernacular modern "standard" forms of Greek, Hebrew, Ge'ez, Georgian, and Arabic: these **living** religious variants should IMHO be encoded in CLDR.

From srl at icu-project.org Mon Nov 21 16:15:01 2016
From: srl at icu-project.org (Steven R. Loomis)
Date: Mon, 21 Nov 2016 14:15:01 -0800
Subject: Dataset for all ISO639 code sorted by country/territory?
In-Reply-To: References: <20161121005009.4d250936@JRWUBU2> Message-ID:

On 11/21/16 11:08 AM, Philippe Verdy (via CLDR-Users) wrote:

> The minimal data set can be difficult to collect, and may actually be impossible. There may be technical issues - can one actually specify that today's date is "a.d. XI Kal. Dec. a.u.c. MMDCCLXIX" in Classical Latin?

Yes, you can use numbering system "roman" (uppercase):
http://www.unicode.org/repos/cldr/trunk/common/bcp47/number.xml

> For now let's remain in scope: CLDR must first address the needs of current modern variants, as they are used today. [...] these **living** religious variants should IMHO be encoded in CLDR.

If someone provides data for them and maintains them, yes.

-s

From richard.wordingham at ntlworld.com Mon Nov 21 16:36:06 2016
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 21 Nov 2016 22:36:06 +0000
Subject: Dataset for all ISO639 code sorted by country/territory?
In-Reply-To: References: <20161121005009.4d250936@JRWUBU2> Message-ID: <20161121223606.0aae163b@JRWUBU2>

On Mon, 21 Nov 2016 20:08:29 +0100 Philippe Verdy wrote:
> So the question for the Latin language would be to identify which calendar is official, not how we can provide relevant and accurate calendar translations in Latin for the three calendars. If you consider the "la" locale, it should by default be bound to the current modern epoch, so using the Gregorian calendar by default.
> For other historic periods, you'd need at least other sublocales: one for the Roman Republic, another for the Roman Empire starting with Julius Caesar, bound to the early Julian calendar, another after Emperor Augustus, bound to the modern Julian calendar, and another for the introduction of the Gregorian calendar - that means four distinct locales for Latin.

You can qualify a locale by the calendar in use. I nearly referred to en_ca_buddhist_GB in my previous post, but then discovered there was a better way of doing it. The cycle of days and months is almost continuous for any region; the problems are to identify the switchover from Julian to Gregorian in each region, and that is not peculiar to Latin. The use of the AD system of dates owes a lot to the Carolingian Renaissance. Ideally, we ought to have lots of regnal lists, including lists of consuls. In practice, with one exception, I don't think these are needed for real man-machine interfaces.

> And you'd probably need further distinctions at the linguistic level for the introduction of lowercase letters in the Middle Ages (early Classical Latin was unicameral): five distinct locale variants for this one language in the same script!

This could be quite relevant for detecting sentence boundaries. Of course, you also have the evolution of word-boundary marking from interpunct, to no spacing, to spacing, and the disappearance of the apex. However, modern Classical Latin does use inter-word spaces, and editors usually do the hard work of determining sentence boundaries. (I think Unicode would have had a lot of trouble with the disunification of 'u' and 'v'.) I'm not sure of the relevance of the appearance of the macron and breve in teaching materials. For these, there also seems to be a switch from the marking of syllable quantity to the marking of vowel quantity. Perhaps these differences are outside the scope of CLDR, though they're not irrelevant to spelling and grammar checkers.

> You could as well extend this to earlier periods when Latin was not yet the language of the whole Roman Empire,

It never was, even in the West.

> and had various regional "Italic" variants, some of them still exhibiting Classical Greek features.

Richard.
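The "better way" Richard alludes to is presumably the BCP 47 -u-ca- keyword, under which the calendar rides along with an ordinary locale instead of requiring a separate one. A minimal editorial sketch of the idea, assuming ICU4J is on the classpath (the printed Buddhist-era year is the expected output, not a captured one):

    import com.ibm.icu.text.DateFormat;
    import com.ibm.icu.util.Calendar;
    import com.ibm.icu.util.ULocale;

    public class BuddhistDate {
        public static void main(String[] args) {
            // Plain en-GB plus a calendar keyword; no separate locale needed.
            ULocale loc = ULocale.forLanguageTag("en-GB-u-ca-buddhist");
            Calendar cal = Calendar.getInstance(loc); // honours the ca keyword
            DateFormat df = DateFormat.getDateInstance(cal, DateFormat.LONG, loc);
            System.out.println(df.format(cal.getTime())); // e.g. "21 November 2559 BE"
        }
    }

The same mechanism covers the en-GB-u-rg-thzzzz example earlier in the thread: keywords extend a base locale rather than multiplying locales.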
From richard.wordingham at ntlworld.com Mon Nov 21 16:58:02 2016
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 21 Nov 2016 22:58:02 +0000
Subject: Dataset for all ISO639 code sorted by country/territory?
In-Reply-To: References: <20161121005009.4d250936@JRWUBU2> Message-ID: <20161121225802.53dc837f@JRWUBU2>

On Mon, 21 Nov 2016 14:15:01 -0800 "Steven R. Loomis" wrote:

>> The minimal data set can be difficult to collect, and may actually be impossible. There may be technical issues - can one actually specify that today's date is "a.d. XI Kal. Dec. a.u.c. MMDCCLXIX" in Classical Latin?
>
> Yes, you can use numbering system "roman" (uppercase):
> http://www.unicode.org/repos/cldr/trunk/common/bcp47/number.xml

It was the '11 days (inclusive) before the Calends of December' bit that had me worried. Is the following rule built into ICU?

"In March, July, October, May
The Nones are on the 7th day."

>> This remark will apply as well to Biblical Greek, Biblical/Masoretic Hebrew, Biblical Ge'ez (in Ethiopia), Biblical Georgian, or Quranic Arabic [...]: these **living** religious variants should IMHO be encoded in CLDR.
>
> If someone provides data for them and maintains them, yes.

The most important data will be the locale-based data for text manipulation, rather than much of the other stuff.

Richard.

From srl at icu-project.org Mon Nov 21 17:33:10 2016
From: srl at icu-project.org (Steven R. Loomis)
Date: Mon, 21 Nov 2016 15:33:10 -0800
Subject: Dataset for all ISO639 code sorted by country/territory?
In-Reply-To: <20161121225802.53dc837f@JRWUBU2> References: <20161121005009.4d250936@JRWUBU2> <20161121225802.53dc837f@JRWUBU2> Message-ID: <2C4E6155-397D-4339-95C3-54D5C9196BB1@icu-project.org>

On 11/21/16 2:58 PM, Richard Wordingham (via CLDR-Users) wrote:

> It was the '11 days (inclusive) before the Calends of December' bit that had me worried.
>
> Is the following rule built into ICU?
>
> "In March, July, October, May
> The Nones are on the 7th day."

Not yet.

> The most important data will be the locale-based data for text manipulation, rather than much of the other stuff.

Sure. I think the point is that "here is some data for X" is probably more helpful to everyone than "CLDR ought to include X".

-s
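So ICU can already produce the numeral itself, even though the Kalends/Nones reckoning is not built in. An editorial sketch, assuming ICU4J, of the "roman" numbering system Steven points to, applied to the a.u.c. year from Richard's example:

    import com.ibm.icu.text.NumberFormat;
    import com.ibm.icu.util.ULocale;

    public class RomanYear {
        public static void main(String[] args) {
            // nu=roman selects the algorithmic, RBNF-backed numbering system
            // registered in common/bcp47/number.xml.
            NumberFormat nf = NumberFormat.getInstance(
                    ULocale.forLanguageTag("en-u-nu-roman"));
            System.out.println(nf.format(2769)); // expected: MMDCCLXIX
        }
    }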
From srl at icu-project.org Mon Nov 21 18:00:59 2016
From: srl at icu-project.org (Steven R. Loomis)
Date: Mon, 21 Nov 2016 16:00:59 -0800
Subject: Dataset for all ISO639 code sorted by country/territory?
In-Reply-To: References: Message-ID: <488D0FBC-4540-4B62-968D-54537B85F919@icu-project.org>

Mats, I replied to your tickets http://unicode.org/cldr/trac/ticket/9915 and http://unicode.org/cldr/trac/ticket/9916 - thank you for the good ideas (as far as completeness goes), but it's not really clear what the purpose of the ticket should be.

On 11/20/16 11:35 AM, Mats Blakstad (via CLDR-Users) wrote:

> I understand it would take a lot of time to collect the full data, but it also depends on how much engagement you manage to create for the work. On the other side: simply allowing users to start providing the data is the first step in the process, and it would take very little time to do!

It's not clear how users are hindered from providing data now? At present, the data is very meticulously collected from a number of sources, including feedback comments.

Steven

On 20 November 2016 at 19:54, Doug Ewell wrote:
> Mats, I think you are genuinely underestimating the time and effort that this project would take.

From mats.gbproject at gmail.com Mon Nov 21 21:06:50 2016
From: mats.gbproject at gmail.com (Mats Blakstad)
Date: Tue, 22 Nov 2016 04:06:50 +0100
Subject: Dataset for all ISO639 code sorted by country/territory?
In-Reply-To: <488D0FBC-4540-4B62-968D-54537B85F919@icu-project.org> References: <488D0FBC-4540-4B62-968D-54537B85F919@icu-project.org> Message-ID:

Thanks for the reply, Steven! Also thanks to Mark Davis for explaining more about the calculation of language speakers within a territory.

I'm interested in helping provide data - however, it is not clear to me whether that is possible, or what the criteria are.

I initially wanted to use a language-country dataset from the Ethnologue: https://www.ethnologue.com/codes/download-code-tables I wanted to play with this data: filter out only living languages, merge it with data from the IANA subtag registry and CLDR locales to also map different variants and standards of languages, and see if I could make some infographics or compile it with data from other sources.

However, even though this data is free to download, it is licensed; you can't change it and you can't make it available for others to download. I contacted the Ethnologue to ask if I could use the data. After a month I got an answer that they want to see an example of the new dataset, and then they can give me a price for it. As I see it, this puts a lot of constraints on me. I don't have money to buy that dataset from the Ethnologue, and I don't want to have to ask them every time I want to make changes or try something new (and maybe wait a month each time for their answer). I guess this is also one of the advertised benefits of open-source data: you can simply adapt and use it for your own purposes without needing to ask anyone.

Then I asked here on the list whether we could manage to make a full language-territory mapping within CLDR, but the answers on this list so far are that such a mapping would be very subjective (even though it is also stated that it is not needed, as the Ethnologue has made a good dataset already).

So I suggested that, in that case, we could go for purely objective criteria: we map languages to territories based on evidence of the number of people speaking the language in the territory. With this approach it doesn't matter how big or small the population is, and anyone using the data can extract the data they need based on their own criteria (e.g. only use languages with more than 5% of speakers within a territory). Then it was said that the data for the smaller languages is not useful, and that it is unrealistic as not all languages have locale data - but of course these subjective comments don't clarify what the objective criteria are.

I understand that it is not just a 1-2-3 to collect a full dataset, but some clear criteria that apply to all languages should be developed, so the data can be structured in a way that facilitates getting it done in the long run:
- What is the minimum of data needed to add support for a language in CLDR?
- Can any language be included? And if not, what are the criteria we operate with? As an example, I would like to add Elfdalian; it is pretty straightforward: 2000 speakers in Sweden, in Dalarna (subdivision SE-W).
Can I just open a ticket and get this data added to CLDR once it has been reviewed?
- What criteria are applied for language-territory mapping? For instance, in the Ethnologue there is a notion of "immigrant" languages. Should objective or subjective criteria be used?
http://unicode.org/cldr/trac/ticket/9897
http://unicode.org/cldr/trac/ticket/9915

The way I see it, starting with some language-territory mapping, especially mapping to subdivisions, before we have reliable sources of accurate population, could also help generate more data in the long run, as it is much easier to collect the data once it has been geographically mapped.

About language status, I would be happy to start adding data, but maybe it should be clarified exactly which categories are most feasible?
http://unicode.org/cldr/trac/ticket/9856
http://unicode.org/cldr/trac/ticket/9916

Mats

From hugh_paterson at sil.org Tue Nov 22 02:05:01 2016
From: hugh_paterson at sil.org (Hugh Paterson)
Date: Tue, 22 Nov 2016 00:05:01 -0800
Subject: Dataset for all ISO639 code sorted by country/territory?
In-Reply-To: References: <488D0FBC-4540-4B62-968D-54537B85F919@icu-project.org> Message-ID:

Mats,

Just a thought. What do you gain by using the Ethnologue tables (ISO 8859-1 encoded tables) over just using the openly licensed ISO 639-3 tables (in UTF-8)? http://www-01.sil.org/iso639-3/download.asp I have noticed some differences in the names of languages in these two files. I would stick with the UTF-8 tables. The UTF-8 tables are the source of the Ethnologue data, not the other way round.
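Acting on Hugh's suggestion is straightforward. An editorial sketch, assuming a local copy of the UTF-8 iso-639-3.tab file, whose tab-separated columns (Id, Part2B, Part2T, Part1, Scope, Language_Type, Ref_Name, Comment) are described on the download page; it performs the "filter out only living languages" step Mats mentioned:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    public class LivingLanguages {
        public static void main(String[] args) throws IOException {
            try (Stream<String> lines = Files.lines(Paths.get("iso-639-3.tab"))) {
                lines.skip(1)                               // header row
                     .map(line -> line.split("\t", -1))
                     .filter(cols -> "L".equals(cols[5]))   // Language_Type: L = living
                     .forEach(cols -> System.out.println(cols[0] + "\t" + cols[6]));
            }
        }
    }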
The Ethnologue does provide a country correspondence, and this is not necessarily changeable (due to the license). However, there is another project called Glottolog (http://glottolog.org) which does propose a GPS coordinate for most languages: http://glottolog.org/glottolog/language (their definition of a "language" is different from ISO 639-3's definition, but their data includes the ISO 639-3 set of language distinctions). Glottolog data is a bit more open than the Ethnologue data. Glottolog 2.7 data is licensed under Creative Commons 3.0 Attribution-ShareAlike, and is available on GitHub: https://github.com/clld/glottolog-data

Now, we can't just go out and build upon the Ethnologue's data tables, but with a GPS coordinate in an open data table, a query of the GEOhack API would return a country code and a secondary administrative unit of a political entity for a GPS coordinate. Here is an example using the coordinates for Frankfurt a. M., Germany:

https://tools.wmflabs.org/geohack/geohack.php?pagename=Frankfurt&params=50_7_N_8_41_E_type:city(732688)_region:DE-HE

Now, the accessible Ethnologue tables could be used to verify GPS point data obtained from Glottolog. If there were a discrepancy between the two data sets, one would have to determine how to make an editorial choice between the two sources. However, essentially, the functionality of the language-country correspondence would be replicated, albeit from different sources, and merely verified to be congruent with the Ethnologue data tables.

I agree with you that there is great value in open data sets.

all the best,
Hugh Paterson III

From mats.gbproject at gmail.com Tue Nov 22 02:50:47 2016
From: mats.gbproject at gmail.com (Mats Blakstad)
Date: Tue, 22 Nov 2016 09:50:47 +0100
Subject: Dataset for all ISO639 code sorted by country/territory?
In-Reply-To: References: <488D0FBC-4540-4B62-968D-54537B85F919@icu-project.org> Message-ID:

On 22 Nov 2016 9:05 am, "Hugh Paterson" wrote:
> What do you gain by using the Ethnologue tables (ISO 8859-1 encoded tables) over just using the openly licensed ISO 639-3 tables (in UTF-8)? [...] The UTF-8 tables are the source of the Ethnologue data, not the other way round.
> The Ethnologue does provide a country correspondence, and this is not necessarily changeable (due to the license). However, there is another project called Glottolog (http://glottolog.org) which does propose a GPS coordinate for most languages [...] Glottolog 2.7 data is licensed under Creative Commons 3.0 Attribution-ShareAlike, and is available on GitHub: https://github.com/clld/glottolog-data
>
> Now, the accessible Ethnologue tables could be used to verify GPS point data obtained from Glottolog. [...]

This is a great idea! I did check the data at Glottolog; it is not complete, and of course many languages are spoken in more areas than one GPS coordinate, but it could be a really good starting point for creating an initial dataset! I guess the language-territory mapping already inside CLDR could be used as a third reference source to compare with.

> I agree with you that there is great value in open data sets.
>
> all the best,
> Hugh Paterson III
From srl at icu-project.org Tue Nov 22 13:24:52 2016
From: srl at icu-project.org (Steven R. Loomis)
Date: Tue, 22 Nov 2016 11:24:52 -0800
Subject: Dataset for all ISO639 code sorted by country/territory?
In-Reply-To: References: <488D0FBC-4540-4B62-968D-54537B85F919@icu-project.org> Message-ID: <520C6D97-128E-405F-BCAF-FAFA126DD244@icu-project.org>

On 11/21/16 7:06 PM, Mats Blakstad wrote:

> I'm interested in helping provide data - however, it is not clear to me whether that is possible, or what the criteria are.

If you are talking about locale data - the criteria are here:
http://cldr.unicode.org/index/bug-reports#New_Locales

If you are talking about supplemental data (such as population figures, etc.) it would be important to know what you are actually trying to do with the data, and where it is insufficient. Adding more data to add more data is not a sufficient reason. I do want to see better support for all languages, certainly. But that is a time-consuming process, involving individual people and languages - not bulk datasets.

> Then I asked here on the list whether we could manage to make a full language-territory mapping within CLDR, but the answers on this list so far are that such a mapping would be very subjective (even though it is also stated that it is not needed, as the Ethnologue has made a good dataset already).

All of this is more of a discussion to have with the Ethnologue. I browse the Ethnologue somewhat frequently, but I do not see the benefit in simply importing it into the CLDR supplemental data.

> So I suggested that, in that case, we could go for purely objective criteria [...]

What are your objective criteria?

> - What is the minimum of data needed to add support for a language in CLDR?

That information is at http://cldr.unicode.org/index/bug-reports#New_Locales

> - Can any language be included?

Theoretically, yes.

> And if not, what are the criteria we operate with? As an example, I would like to add Elfdalian; it is pretty straightforward: 2000 speakers in Sweden, in Dalarna (subdivision SE-W). Can I just open a ticket and get this data added to CLDR once it has been reviewed?

Yes. But, just as with ancient Latin, it's all just an interesting thought exercise unless a ticket is opened.

> - What criteria are applied for language-territory mapping?
> For instance, in the Ethnologue there is a notion of "immigrant" languages. Should objective or subjective criteria be used?
> http://unicode.org/cldr/trac/ticket/9897
> http://unicode.org/cldr/trac/ticket/9915

See http://cldr.unicode.org/translation/default-content and http://cldr.unicode.org/index/cldr-spec/minimaldata . The mapping is used to determine, for example, what territory is the default for de (German) - is it Germany? Switzerland? The US? Malta? All of these are possible. Which one is chosen is a judgement call in the context of locale data. I see the Ethnologue's term defined at https://www.ethnologue.com/about/country-info - I don't think it's relevant to CLDR.

> The way I see it, starting with some language-territory mapping, especially mapping to subdivisions, before we have reliable sources of accurate population, could also help generate more data in the long run, as it is much easier to collect the data once it has been geographically mapped.

I'll ask again though: what is your use case? Is it to duplicate the Ethnologue? It's hard to see the data collection mentioned here or in the other thread (geolocation data) as being relevant to locale data - which is the purpose of CLDR.

> About language status, I would be happy to start adding data, but maybe it should be clarified exactly which categories are most feasible?
> http://unicode.org/cldr/trac/ticket/9856
> http://unicode.org/cldr/trac/ticket/9916

I think this might be best answered when the tickets are reviewed.
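That judgement call is queryable, since CLDR's likely-subtags data is exposed through ICU. A one-method editorial sketch, assuming ICU4J (the printed value is what the data is expected to yield):

    import com.ibm.icu.util.ULocale;

    public class LikelyDe {
        public static void main(String[] args) {
            // Fill in the script and region CLDR considers most likely for "de".
            ULocale max = ULocale.addLikelySubtags(new ULocale("de"));
            System.out.println(max); // expected: de_Latn_DE, i.e. Germany wins
        }
    }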
From richard.wordingham at ntlworld.com Tue Nov 22 18:27:37 2016
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 23 Nov 2016 00:27:37 +0000
Subject: Dataset for all ISO639 code sorted by country/territory?
In-Reply-To: <2C4E6155-397D-4339-95C3-54D5C9196BB1@icu-project.org> References: <20161121005009.4d250936@JRWUBU2> <20161121225802.53dc837f@JRWUBU2> <2C4E6155-397D-4339-95C3-54D5C9196BB1@icu-project.org> Message-ID: <20161123002737.27b8f8f0@JRWUBU2>

On Mon, 21 Nov 2016 15:33:10 -0800 "Steven R. Loomis" wrote:
> I think the point is that "here is some data for X" is probably more helpful to everyone than "CLDR ought to include X".

The problem is that general applications (which I believe are the main point of Unicode) are likely to insist on extracting the data from CLDR, perhaps even via ICU.

Richard.

From mats.gbproject at gmail.com Wed Nov 23 17:24:19 2016
From: mats.gbproject at gmail.com (Mats Blakstad)
Date: Thu, 24 Nov 2016 00:24:19 +0100
Subject: Dataset for all ISO639 code sorted by country/territory?
In-Reply-To: <520C6D97-128E-405F-BCAF-FAFA126DD244@icu-project.org> References: <488D0FBC-4540-4B62-968D-54537B85F919@icu-project.org> <520C6D97-128E-405F-BCAF-FAFA126DD244@icu-project.org> Message-ID:

On 22 November 2016 at 20:24, Steven R. Loomis wrote:
> If you are talking about locale data - the criteria are here:
> http://cldr.unicode.org/index/bug-reports#New_Locales

Thanks for the info! It seems that several languages are already included in supplementalData.xml that do not have locales, so it looks like we can add new supplemental data for languages without locale data. It also looks like there is support for languages with as few as 0.0031% of a territory's speakers, so several small languages are already supported.

> If you are talking about supplemental data (such as population figures, etc.) it would be important to know what you are actually trying to do with the data, and where it is insufficient. Adding more data to add more data is not a sufficient reason.

Yes, I'm talking about the supplemental data. I don't want to add data just "to add more data", though I definitely think building data that can help generate more data about, and support for, more languages is a valid reason.

I want to use the data for many things: more easily identify the likely second language of speakers of "lesser known languages", based on HTTP Accept-Language and the territory or subdivision they are in; present information in these languages, and a language switcher for them, depending on which territory/subdivision the user is from; and offer users the chance to help translate into local languages depending on their territory/subdivision. The bottom line is to give a better user experience to people speaking "lesser known languages". With a language-territory mapping it will be possible for developers to use this data in new creative ways to better support multilingualism.
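The Accept-Language part of that scenario is already mechanical. An editorial sketch with ICU4J's matcher - the header string, the hypothetical Tem (kdh) translation, and the list of available locales are all invented for illustration:

    import com.ibm.icu.util.ULocale;

    public class PickLocale {
        public static void main(String[] args) {
            // Locales a hypothetical site actually has translations for.
            ULocale[] available = {
                new ULocale("fr"), new ULocale("ee"), new ULocale("kdh")
            };
            boolean[] fellBack = new boolean[1];
            // A browser in Togo might send a header like this one.
            ULocale best = ULocale.acceptLanguage("kdh-TG, fr;q=0.7",
                    available, fellBack);
            System.out.println(best + " (fallback used: " + fellBack[0] + ")");
        }
    }

What the territory data would add is the reverse direction: given only a territory or subdivision, a ranked guess at which languages to offer.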
> I do want to see better support for all languages, certainly. But that is a time-consuming process, involving individual people and languages - not bulk datasets.

I do not really understand why bulk datasets should not be accepted; to me it seems that data is added based on evidence, so whether the data is added should depend on whether it comes from a reliable source. Besides, I'm an individual person and I'm ready to be involved!

> All of this is more of a discussion to have with the Ethnologue. I browse the Ethnologue somewhat frequently, but I do not see the benefit in simply importing it into the CLDR supplemental data.
>
> What are your objective criteria?

I would say: we map any language to a territory based on evidence. Where we can document a number of speakers, we add the language no matter what status it has. If we can't accurately state a number of speakers, but know that the territory is the primary place the language is spoken, we map it even without an accurate language population. As an example, from Glottolog we can see that the language Tem is spoken in Benin, Ghana and Togo, and this information can easily be verified by comparing it with the data from the Ethnologue:
http://glottolog.org/resource/languoid/id/temm1241
https://www.ethnologue.com/language/kdh
We can't copy the Ethnologue's population data, but at least we know that two reliable sources say that this is the correct language-territory mapping. Based on this evidence we can now map the Tem language to Benin, Ghana and Togo, even though we do not have exact population data. I guess in many cases the mapping in itself is enough to do many things to support "lesser known languages". Those not interested in this mapping can of course easily extract only the territory-language mappings that include an indication of language population.

> But, just as with ancient Latin, it's all just an interesting thought exercise unless a ticket is opened.

Done: http://unicode.org/cldr/trac/ticket/9919
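Mats's Tem example corresponds to the territoryInfo section of supplementalData.xml, where each territory element carries languagePopulation children. An editorial sketch, assuming a local copy of the file (element and attribute names follow the CLDR supplemental DTD), that lists what CLDR currently records for Togo:

    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class TogoLanguages {
        public static void main(String[] args) throws Exception {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            // Skip fetching the DTD referenced by the file's DOCTYPE.
            dbf.setFeature(
                "http://apache.org/xml/features/nonvalidating/load-external-dtd",
                false);
            Document doc = dbf.newDocumentBuilder().parse("supplementalData.xml");
            NodeList ts = doc.getElementsByTagName("territory");
            for (int i = 0; i < ts.getLength(); i++) {
                Element t = (Element) ts.item(i);
                if (!"TG".equals(t.getAttribute("type"))) continue; // Togo
                NodeList ls = t.getElementsByTagName("languagePopulation");
                for (int j = 0; j < ls.getLength(); j++) {
                    Element l = (Element) ls.item(j);
                    System.out.println(l.getAttribute("type") + "  "
                            + l.getAttribute("populationPercent") + "%");
                }
            }
        }
    }

In data terms, a new languagePopulation line for kdh under the TG territory is all the proposed Tem mapping would amount to.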
From hugh_paterson at sil.org Thu Nov 24 10:42:56 2016
From: hugh_paterson at sil.org (Hugh Paterson)
Date: Thu, 24 Nov 2016 08:42:56 -0800
Subject: Dataset for all ISO639 code sorted by country/territory?
In-Reply-To: References: <488D0FBC-4540-4B62-968D-54537B85F919@icu-project.org> <520C6D97-128E-405F-BCAF-FAFA126DD244@icu-project.org> Message-ID:

Mats,

How do you know that Glottolog did not copy Ethnologue data, or that its primary cited evidence is not the Ethnologue? Glottolog should not be cited as a source itself, but rather treated as an aggregation of facts, which in turn need independent citations. Some types of Glottolog data are produced via scripted data extraction.

In contrast, the editors of the Ethnologue host workshops in various regions of the world and directly elicit data from language community members. (But not 100% of their data is collected this way; some comes from language development workers or academics who work in these communities.)

So, qualitatively the two sources are very different, and deserve appropriate levels of respect. Just because we read a news story on the BBC's and Al Jazeera's websites does not mean that the story is accurate or even true.

- Hugh
From cjl at sugarlabs.org Thu Nov 24 12:28:26 2016
From: cjl at sugarlabs.org (Chris Leonard)
Date: Thu, 24 Nov 2016 13:28:26 -0500
Subject: Dataset for all ISO639 code sorted by country/territory?
In-Reply-To: References: <488D0FBC-4540-4B62-968D-54537B85F919@icu-project.org> <520C6D97-128E-405F-BCAF-FAFA126DD244@icu-project.org> Message-ID:

Just so you know, there are other sources of indigenous language data that are locally developed, such as First Languages Australia:

http://firstlanguages.org.au/

at

http://gambay.com.au/map

but I'm not sure about getting access to the lat/long data; you'd have to talk to them.

cjl

From mats.gbproject at gmail.com Thu Nov 24 16:21:28 2016
From: mats.gbproject at gmail.com (Mats Blakstad)
Date: Thu, 24 Nov 2016 23:21:28 +0100
Subject: Dataset for all ISO639 code sorted by country/territory?
In-Reply-To: References: <488D0FBC-4540-4B62-968D-54537B85F919@icu-project.org> <520C6D97-128E-405F-BCAF-FAFA126DD244@icu-project.org> Message-ID:

On 24 November 2016 at 17:42, Hugh Paterson wrote:
> How do you know that Glottolog did not copy Ethnologue data, or that its primary cited evidence is not the Ethnologue? Glottolog should not be cited as a source itself, but rather treated as an aggregation of facts, which in turn need independent citations. Some types of Glottolog data are produced via scripted data extraction.

I'm not sure; however, the question is speculative. Glottolog is published under a Creative Commons license, so the license gives the right to copy the data.

> In contrast, the editors of the Ethnologue host workshops in various regions of the world and directly elicit data from language community members.

I sent an email to ask Glottolog and got an answer from Harald Hammarström, who gave me permission to quote him:

*For the language inventory Glottolog is more reliable than Ethnologue. While it's true that SIL has teams that do surveys and Glottolog does not, Glottolog cites such surveys (including those not done by SIL) and adjusts. It is correct that Ethnologue was used as a starting point of the Glottolog inventory, and a lot of it turns out to be correct given the entire literature out there. If this is what is meant by "copy" then it is correct. In this sense basically every handbook (incl. Ethnologue) has copied every preceding one, and this is good practice as long as it is cited. Around 10% of the Ethnologue inventory has been revised into what is now Glottolog. Glottolog does not cite Ethnologue every time an entry corresponds (though we do give the link), because Ethnologue does not provide sources itself; instead, for every language there is at least one reference to the literature where one can go and find more information about the language, from a book or paper which does explain how they got their data and so on. The dialect inventory in Glottolog, on the other hand, is not reliable. The language-country mappings (is this what you mean by language-territory mappings?) are trivial as soon as the identity of the language is established, and should be the same as in Ethnologue whenever the language identity is parallel, with the exception that Glottolog is more restrictive in adding the country of an immigrant community (+ various misc revisions).
I do not consider language-country mappings a well-defined problem in the age of globalization, when you can have a majority of a speaker community living in the capital of a country different from that of their home community, so the language-country mappings are reviewed only to the degree that the country/ies listed by Glottolog are a subset of those where the speakers live, or lived at the time of the first eyewitness ethnographic documentation.*

> So, qualitatively the two sources are very different, and deserve appropriate levels of respect. Just because we read a news story on the BBC's and Al Jazeera's websites does not mean that the story is accurate or even true.

I'm not really sure whether the Ethnologue has better-quality language-territory mapping than Glottolog. However, Glottolog is something that can be built on, as it is Creative Commons, so it is the only viable starting point. It will, however, be interesting to compare the two data sets to see how much they diverge.
>>> If you are talking about supplemental data (such as population figures,
>>> etc.), it would be important to know what you are actually trying to do
>>> with the data, and where it is insufficient. Adding more data just to
>>> add more data is not a sufficient reason.
>>
>> Yes, I'm talking about the supplemental data. I don't want to add data
>> just "to add more data", even though I definitely think that building
>> data which can help generate more data about, and support for, more
>> languages is a valid reason.
>>
>> I want to use the data for many things: more easily identify the likely
>> second language of speakers of "lesser-known languages" based on the
>> HTTP Accept-Language header and the territory or subdivision they are
>> in; be able to present information in these languages, and a language
>> switcher for them, depending on which territory/subdivision the user is
>> from; be able to offer users to help translate into local languages
>> depending on their territory/subdivision. The bottom line is: be able
>> to give a better user experience to people speaking "lesser-known
>> languages". With a language-territory mapping it will be possible for
>> developers to use this data in new, creative ways to better support
>> multilingualism.
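>>
>> As a sketch of the first of those ideas (all names and data below are
>> invented; a real implementation would load the territory-to-language
>> mapping discussed above instead of the hard-coded dict):
>>
>> # Sketch: suggest languages for a visitor from the Accept-Language
>> # header plus a territory-to-languages mapping. Data is invented.
>> TERRITORY_LANGUAGES = {"TG": ["fr", "ee", "kdh"]}  # e.g. Togo
>>
>> def parse_accept_language(header):
>>     # "fr-FR,fr;q=0.9,en;q=0.7" -> ["fr-FR", "fr", "en"] (by q-value)
>>     weighted = []
>>     for part in header.split(","):
>>         pieces = part.strip().split(";")
>>         q = 1.0
>>         for p in pieces[1:]:
>>             if p.strip().startswith("q="):
>>                 q = float(p.strip()[2:])
>>         weighted.append((pieces[0].strip(), q))
>>     weighted.sort(key=lambda item: -item[1])
>>     return [tag for tag, q in weighted if q > 0]
>>
>> def candidate_languages(header, territory):
>>     preferred = parse_accept_language(header)
>>     local = TERRITORY_LANGUAGES.get(territory, [])
>>     # The user's stated languages first, then other local languages.
>>     return preferred + [lang for lang in local if lang not in preferred]
>>
>> print(candidate_languages("fr-FR,fr;q=0.9,en;q=0.7", "TG"))
>> # -> ['fr-FR', 'fr', 'en', 'ee', 'kdh']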
>>> I do want to see better support for all languages, certainly. But that
>>> is a time-consuming process, involving individual people and languages,
>>> not bulk datasets.
>>
>> I do not really understand why bulk datasets should not be accepted; it
>> seems to me that data is added based on evidence, so whether the data
>> is added should depend on whether it comes from a reliable source.
>> Besides, I'm an individual person, and I'm ready to be involved!
>>
>>> Then I asked here in the list if we could maybe manage to make a full
>>> language-territory mapping within CLDR, but the answers on this list
>>> until now are that such a mapping would be very subjective (even though
>>> it is also stated that it is not needed, as Ethnologue has already made
>>> a good dataset).
>>>
>>> All of this is more of a discussion to have with the Ethnologue. I
>>> browse the Ethnologue somewhat frequently, but I do not see the benefit
>>> in simply importing it into the CLDR supplemental data.
>>>
>>> So I suggested that we could instead go for purely objective criteria:
>>> we map languages to territories based on evidence of the number of
>>> people speaking the language in the territory. With this approach it
>>> doesn't matter how big or small the population is, and anyone using the
>>> data can extract what they need based on their own criteria (e.g. only
>>> use languages with more than 5% of speakers within a territory). Then
>>> it's been said that the data for the smaller languages is not useful,
>>> and that it is unrealistic as not all languages have locale data, but
>>> of course these subjective comments don't clarify what the objective
>>> criteria are.
>>>
>>> What are your objective criteria?
>>
>> I would say: we map any language to a territory based on evidence. Where
>> we can document a number of speakers, we add the language no matter what
>> status it has. If we can't accurately state a number of speakers, but
>> know that the territory is the primary place the language is spoken, we
>> map it even without an accurate language population. As an example: from
>> Glottolog we can see that the language Tem is spoken in Benin, Ghana and
>> Togo, and this information can easily be verified by comparing it with
>> the data from the Ethnologue:
>> http://glottolog.org/resource/languoid/id/temm1241
>> https://www.ethnologue.com/language/kdh
>> We can't copy the Ethnologue's population data, but at least we know
>> that two reliable sources say that this is the correct
>> language-territory mapping. Based on this evidence we can now map the
>> Tem language to Benin, Ghana and Togo, even though we do not have exact
>> population data. I guess in many cases the mapping in itself is enough
>> to do many things to support "lesser-known languages". Those not
>> interested in this mapping can of course easily extract only the
>> territory-language mappings that have an indication of language
>> population.
>>
>>> I understand that it is not just a 1-2-3 to collect a full dataset, but
>>> some clear criteria that apply to all languages should be developed, so
>>> the data can be structured in a way that facilitates doing this in the
>>> long run:
>>> - What is the minimum of data needed to add support for a language in
>>> CLDR?
>>>
>>> That information is at
>>> http://cldr.unicode.org/index/bug-reports#New_Locales
>>>
>>> - Can any language be included?
>>>
>>> Theoretically, yes.
>>>
>>> And if not, what are the criteria we operate with? As an example, I
>>> would like to add Elfdalian; it is pretty straightforward: 2000
>>> speakers in Sweden, in Dalarna (subdivision SE-W). Can I just open a
>>> ticket and get this data added to CLDR once it's been reviewed?
>>>
>>> Yes.
>>>
>>> But, just as with ancient Latin, it's all just an interesting thought
>>> exercise, unless a ticket is opened.
>>
>> Done:
>> http://unicode.org/cldr/trac/ticket/9919

From mats.gbproject at gmail.com  Thu Nov 24 16:31:05 2016
From: mats.gbproject at gmail.com (Mats Blakstad)
Date: Thu, 24 Nov 2016 23:31:05 +0100
Subject: Dataset for all ISO639 code sorted by country/territory?
In-Reply-To: 
References: <488D0FBC-4540-4B62-968D-54537B85F919@icu-project.org>
 <520C6D97-128E-405F-BCAF-FAFA126DD244@icu-project.org>
Message-ID: 

On 24 November 2016 at 19:28, Chris Leonard wrote:

> Just so you know, there are other sources of indigenous language data
> that are locally developed, for example by First Languages Australia:
>
> http://firstlanguages.org.au/
>
> with a map at
>
> http://gambay.com.au/map

Thank you for this tip! I have also started to look around for other data
sets that can be used to validate or elaborate on the data from Glottolog,
so other suggestions are also helpful.
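One way to use such additional sources is to treat every language-territory
pair as a claim backed by a list of sources, and to trust the pairs on
which at least two independent sources agree, as with Tem earlier in this
thread. A small Python sketch of that idea (the Tem rows reflect the
Glottolog and Ethnologue links quoted above; the last row, its "xyz" code
and the source names are invented placeholders):

from collections import defaultdict

# Sketch: aggregate language-territory claims from several sources and
# keep the ones confirmed by at least two of them.
claims = [
    ("kdh", "TG", "glottolog"),
    ("kdh", "TG", "ethnologue"),
    ("kdh", "BJ", "glottolog"),
    ("kdh", "BJ", "ethnologue"),
    ("xyz", "AU", "gambay"),  # placeholder for a locally developed source
]

evidence = defaultdict(set)
for lang, territory, source in claims:
    evidence[(lang, territory)].add(source)

confirmed = sorted(k for k, v in evidence.items() if len(v) >= 2)
needs_review = sorted(k for k, v in evidence.items() if len(v) < 2)
print("confirmed:", confirmed)
print("needs review:", needs_review)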