Dataset for all ISO639 code sorted by country/territory?

Mon Nov 21 21:06:50 CST 2016

Thanks for the reply, Steven!
Thanks also to Mark Davis for explaining more about how language
speakers within a territory are calculated.

I'm interested in helping to provide data. However, it is not clear to me
whether that is possible or what the criteria are.

I initially wanted to use a language-country dataset from the Ethnologue:
https://www.ethnologue.com/codes/download-code-tables
I wanted to experiment with this data: for example, keep only living
languages, merge it with data from the IANA subtag registry and CLDR
locales to also map different variants and standards of languages, and see
if I could make some infographics or compile it with data from other
sources.
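
To make the idea concrete, here is a minimal Python sketch of that kind of
filter-and-merge step. It is only an illustration: the column names
(LangID, CountryID, LangStatus), the status values "L" and "X", and the
KNOWN_SUBTAGS lookup are assumptions made for this sketch, and the rows use
private-use placeholder codes rather than real data from the Ethnologue,
IANA or CLDR.

```python
import csv
import io

# Made-up rows in a tab-separated layout. The column names and the status
# values ("L" living, "X" extinct) are assumptions for this sketch. The
# codes are private-use placeholders, not real languages or countries.
SAMPLE_TABLE = (
    "LangID\tCountryID\tLangStatus\n"
    "qaa\tZZ\tL\n"
    "qab\tZZ\tX\n"
    "qac\tXA\tL\n"
)

# Hand-written stand-in for a lookup built from a subtag registry.
KNOWN_SUBTAGS = {"qaa", "qac"}


def living_language_rows(table_text):
    """Return (language, country) pairs for rows marked as living."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    return [
        (row["LangID"], row["CountryID"])
        for row in reader
        if row["LangStatus"] == "L"
    ]


def group_by_country(pairs, known_subtags):
    """Group language codes by country, keeping only codes with a known subtag."""
    by_country = {}
    for lang, country in pairs:
        if lang in known_subtags:
            by_country.setdefault(country, []).append(lang)
    return by_country


pairs = living_language_rows(SAMPLE_TABLE)
print(group_by_country(pairs, KNOWN_SUBTAGS))
```

With real tables, the same two steps would run on the downloaded files
instead of the inline sample, subject to the license terms of each source.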

However, even though this data is free to download, it is licensed: you
can't change it and you can't make it available for others to download.

I contacted the Ethnologue to ask if I could use the data. After one month
I got an answer that they want to see an example of the new dataset, and
then they can give me a price for it.
As I see it, this puts a lot of constraints on me. I don't have money to
buy that dataset from the Ethnologue, and I don't want to go and ask them
every time I want to make changes or try something new (and maybe need to
wait a month for their answer every time). I guess this is also one of the
advertised benefits of open source data: you can simply adapt and use it
for your own purposes without needing to ask anyone.

Then I asked here on the list whether we could manage to make a full
language-territory mapping within CLDR, but the answers on this list so far
have been that such a mapping would be very subjective (even though it has
also been stated that it is not needed because the Ethnologue has already
made a good dataset).

So I suggested that, in that case, we could go for purely objective
criteria: we map languages to territories based on evidence of the number
of people speaking the language in the territory. With this approach it
doesn't matter how big or small the population is, and anyone using the
data can extract what they need based on their own criteria (e.g. only use
languages spoken by more than 5% of the people within a territory). It has
then been said that the data for the smaller languages is not useful and
that it is unrealistic because not all languages have locale data, but of
course these subjective comments don't clarify what the objective criteria
are.
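
As a rough illustration of how a consumer could apply their own cut-off to
such a mapping, here is a small Python sketch. The function name, the data
layout and all figures are hypothetical placeholders (private-use codes,
invented counts), not CLDR data or a CLDR API.

```python
def languages_above_share(speakers, populations, min_share):
    """Return {territory: [language, ...]} for languages whose share of
    the territory's population is strictly greater than min_share."""
    result = {}
    for (language, territory), count in speakers.items():
        population = populations.get(territory)
        if not population:
            continue  # no population figure, so no share can be computed
        if count / population > min_share:
            result.setdefault(territory, []).append(language)
    return result


# Placeholder figures for illustration only, not real statistics.
speakers = {("qaa", "ZZ"): 900, ("qab", "ZZ"): 60, ("qac", "ZZ"): 40}
populations = {"ZZ": 1000}

# One user keeps only languages above 5%; another keeps everything.
print(languages_above_share(speakers, populations, 0.05))
print(languages_above_share(speakers, populations, 0.0))
```

The point of the sketch is that the threshold lives with the user of the
data, so the dataset itself does not have to decide which languages are
"big enough" to be listed.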

I understand that collecting a full dataset is not as easy as 1-2-3, but
some clear criteria that apply to all languages should be developed, so
that the data can be structured in a way that makes this feasible in the
long run:
- What is the minimum data needed to add support for a language in CLDR?
- Can any language be included? And if not, what criteria do we
operate with? As an example, I would like to add Elfdalian
<https://en.wikipedia.org/wiki/Elfdalian>. It is pretty straightforward:
2000 speakers in Sweden, in Dalarna (subdivision SE-W). Can I just open a
ticket and get this data added to CLDR once it has been reviewed?
- What criteria are applied for language-territory mapping? For instance,
the Ethnologue has a notion of "immigrant" languages. Should objective or
subjective criteria be used?
http://unicode.org/cldr/trac/ticket/9897
http://unicode.org/cldr/trac/ticket/9915

The way I see it, starting with some language-territory mapping, especially
one that includes subdivisions, before we have reliable sources of accurate
population figures could also help generate more data in the long run, as
it is much easier to collect the data once it has been geographically
mapped.

As for language status, I would be happy to start adding data, but maybe it
should first be clarified exactly which categories are most feasible?
http://unicode.org/cldr/trac/ticket/9856
http://unicode.org/cldr/trac/ticket/9916

Mats

On 22 November 2016 at 01:00, Steven R. Loomis <srl at icu-project.org> wrote:

> Mats,
>  I replied to your tickets http://unicode.org/cldr/trac/ticket/9915 and
> http://unicode.org/cldr/trac/ticket/9916 – thank you for the good ideas
> (as far as completeness goes), but it’s not really clear what the purpose
> of the ticket should be.
>
> On 11/20/16 11:35 AM, "CLDR-Users on behalf of Mats Blakstad" <
> cldr-users-bounces at unicode.org on behalf of mats.gbproject at gmail.com>
> wrote:
>
> I understand it would take a lot of time to collect the full data, but it
> also depends on how much engagement you manage to create for the work.
>
> On the other hand, simply allowing users to start providing the data is
> the first step in the process, and doing so would take very little time!
>
>
> It’s not clear how users are hindered from providing data now?  At
> present, the data is very meticulously collected from a number of sources,
> including feedback comments.
>
> Steven
>
>
> On 20 November 2016 at 19:54, Doug Ewell <doug at ewellic.org> wrote:
>
>> Mats,
>>
>> I think you are genuinely underestimating the time and effort that this
>> project would take.
>>
>