Dataset for all ISO639 code sorted by country/territory?

Mats Blakstad mats.gbproject at gmail.com
Tue Nov 22 02:50:47 CST 2016


On 22 Nov 2016 9:05 am, "Hugh Paterson" <hugh_paterson at sil.org> wrote:
>
> Mats,
>
> Just a thought,
>
> What do you gain by using the Ethnologue tables (ISO 8859-1 encoded
tables) over just using the open licensed ISO 639-3 tables (in UTF-8)?
http://www-01.sil.org/iso639-3/download.asp I have noticed some differences
in the names of languages in these two files. I would stick with the UTF-8
tables. The UTF-8 tables are the source of the Ethnologue data, not the
other way round.
>
> The Ethnologue does provide a country correspondence, but this is not
necessarily changeable (due to its license). However, there is another
project called Glottolog http://glottolog.org which does propose a GPS
coordinate for most languages http://glottolog.org/glottolog/language (their
definition of a "language" is different from ISO 639-3's definition, but
their data includes the ISO 639-3 set of language distinctions). Glottolog
data is a bit more open than the Ethnologue data. Glottolog 2.7 data is
licensed under Creative Commons Attribution-ShareAlike 3.0, and is
available on GitHub. https://github.com/clld/glottolog-data
>
> Now we can't just go out and build upon the Ethnologue's data tables, but
with a GPS coordinate in an open data table, a query of the GeoHack API
would return a country code and a secondary administrative unit of a
political entity for a given GPS coordinate. Here is an example using the
coordinates for Frankfurt am Main, Germany:
>
>
https://tools.wmflabs.org/geohack/geohack.php?pagename=Frankfurt&params=50_7_N_8_41_E_type:city(732688)_region:DE-HE
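Building such a GeoHack query from a decimal coordinate is mostly string
formatting. A minimal sketch (the degrees/minutes rendering used here is an
assumption; GeoHack accepts several coordinate formats, and the type/region
hints in the Frankfurt URL are left out):

```python
from urllib.parse import urlencode

def geohack_url(pagename: str, lat: float, lon: float) -> str:
    """Build a GeoHack-style URL from decimal degrees.

    Coordinates are rendered as degrees_minutes_N/S/E/W, the same
    format used in the Frankfurt example.
    """
    def dm(value: float, pos: str, neg: str) -> str:
        hemi = pos if value >= 0 else neg
        value = abs(value)
        deg = int(value)
        minutes = round((value - deg) * 60)
        return f"{deg}_{minutes}_{hemi}"

    params = f"{dm(lat, 'N', 'S')}_{dm(lon, 'E', 'W')}"
    return ("https://tools.wmflabs.org/geohack/geohack.php?"
            + urlencode({"pagename": pagename}) + "&params=" + params)

# Frankfurt am Main: 50°7′N 8°41′E
print(geohack_url("Frankfurt", 50.1167, 8.6833))
```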
>
> Now, the accessible Ethnologue tables could be used to verify GPS point
data obtained from Glottolog. If there were a discrepancy between the two
data sets, one would have to make an editorial choice between the two
sources. Essentially, though, the functionality of the language-country
correspondence would be replicated, albeit from different sources, and
merely verified to be congruent with the Ethnologue data tables.

This is a great idea! I did check the data at Glottolog; it is not
complete, and of course many languages are spoken in more areas than one
GPS coordinate can represent, but it could be a really good starting point
for creating an initial dataset! I guess the language-territory mapping
already inside CLDR could be used as a third reference source to compare
with.
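As a sketch of what such a three-way comparison could look like (the
mappings below are toy data, not real extracts from any of the three
sources):

```python
# Hypothetical per-source language -> territory mappings (toy data):
# compare which territories each source assigns to a language.
sources = {
    "glottolog": {"deu": {"DE", "AT", "CH"}},
    "ethnologue": {"deu": {"DE", "AT", "CH", "LU"}},
    "cldr": {"deu": {"DE", "AT", "CH", "BE"}},
}

def discrepancies(lang: str) -> dict:
    """Per source, the territories not agreed on by all sources."""
    sets = {name: data.get(lang, set()) for name, data in sources.items()}
    agreed = set.intersection(*sets.values())
    return {name: s - agreed for name, s in sets.items() if s - agreed}

print(discrepancies("deu"))
```

Entries the function reports would be exactly the ones needing an
editorial decision; everything in the intersection can go straight into an
initial dataset.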
>
> I agree with you that there is great value in open data sets.
>
> all the best,
>
> Hugh Paterson III
>
> On Mon, Nov 21, 2016 at 7:06 PM, Mats Blakstad <mats.gbproject at gmail.com>
wrote:
>>
>> Thanks for the reply, Steven!
>> Also thanks to Mark Davis for explaining more about the calculation of
language speakers within a territory.
>>
>> I'm interested in helping provide data - however, it is not clear to me
whether it is possible, or what the criteria are.
>>
>> I initially wanted to use a language-country dataset from the Ethnologue:
>> https://www.ethnologue.com/codes/download-code-tables
>> I wanted to try playing with this data: filter out only living
languages, merge it with data from the IANA subtag registry and CLDR
locales to also map different variants and standards of languages, and see
if I could make some infographics or compile it with data from other
sources.
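For the IANA part, the language-subtag-registry is a plain-text file of
%%-separated records, so the merge starts with a small parser. A sketch
using two toy records modeled on the registry format (continuation lines
are ignored here for brevity):

```python
SAMPLE = """\
Type: language
Subtag: ovd
Description: Elfdalian
Description: Övdalian
Added: 2016-06-16
%%
Type: language
Subtag: sv
Description: Swedish
Added: 2005-10-16
"""

def parse_registry(text: str) -> list:
    """Parse IANA language-subtag-registry style records:
    %%-separated blocks of 'Field: value' lines, with repeated
    fields (e.g. Description) collected into lists."""
    records = []
    for chunk in text.split("%%"):
        record = {}
        for line in chunk.strip().splitlines():
            if ": " not in line:
                continue  # skip continuation lines in this sketch
            field, value = line.split(": ", 1)
            record.setdefault(field, []).append(value)
        if record:
            records.append(record)
    return records

for rec in parse_registry(SAMPLE):
    print(rec["Subtag"][0], rec["Description"])
```

The resulting records could then be joined against the ISO 639-3 tables on
the subtag/code field.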
>>
>> However, even though this data is free to download, it is licensed: you
can't change it, and you can't make it available for others to download.
>>
>> I contacted the Ethnologue to ask if I could use the data. After a month
I got an answer that they want to see an example of the new dataset, and
then they can give me a price for it.
>> As I see it, this puts a lot of constraints on me. I don't have money to
buy that dataset from the Ethnologue, and I don't want to ask them every
time I want to make changes or try something new (and maybe wait a month
every time for their answer). I guess this is also one of the advertised
benefits of open data: you can simply adapt and use it for your own
purposes without needing to ask anyone.
>>
>> Then I asked here on the list whether we could manage to make a full
language-territory mapping within CLDR, but the answers on this list so far
are that such a mapping would be very subjective (even though it is also
stated that it is not needed, as the Ethnologue already made a good
dataset).
>>
>> So I suggested that, in that case, we could go for purely objective
criteria: we map languages to territories based on evidence of the number
of people speaking the language in the territory. With this approach it
doesn't matter how big or small the population is, and anyone using the
data can extract the data they need based on their own criteria (e.g. only
use languages spoken by more than 5% of the population within a territory).
It has then been said that the data for the smaller languages is not useful
and that it is unrealistic, as not all languages have locale data, but of
course these subjective comments don't clarify what the objective criteria
are.
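The consumer-side filtering described here is trivial once raw counts
exist. A sketch with entirely hypothetical rows (language, territory,
speakers, territory population):

```python
# Hypothetical rows: (language, territory, speakers, territory population).
rows = [
    ("ovd", "SE", 2_000, 10_000_000),      # Elfdalian in Sweden
    ("sv",  "SE", 9_200_000, 10_000_000),  # Swedish in Sweden
    ("fi",  "SE", 200_000, 10_000_000),    # Finnish in Sweden
]

def over_threshold(rows, min_share):
    """Keep (language, territory) pairs where the language is spoken
    by at least min_share of the territory's population; each data
    consumer picks their own cutoff."""
    return [(lang, terr) for lang, terr, speakers, pop in rows
            if speakers / pop >= min_share]

print(over_threshold(rows, 0.05))  # a 5% cutoff
```

With the raw counts published, a 5% cutoff, a 0.01% cutoff, or no cutoff at
all are all one-line choices for the consumer rather than editorial
decisions baked into the dataset.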
>>
>> I understand that collecting a full dataset is not just a 1-2-3, but
some clear criteria that apply to all languages should be developed, so the
data can be structured to facilitate collecting it in the long run:
>> - What is the minimum amount of data needed to add support for a
language in CLDR?
>> - Can any language be included? And if not, what are the criteria we
operate with? As an example, I would like to add Elfdalian; it is pretty
straightforward: 2,000 speakers in Sweden, in Dalarna (subdivision SE-W).
Can I just open a ticket and get this data added to CLDR once it has been
reviewed?
>> - What criteria are applied for language-territory mapping? For
instance, the Ethnologue has a notion of "immigrant" languages. Should
objective or subjective criteria be used?
>> http://unicode.org/cldr/trac/ticket/9897
>> http://unicode.org/cldr/trac/ticket/9915
>>
>> The way I see it, starting with some language-territory mapping,
especially mapping with subdivisions, before we have reliable sources of
accurate population figures, could also help generate more data in the long
run, as it is much easier to collect the data once it has been
geographically mapped.
>>
>> About language status, I would be happy to start adding data, but maybe
it should be clarified exactly which categories are most feasible?
>> http://unicode.org/cldr/trac/ticket/9856
>> http://unicode.org/cldr/trac/ticket/9916
>>
>> Mats
>>
>> On 22 November 2016 at 01:00, Steven R. Loomis <srl at icu-project.org>
wrote:
>>>
>>> Mats,
>>>  I replied to your tickets http://unicode.org/cldr/trac/ticket/9915 and
http://unicode.org/cldr/trac/ticket/9916 – thank you for the good ideas (as
far as completeness goes), but it’s not really clear what the purpose of
the ticket should be.
>>>
>>> El 11/20/16 11:35 AM, "CLDR-Users en nombre de Mats Blakstad" <
cldr-users-bounces at unicode.org en nombre de mats.gbproject at gmail.com>
escribió:
>>>
>>>> I understand it would take a lot of time to collect the full data, but
it also depends on how much engagement you manage to create around the
work.
>>>>
>>>> On the other side: simply allowing users to start providing the data
is the first step in the process, and it would take very little time to do!
>>>
>>>
>>> It’s not clear how users are hindered from providing data now. At
present, the data is very meticulously collected from a number of sources,
including feedback comments.
>>>
>>> Steven
>>>
>>>>
>>>> On 20 November 2016 at 19:54, Doug Ewell <doug at ewellic.org> wrote:
>>>>>
>>>>> Mats,
>>>>>
>>>>> I think you are genuinely underestimating the time and effort that
this project would take.
>>
>>
>