Dataset for all ISO 639 codes sorted by country/territory?

Hugh Paterson hugh_paterson at sil.org
Tue Nov 22 02:05:01 CST 2016


Mats,

Just a thought,

What do you gain by using the Ethnologue tables (encoded in ISO 8859-1)
over just using the openly licensed ISO 639-3 tables (in UTF-8)?
http://www-01.sil.org/iso639-3/download.asp
I have noticed some differences in the names of languages in these two
files. I would stick with the UTF-8 tables. The UTF-8 tables are the source
of the Ethnologue data, not the other way round.
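
Incidentally, the ISO 639-3 code table is a plain tab-separated UTF-8 file,
so it is easy to work with directly. A minimal sketch in Python (the
filename and the 'Id'/'Ref_Name' column names are my assumptions from the
download page, so verify them against the file you actually fetch):

    import csv

    # Read the ISO 639-3 code table (tab-separated, UTF-8).
    # Filename and column names are assumptions; verify against the download.
    with open("iso-639-3.tab", encoding="utf-8", newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            # Three-letter code and reference name for each entry.
            print(row["Id"], row["Ref_Name"])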

The Ethnologue does provide a country correspondence, but that data cannot
be modified (due to its license). However, there is another project called
Glottolog http://glottolog.org which proposes a GPS coordinate for most
languages http://glottolog.org/glottolog/language (their definition of a
"language" differs from ISO 639-3's, but their data includes the ISO 639-3
set of language distinctions). Glottolog data is a bit more open than the
Ethnologue data: the Glottolog 2.7 data is licensed under Creative Commons
Attribution-ShareAlike 3.0 and is available on GitHub:
https://github.com/clld/glottolog-data
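
If you want those coordinates keyed by ISO 639-3 code, something like the
following sketch would do it (the CSV filename and column headers are
assumptions based on Glottolog's geo export, so adjust them to the actual
files in the repository):

    import csv

    # Build {iso_code: (lat, lon)} from a Glottolog geo export.
    # Filename and headers are assumptions; check the real export.
    coords = {}
    with open("languages-and-dialects-geo.csv", encoding="utf-8",
              newline="") as f:
        for row in csv.DictReader(f):
            if row.get("isocodes") and row.get("latitude"):
                coords[row["isocodes"]] = (float(row["latitude"]),
                                           float(row["longitude"]))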

Now, we can't just go out and build upon the Ethnologue's data tables, but
with a GPS coordinate from an open data table, a query of the GeoHack API
would return a country code and a secondary administrative unit of a
political entity for that coordinate. Here is an example using the
coordinates for Frankfurt am Main, Germany:

https://tools.wmflabs.org/geohack/geohack.php?pagename=Frankfurt&params=50_7_N_8_41_E_type:city(732688)_region:DE-HE
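
GeoHack also accepts decimal degrees, so URLs like the one above can be
generated from a plain coordinate pair. A sketch (note that GeoHack returns
an HTML page of map links rather than structured data, so extracting a
region code from the response is left to the caller):

    from urllib.parse import urlencode

    def geohack_url(name, lat, lon):
        # GeoHack accepts decimal degrees with hemisphere letters.
        ns = "N" if lat >= 0 else "S"
        ew = "E" if lon >= 0 else "W"
        params = "%.4f_%s_%.4f_%s" % (abs(lat), ns, abs(lon), ew)
        return ("https://tools.wmflabs.org/geohack/geohack.php?"
                + urlencode({"pagename": name, "params": params}))

    print(geohack_url("Frankfurt", 50.1167, 8.6833))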

Now, the accessible Ethnologue tables could be used to verify the GPS point
data obtained from Glottolog. If there were a discrepancy between the two
data sets, one would have to make an editorial choice between the two
sources. In essence, the functionality of the language-country
correspondence would be replicated from different sources and merely
verified to be congruent with the Ethnologue data tables.
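
The cross-check itself is straightforward: flag any ISO 639-3 code whose
two coordinates lie more than some threshold apart, and queue those for an
editorial decision. A sketch, assuming both sources have already been
loaded into dicts mapping ISO codes to (lat, lon) pairs as above:

    from math import asin, cos, radians, sin, sqrt

    def great_circle_km(a, b):
        # Haversine distance between two (lat, lon) pairs, in km.
        lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
        h = (sin((lat2 - lat1) / 2) ** 2
             + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
        return 2 * 6371 * asin(sqrt(h))

    def discrepancies(glottolog, ethnologue, threshold_km=100):
        # Codes present in both sources whose points disagree.
        for code in glottolog.keys() & ethnologue.keys():
            d = great_circle_km(glottolog[code], ethnologue[code])
            if d > threshold_km:
                yield code, d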

I agree with you that there is great value in open data sets.

all the best,

Hugh Paterson III

On Mon, Nov 21, 2016 at 7:06 PM, Mats Blakstad <mats.gbproject at gmail.com>
wrote:

> Thanks for the reply, Steven!
> Also thanks to Mark Davis for explaining more about calculation of
> language speakers within a territory.
>
> I'm interested in helping provide data - however, it is not clear to me
> whether it is possible or what the criteria are.
>
> I initially wanted to use a language-country dataset from the Ethnologue:
> https://www.ethnologue.com/codes/download-code-tables
> I wanted to play with this data: filter out only living languages, merge
> it with data from the IANA subtag registry and CLDR locales to also map
> different variants and standards of languages, and see if I could make
> some infographics or combine it with data from other sources.
>
> However, even though this data is free to download, it is restrictively
> licensed: you can't change it, and you can't make it available for others
> to download.
>
> I contacted the Ethnologue to ask if I could use the data. After a month
> I got an answer that they want to see an example of the new dataset, and
> then they can give me a price for it.
> As I see it, this puts a lot of constraints on me. I don't have the money
> to buy that dataset from the Ethnologue, and I don't want to ask them
> every time I want to make changes or try something new (and maybe wait a
> month every time for their answer). I guess this is also one of the
> advertised benefits of open data: you can simply adapt and use it for
> your own purposes without needing to ask anyone.
>
> Then I asked here on the list whether we could manage to make a full
> language-territory mapping within CLDR, but the answers on this list so
> far are that such a mapping would be very subjective (even though it has
> also been stated that it is not needed, as the Ethnologue already made a
> good dataset).
>
> So I suggested that, in that case, we could go for purely objective
> criteria: we map languages to territories based on evidence of the number
> of people speaking the language in the territory. With this approach it
> doesn't matter how big or small the population is, and anyone using the
> data can extract the data they need based on their own criteria (e.g.
> only use languages with more than 5% of speakers within a territory).
> Then it was said that the data for the smaller languages is not useful
> and that it is unrealistic as not all languages have locale data, but of
> course these subjective comments don't clarify what the objective
> criteria are.
>
> I understand that collecting a full dataset is not a simple 1-2-3, but
> some clear criteria that apply to all languages should be developed, so
> the data can be structured in a way that lets this be done in the long
> run:
> - What is the minimum amount of data needed to add support for a language
> in CLDR?
> - Can any language be included? And if not, what are the criteria we
> operate with? As an example, I would like to add Elfdalian
> <https://en.wikipedia.org/wiki/Elfdalian>; it is pretty straightforward:
> 2000 speakers in Sweden, in Dalarna (subdivision SE-W). Can I just open a
> ticket and get this data added to CLDR once it's been reviewed?
> - What criteria are applied for the language-territory mapping? For
> instance, the Ethnologue has a notion of "immigrant" languages. Should
> objective or subjective criteria be used?
> http://unicode.org/cldr/trac/ticket/9897
> http://unicode.org/cldr/trac/ticket/9915
>
> The way I see it, starting with some language-territory mapping,
> especially mapping that includes subdivisions, before we have reliable
> sources of accurate population figures, could also help generate more
> data in the long run, as it is much easier to collect the data once it
> has been geographically mapped.
>
> Regarding language status, I would be happy to start adding data, but
> maybe it should be clarified exactly which categories are most feasible?
> http://unicode.org/cldr/trac/ticket/9856
> http://unicode.org/cldr/trac/ticket/9916
>
> Mats
>
> On 22 November 2016 at 01:00, Steven R. Loomis <srl at icu-project.org>
> wrote:
>
>> Mats,
>>  I replied to your tickets http://unicode.org/cldr/trac/ticket/9915 and
>> http://unicode.org/cldr/trac/ticket/9916 – thank you for the good ideas
>> (as far as completeness goes), but it’s not really clear what the purpose
>> of the ticket should be.
>>
>> On 11/20/16 11:35 AM, "CLDR-Users on behalf of Mats Blakstad" <
>> cldr-users-bounces at unicode.org on behalf of mats.gbproject at gmail.com>
>> wrote:
>>
>> I understand it would take a lot of time to collect the full data, but it
>> also depends on how much engagement you manage to create for the work.
>>
>> On the other hand: simply allowing users to start providing the data is
>> the first step in the process, and it would take very little time to do!
>>
>>
>> It’s not clear how users are hindered from providing data now. At
>> present, the data is very meticulously collected from a number of sources,
>> including feedback comments.
>>
>> Steven
>>
>>
>> On 20 November 2016 at 19:54, Doug Ewell <doug at ewellic.org> wrote:
>>
>>> Mats,
>>>
>>> I think you are genuinely underestimating the time and effort that this
>>> project would take.
>>>
>>
>