Dataset for all ISO 639 codes sorted by country/territory?

Mats Blakstad mats.gbproject at gmail.com
Thu Nov 24 16:21:28 CST 2016


On 24 November 2016 at 17:42, Hugh Paterson <hugh_paterson at sil.org> wrote:

> Mats,
> How do you know that Glottolog did not copy Ethnologue data, or that its
> primary cited evidence is the Ethnologue? Glottolog should not be cited as
> a source itself, but rather treated as an aggregation of facts, which are
> in turn in need of independent citations. Some types of Glottolog data are
> produced via scripted data extraction.
>

I'm not sure; however, the question is speculative. Glottolog is published
under a Creative Commons license, so the license gives the right to copy the
data.

>
>
> In contrast, the editors of the Ethnologue do host workshops in various
> regions of the world and directly elicit data from language community
> members. (But not 100% of their data is collected this way, some come from
> language development workers, or academics who work in these communities.)
>

I sent an email to ask Glottolog and got an answer from Harald Hammarström,
who gave me permission to quote him:

*For the language inventory Glottolog is more reliable than Ethnologue.
While it's true that SIL has teams that do surveys and Glottolog does not,
Glottolog cites such surveys (including those not done by SIL) and adjusts.
It is correct that Ethnologue was used as a starting point of the Glottolog
inventory and a lot of it turns out to be correct given the entire
literature out there. If this is what is meant by "copy" then it is
correct. In this sense basically every handbook (incl Ethnologue) has
copied every preceding one and this is a good practice as long as it is
cited. Around 10% of the Ethnologue inventory has been revised into what is
now Glottolog. Glottolog does not cite Ethnologue every time an entry
corresponds (though we do give the link), because Ethnologue does not
provide sources itself, instead for every language there is at least one
reference to the literature where one can go and find more information
about the language from a book or paper which does explain how they got
their data and so on. The dialect inventory in Glottolog, on the other
hand, is not reliable. The language-country mappings (is this what you mean
by language-territory mappings?) are trivial as soon as the identity of the
language is established and should be the same as in Ethnologue whenever
the language identity is parallel, with the exception that Glottolog is
more restrictive in adding the country of an immigrant community (+ various
misc revisions). I do not consider language-country mappings a well-defined
problem in the age of globalization when you can have a majority of a
speaker community living in the capital of a country different from that of
their home community, so the language-country mappings are reviewed only to
the degree that the country/ies listed by Glottolog are a subset of those
where the speakers live or lived at the first eyewitness ethnographic
documentation time. *


> So, qualitatively the two sources are very different, and deserve
> appropriate levels of respect. Just because we read a news story on the BBC
> and on Al Jazeera's websites does not mean that the story is accurate or
> even true.
>

I'm not really sure whether Ethnologue has better-quality language-territory
mappings than Glottolog. However, Glottolog is something that can be built
on, as it is Creative Commons licensed, so it is the only viable starting
point. It will, however, be interesting to compare the two datasets to see
how much they diverge.
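Such a comparison boils down to set operations over (language, territory) pairs. A minimal sketch, using made-up placeholder pairs rather than real Glottolog or Ethnologue data:

```python
# Sketch: comparing two language-territory mapping sets to see how much
# they diverge. The pairs below are illustrative placeholders, not real data.
glottolog = {("kdh", "BJ"), ("kdh", "GH"), ("kdh", "TG")}
ethnologue = {("kdh", "BJ"), ("kdh", "GH"), ("kdh", "TG"), ("kdh", "NG")}

only_glottolog = glottolog - ethnologue    # pairs only one source claims
only_ethnologue = ethnologue - glottolog
# Jaccard-style agreement: shared pairs over all pairs claimed by either
agreement = len(glottolog & ethnologue) / len(glottolog | ethnologue)
```

With these placeholder sets the two sources agree on three of four pairs, so `agreement` comes out at 0.75.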

>
> - Hugh
>
>
> On Wed, Nov 23, 2016 at 3:24 PM, Mats Blakstad <mats.gbproject at gmail.com>
> wrote:
>
>>
>>
>> On 22 November 2016 at 20:24, Steven R. Loomis <srl at icu-project.org>
>> wrote:
>>
>>> El 11/21/16 7:06 PM, "Mats Blakstad" <mats.gbproject at gmail.com>
>>> escribió:
>>>
>>> Thanks for the reply, Steven!
>>> Also thanks to Mark Davis for explaining more about calculation of
>>> language speakers within a territory.
>>>
>>> I'm interested to help provide data - however to me it is not clear if
>>> it is possible or what the criteria are.
>>>
>>>
>>> If you are talking about locale data – the criteria are here
>>> http://cldr.unicode.org/index/bug-reports#New_Locales
>>>
>>
>> Thanks for the info! It seems like there are several languages added
>> inside supplementalData.xml that do not have locales, so it seems we can
>> easily add new supplemental data for languages without locales. It also
>> looks like there is support for languages with as little as 0.0031% of the
>> speakers, so several small languages are already supported.
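For reference, the territory/population data in supplementalData.xml can be read with any XML parser. A sketch, where the element and attribute names follow CLDR's territoryInfo structure but the sample document is illustrative, not real CLDR data:

```python
# Sketch: extracting (language, territory, percent) triples from the
# territoryInfo section of CLDR's supplementalData.xml. The SAMPLE string
# is a made-up stand-in for the real file.
import xml.etree.ElementTree as ET

SAMPLE = """\
<supplementalData>
  <territoryInfo>
    <territory type="TG" population="7500000">
      <languagePopulation type="fr" populationPercent="30" officialStatus="official"/>
      <languagePopulation type="kdh" populationPercent="4.1"/>
    </territory>
  </territoryInfo>
</supplementalData>"""

root = ET.fromstring(SAMPLE)
mappings = [
    (lp.get("type"), territory.get("type"), float(lp.get("populationPercent")))
    for territory in root.iter("territory")
    for lp in territory.findall("languagePopulation")
]
# mappings → [('fr', 'TG', 30.0), ('kdh', 'TG', 4.1)]
```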
>>
>>>
>>> If you are talking about supplemental data (such as population figures,
>>> etc) it would be important to know what you are actually trying to do with
>>> the data, and where it is insufficient. Adding more data to add more data
>>> is not a sufficient reason.
>>>
>>
>> Yes, I'm talking about the supplemental data. I don't only want to add
>> data "to add more data", even though I definitely think that building data
>> which can help generate more data about, and support for, more languages
>> is a valid reason.
>>
>> I want to use the data for many things: to more easily identify the
>> likely second language of speakers of "lesser known languages" based on
>> the HTTP Accept-Language header and the territory or subdivision they are
>> in; to present information in these languages, and a language switcher for
>> them, depending on which territory/subdivision the user is from; and to
>> offer users the chance to help translate into local languages depending on
>> their territory/subdivision. The bottom line is: to give a better user
>> experience to people speaking "lesser known languages". With a
>> language-territory mapping, developers will also be able to use this data
>> in new, creative ways to better support multilingualism.
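The selection logic described above can be sketched as a small helper. The table, language codes and percentages are made-up examples, not CLDR data, and the Accept-Language parsing here ignores q-weight ordering for brevity:

```python
# Hypothetical helper: given the visitor's territory and Accept-Language
# header, offer the locally spoken languages, promoting the ones the user
# already asked for. LANGS_BY_TERRITORY is an illustrative stand-in for a
# real language-territory mapping.
LANGS_BY_TERRITORY = {
    "TG": [("fr", 30.0), ("ee", 20.0), ("kdh", 4.1)],
}

def languages_to_offer(territory, accept_language, min_percent=0.0):
    """Locally spoken languages meeting `min_percent`, with the user's
    Accept-Language preferences moved to the front."""
    local = [code for code, pct in LANGS_BY_TERRITORY.get(territory, [])
             if pct >= min_percent]
    # Strip q-weights ("fr;q=0.8" -> "fr"); order of the header is kept.
    preferred = [c.split(";")[0].strip() for c in accept_language.split(",")]
    return sorted(local, key=lambda c: (preferred.index(c)
                                        if c in preferred else len(preferred)))
```

For example, a visitor in Togo whose browser sends `kdh, fr;q=0.8` would be offered kdh first, then fr, then the remaining local languages.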
>>
>>
>>>  I do want to see better support for all languages, certainly. But that
>>> is a time consuming process, involving individual people and languages— not
>>> bulk datasets.
>>>
>>
>> I do not really understand why bulk datasets should not be accepted; to
>> me it seems like data is added based on evidence, so whether the data is
>> added should depend on whether it comes from a reliable source.
>> Besides, I'm an individual person and I'm ready to be involved!
>>
>>>
>>> Then I asked here in the list if we could maybe manage to make a full
>>> language-territory mapping within CLDR, but the answers on this list
>>> until now are that such a mapping would be very subjective (even though
>>> it is also stated that it is not needed, as Ethnologue has made a good
>>> dataset already).
>>>
>>>
>>> All of this is more of a discussion to have with the Ethnologue. I
>>> browse the Ethnologue somewhat frequently, but I do not see the benefit in
>>> simply importing it into the CLDR supplemental data.
>>>
>>> So I suggested that, if so, we could go for purely objective criteria:
>>> we map languages to territories based on evidence of the number of people
>>> speaking the language in the territory. With this approach it doesn't
>>> matter how big or small the population is, and anyone using the data can
>>> extract the data they need based on their own criteria (e.g. only use
>>> languages with more than 5% of speakers within a territory). Then it's
>>> been said that the data for the smaller languages is not useful and that
>>> it is unrealistic, as not all languages have locale data, but of course
>>> these subjective comments don't clarify what the objective criteria are.
>>>
>>>
>>> What are your objective criteria?
>>>
>>
>> I would say: we map any language to a territory based on evidence. Where
>> we can document a number of speakers, we add the language no matter what
>> status it has.
>> If we can't accurately state a number of speakers, but know that the
>> territory is the primary place the language is spoken, we map it even
>> without an accurate language population. As an example: from Glottolog we
>> can see that the language Tem is spoken in Benin, Ghana and Togo, and this
>> information can easily be verified by comparing it with the data from
>> Ethnologue:
>> http://glottolog.org/resource/languoid/id/temm1241
>> https://www.ethnologue.com/language/kdh
>> We can't copy Ethnologue's population data, but at least we know that two
>> reliable sources say that this is the correct language-territory mapping.
>> Based on this evidence we can now map the Tem language to Benin, Ghana and
>> Togo even though we do not have exact population figures.
>> I guess in many cases the mapping in itself is enough to do many things
>> to support "lesser known languages".
>> Those not interested in this mapping can of course easily extract only the
>> territory-language mappings that have an indication of language population.
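A data model along those lines is straightforward: make the population field optional, so attested-but-unquantified mappings can coexist with fully documented ones. A sketch, using the Tem example from above (the record shape is my own illustration, not a CLDR format):

```python
# Sketch: a mapping record where population is optional. Tem (kdh) is
# attested in Benin, Ghana and Togo by both Glottolog and Ethnologue,
# even though no reusable population figure is available.
from typing import NamedTuple, Optional

class Mapping(NamedTuple):
    language: str
    territory: str
    population_percent: Optional[float]  # None = attested, no figure known

TEM = [Mapping("kdh", t, None) for t in ("BJ", "GH", "TG")]

# Consumers who need population data can simply filter out the rest:
with_population = [m for m in TEM if m.population_percent is not None]
```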
>>
>>
>>> I understand that it is not just a 1-2-3 to collect a full dataset, but
>>> some clear criteria that apply to all languages should be developed, so
>>> the data can be structured to facilitate doing this in the long run:
>>> - What is the minimum of data needed to add support for languages in
>>> CLDR?
>>>
>>>
>>> That information is at
>>> http://cldr.unicode.org/index/bug-reports#New_Locales
>>>
>>> - Can any language be included?
>>>
>>>
>>> Theoretically, yes.
>>>
>>> And if not, what are the criteria we operate with? As an example, I
>>> would like to add Elfdalian <https://en.wikipedia.org/wiki/Elfdalian>;
>>> it is pretty straightforward: 2,000 speakers in Dalarna, Sweden
>>> (subdivision SE-W). Can I just open a ticket and get this data added to
>>> CLDR once it's been reviewed?
>>>
>>>
>>> Yes.
>>>
>>> But, just as with ancient Latin, it’s all just an interesting thought
>>> exercise, unless a ticket is opened.
>>>
>>
>> Done:
>> http://unicode.org/cldr/trac/ticket/9919
>>
>>>
>>
>