Dataset for all ISO 639 codes sorted by country/territory?

Hugh Paterson hugh_paterson at sil.org
Thu Nov 24 10:42:56 CST 2016


Mats,
How do you know that Glottolog did not copy Ethnologue data, or that its
primary cited evidence is not the Ethnologue itself? Glottolog should not
be cited as a source itself, but rather treated as an aggregation of facts,
which in turn need independent citations. Some types of Glottolog data are
produced via scripted data extraction.

In contrast, the editors of the Ethnologue do host workshops in various
regions of the world and directly elicit data from language community
members. (Not 100% of their data is collected this way; some comes from
language development workers or from academics who work in these
communities.)

So, qualitatively the two sources are very different, and they deserve
correspondingly different levels of trust. Just because we read a news
story on both the BBC's and Al Jazeera's websites does not mean that the
story is accurate or even true.

- Hugh


On Wed, Nov 23, 2016 at 3:24 PM, Mats Blakstad <mats.gbproject at gmail.com>
wrote:

>
>
> On 22 November 2016 at 20:24, Steven R. Loomis <srl at icu-project.org>
> wrote:
>
>> On 11/21/16 7:06 PM, "Mats Blakstad" <mats.gbproject at gmail.com> wrote:
>>
>> Thanks for the reply, Steven!
>> Also thanks to Mark Davis for explaining more about the calculation of
>> language speakers within a territory.
>>
>> I'm interested in helping provide data; however, it is not clear to me
>> whether that is possible, or what the criteria are.
>>
>>
>> If you are talking about locale data – the criteria are here
>> http://cldr.unicode.org/index/bug-reports#New_Locales
>>
>
> Thanks for the info! It seems there are several languages included in
> supplementalData.xml that do not have locales, so it looks like we can
> easily add new supplemental data for languages without locales. It also
> looks like there is support for languages spoken by as little as 0.0031%
> of a territory's population, so several small languages are already
> supported.
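>
> A minimal sketch (in Python) of how one can inspect this; it assumes a
> local copy of supplementalData.xml and the <territoryInfo> element layout
> used in current CLDR releases:
>
>     import xml.etree.ElementTree as ET
>
>     tree = ET.parse("supplementalData.xml")  # path to a local CLDR copy
>
>     # Each <territory> element holds <languagePopulation> entries mapping
>     # a language to a share of that territory's population.
>     for territory in tree.findall(".//territoryInfo/territory"):
>         for lang in territory.findall("languagePopulation"):
>             pct = float(lang.get("populationPercent", "0"))
>             if pct < 0.01:  # the very small languages mentioned above
>                 print(lang.get("type"), territory.get("type"), pct)
>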
>
>>
>> If you are talking about supplemental data (such as population figures,
>> etc.), it would be important to know what you are actually trying to do
>> with the data, and where it is insufficient. Adding more data to add more
>> data is not a sufficient reason.
>>
>
> Yes, I'm talking about the supplemental data. I don't want to add data
> just "to add more data", though I definitely think that building data
> which can help generate more data about, and support for, more languages
> is a valid reason.
>
> I want to use the data for many things: to more easily identify the
> likely second language of speakers of "lesser known languages", based on
> the HTTP Accept-Language header and the territory or subdivision the user
> is located in; to present information in these languages, and a language
> switcher for them, depending on which territory/subdivision the user is
> from; and to offer users the chance to help translate into local
> languages depending on their territory/subdivision. The bottom line is:
> to give a better user experience to people speaking "lesser known
> languages". With a language-territory mapping it will be possible for
> developers to use this data in new, creative ways to better support
> multilingualism.
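>
> A rough sketch of that first use case, with a hypothetical
> TERRITORY_LANGUAGES mapping standing in for data extracted from the CLDR
> supplemental data (the Accept-Language parsing is deliberately
> simplified):
>
>     # Hypothetical territory -> languages mapping; real data would be
>     # generated from supplementalData.xml.
>     TERRITORY_LANGUAGES = {
>         "TG": ["fr", "ee", "kdh"],  # Togo: French, Ewe, Tem
>         "SE": ["sv", "ovd"],        # Sweden: Swedish, Elfdalian
>     }
>
>     def candidate_languages(territory, accept_language):
>         """Languages spoken in the territory, ordered so that those the
>         user already asks for in Accept-Language come first."""
>         preferred = list(dict.fromkeys(
>             part.split(";")[0].strip().split("-")[0]
>             for part in accept_language.split(",")))
>         local = TERRITORY_LANGUAGES.get(territory, [])
>         return ([l for l in preferred if l in local]
>                 + [l for l in local if l not in preferred])
>
>     print(candidate_languages("TG", "fr-FR,fr;q=0.9,en;q=0.5"))
>     # -> ['fr', 'ee', 'kdh']
>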
>
>
>>  I do want to see better support for all languages, certainly. But that
>> is a time-consuming process, involving individual people and languages,
>> not bulk datasets.
>>
>
> I do not really understand why bulk datasets should not be accepted; it
> seems to me that data is added based on evidence, so whether the data is
> added should depend on whether it comes from a reliable source. Besides,
> I'm an individual person, and I'm ready to be involved!
>
>>
>> Then I asked here on the list whether we could manage to make a full
>> language-territory mapping within CLDR, but the answers on this list so
>> far are that such a mapping would be very subjective (even though it has
>> also been stated that it is not needed, as the Ethnologue already
>> provides a good dataset).
>>
>>
>> All of this is more of a discussion to have with the Ethnologue. I browse
>> the Ethnologue somewhat frequently, but I do not see the benefit in simply
>> importing it into the CLDR supplemental data.
>>
>> So I suggested that we could instead go for purely objective criteria:
>> we map languages to territories based on evidence of the number of
>> people speaking the language in the territory. With this approach it
>> doesn't matter how big or small the population is, and anyone using the
>> data can extract what they need based on their own criteria (e.g. only
>> use languages with more than 5% of speakers within a territory). It has
>> then been said that the data for the smaller languages is not useful,
>> and that it is unrealistic since not all languages have locale data, but
>> these subjective comments don't clarify what the objective criteria are.
>>
>>
>> What are your objective criteria?
>>
>
> I would say: we map any language to a territory based on evidence. Where
> we can document a number of speakers, we add the language no matter what
> status it has.
> If we can't accurately state a number of speakers, but know that the
> territory is the primary place the language is spoken, we map it even
> without an accurate language population. For example, from Glottolog we
> can see that the language Tem is spoken in Benin, Ghana and Togo, and
> this information can easily be verified by comparing it with the data
> from the Ethnologue:
> http://glottolog.org/resource/languoid/id/temm1241
> https://www.ethnologue.com/language/kdh
> We can't copy the Ethnologue's population data, but at least we know that
> two reliable sources say that this is the correct language-territory
> mapping.
> Based on this evidence we can map the Tem language to Benin, Ghana and
> Togo even though we do not have exact population figures.
> I guess in many cases the mapping in itself is enough to do many things
> to support "lesser known languages".
> Those not interested in this mapping can of course easily extract only
> the territory-language mappings that include an indication of language
> population, as in the sketch below.
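>
> A sketch of that "extract by your own criteria" idea, with illustrative
> records (the figures below are made up for the example, not proposed
> data):
>
>     # (language, territory, percent of territory population, or None if
>     # the mapping is attested but no population figure is documented)
>     MAPPINGS = [
>         ("kdh", "TG", 5.1),   # illustrative figure
>         ("kdh", "BJ", None),  # mapping attested, no population figure
>         ("kdh", "GH", None),
>         ("ovd", "SE", 0.02),  # ~2000 of ~9.8 million
>     ]
>
>     def extract(mappings, min_percent=None):
>         for lang, terr, pct in mappings:
>             if min_percent is not None and (pct is None or pct < min_percent):
>                 continue  # caller asked for a documented minimum share
>             yield lang, terr
>
>     print(list(extract(MAPPINGS)))                   # every attested mapping
>     print(list(extract(MAPPINGS, min_percent=5.0)))  # only the >=5% mappings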
>
>
>> I understand that collecting a full dataset is not a simple 1-2-3, but
>> some clear criteria that apply to all languages should be developed, so
>> the data can be structured in a way that makes this feasible in the long
>> run:
>> - What is the minimum amount of data needed to add support for a
>> language in CLDR?
>>
>>
>> That information is at
>> http://cldr.unicode.org/index/bug-reports#New_Locales
>>
>> - Can any language be included?
>>
>>
>> Theoretically, yes.
>>
>> And if not, what are the criteria we operate with? For example, I would
>> like to add Elfdalian <https://en.wikipedia.org/wiki/Elfdalian>; it is
>> pretty straightforward: 2000 speakers in Sweden, in Dalarna (subdivision
>> SE-W). Can I just open a ticket and get this data added to CLDR once it
>> has been reviewed?
>>
>>
>> Yes.
>>
>> But, just as with ancient Latin, it’s all just an interesting thought
>> exercise, unless a ticket is opened.
>>
>
> Done:
> http://unicode.org/cldr/trac/ticket/9919
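>
> For reference, a hypothetical entry of the kind this ticket asks for,
> using the <territoryInfo> syntax of supplementalData.xml and assuming
> Elfdalian's ISO 639-3 code "ovd"; the values shown are illustrative, not
> reviewed data:
>
>     <territory type="SE">  <!-- existing attributes elided -->
>         <!-- ~2000 speakers, Dalarna (subdivision SE-W) -->
>         <languagePopulation type="ovd" populationPercent="0.02"/>
>     </territory>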
>
>>
>