Dataset for all ISO639 code sorted by country/territory?

Mats Blakstad mats.gbproject at gmail.com
Sun Nov 20 12:30:25 CST 2016


On 20 Nov 2016 7:09 pm, "Shawn Steele" <Shawn.Steele at microsoft.com> wrote:
>
> Knowing “official” languages at the city level doesn’t seem that
interesting to me.  How do people/software developers use the data?
>

I agree that data at the city level is not necessary. What I suggested
was to provide it for subdivisions. One use case could be to provide
translations in a regional language to users from that region (for example,
Catalan translations to people from Catalonia).
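
To make the use case concrete, here is a rough sketch of how an application
might use subdivision-level data. The table name, the function name and the
single entry are my own assumptions for illustration; nothing like this
exists in CLDR today.

```python
# Hypothetical table: ISO 3166-2 subdivision code -> regional languages
# (ISO 639 codes). The single entry is illustrative, not CLDR data.
REGIONAL_LANGUAGES = {
    "ES-CT": ["ca"],  # Catalonia -> Catalan
}


def pick_translation(subdivision, country_default, available):
    """Return a regional language the app has a translation for,
    falling back to the country-level default language."""
    for lang in REGIONAL_LANGUAGES.get(subdivision, []):
        if lang in available:
            return lang
    return country_default


# A user from Catalonia gets Catalan if the app ships it.
print(pick_translation("ES-CT", "es", {"es", "ca", "en"}))
# A user from a subdivision with no entry gets the country default.
print(pick_translation("ES-MD", "es", {"es", "ca", "en"}))
```

The fallback matters: without a Catalan translation, or for a subdivision
with no entry, the user still gets the country-level language.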

>
>
> Ethnologue shows more Finnish speakers than Creek speakers in the US.
Certainly, the languages that are spoken only within a region have a
special relationship (but some seem missing?), but how do the other
“immigrant” languages like Korean get chosen?

In the tickets I've opened I've not suggested any definition of immigrant
languages. As long as we have a data source we could add population for a
language to a territory.
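
As a rough sketch of what I mean: if every population figure carried its
source and year, each application could apply its own cut-off instead of
CLDR having to pick one. The record layout, the names and the numbers below
are all made up for illustration ("qaa", "qab" and "XX" are private-use
placeholder codes, not real languages or territories).

```python
from dataclasses import dataclass


@dataclass
class LanguagePopulation:
    language: str   # ISO 639 code (placeholder values below)
    territory: str  # ISO 3166 code (placeholder values below)
    speakers: int   # made-up figure, not real data
    year: int       # year the figure was collected
    source: str     # where the figure came from


# Hypothetical records; none of these numbers come from CLDR or Ethnologue.
RECORDS = [
    LanguagePopulation("qaa", "XX", 5_000_000, 2010, "example census"),
    LanguagePopulation("qab", "XX", 40_000, 2015, "example survey"),
]


def languages_for(territory, min_speakers=0):
    """List languages recorded for a territory, applying the caller's own threshold."""
    return [
        r.language
        for r in RECORDS
        if r.territory == territory and r.speakers >= min_speakers
    ]


print(languages_for("XX"))             # every recorded language
print(languages_for("XX", 1_000_000))  # only the larger ones
```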

> More than xx% of the speakers?  More than a million speakers?  Also, the
percentages seem pretty different than Ethnologue, does CLDR have a better
source?
>

I also wonder about the sources!
>
>
> Tagalog isn’t even listed for US (even in Ethnologue?) so having a date
range, particularly for that, seems silly.
>

That the data is not there today is a poor argument for not providing it.
>
>
> But, again, how is this data used?
>
>
>
> -Shawn
>
>
>
> From: Mats Blakstad [mailto:mats.gbproject at gmail.com]
> Sent: Sunday, November 20, 2016 9:41 AM
> To: Hugh Paterson <hugh_paterson at sil.org>
> Cc: Shawn Steele <Shawn.Steele at microsoft.com>; cldr-users <
cldr-users at unicode.org>; Doug Ewell <doug at ewellic.org>
> Subject: Re: Dataset for all ISO639 code sorted by country/territory?
>
>
>
> I think it would be good to be able to add years to the language data, so if
Tagalog was no longer official because it became too expensive for California we
could say it was official until 2016.
>
>
>
> I also think this would be helpful to add for language population, as this
can be collected in different years, and it would make it easier to see if the
numbers are really outdated:
>
http://www.unicode.org/cldr/charts/latest/supplemental/language_territory_information.html
>
>
>
> I opened two tickets in CLDR:
> http://unicode.org/cldr/trac/ticket/9916
>
> http://unicode.org/cldr/trac/ticket/9915
>
>
>
> On 16 November 2016 at 18:42, Hugh Paterson <hugh_paterson at sil.org> wrote:
>>
>> Also, after thinking about this some more: If as is the stated case
with San Francisco,
>>
>> "San Francisco requires documents in 4 languages but provides telephone
help for 200 languages.  Where's the line?"
>>
>>
>>
>> How would you propose that Unicode database maintainers
de-list institutional support for languages when institutional support
ceases?
>>
>>
>>
>> i.e. let's say that San Francisco falls on some hard times financially,
cannot afford to operate in 4 languages, and reduces its support to two
languages. How is this to be reflected in this proposal?
>>
>>
>>
>> - Hugh Paterson III
>>
>>
>>
>> On Thu, Nov 10, 2016 at 2:54 PM, Mats Blakstad <mats.gbproject at gmail.com>
wrote:
>>>
>>> I'm continuing the discussion I started on unicode at unicode.org here:
>>> http://unicode.org/pipermail/unicode/2016-September/003964.html
>>>
>>> Sorry for posting to the wrong email list!
>>>
>>>
>>>
>>> On 10 November 2016 at 20:34, Shawn Steele <Shawn.Steele at microsoft.com>
wrote:
>>>>
>>>> I didn't really say anything because this is kinda a hopeless task,
but it seems like some realities are being overlooked.  I'm as curious
about cataloguing everything as the next OCD guy, but a general solution
doesn't seem practical.
>>>
>>> Maybe in addition to the number of speakers we could give each language
different values for the different territories, like official / unofficial,
lingua franca / home language, recognized / not recognized, etc.
>>>
>>> Maybe we could manage to work out some more objective categories?
>>> Then the dataset could cover more needs, since those who use it could
extract the list they want. For example, they could make a list
of only the official languages in the world sorted by country/territory, or
maybe a list of all non-recognized languages in different countries.
>>>
>>>
>>>>
>>>> * There are a *lot* of languages
>>>
>>> Yes :) We would not get them all at the start, but if we could start adding
data for all the languages it can be done little by little.
>>>
>>> I myself have many contacts that I think could be interested in
helping to add information.
>>>
>>>
>>>>
>>>> * Many countries have speakers of several languages.
>>>>         * In the US it's "obvious" that a list of languages for the US
should include "English"
>>>
>>> For sure! The number of speakers and the fact that it is the primary
language used speak for it.
>>>
>>> Besides, is "US English" not considered a variant of English?
>>>
>>>>
>>>>         * Spanish in the US is less obvious, however it is often
considered important.
>>>
>>> It is an interesting issue. Wasn't Spanish the primary language in the
southern US while it was a part of Mexico?
>>>
>>> And are there not a lot of Spanish newspapers/media in the US?
>>>
>>>
>>>>
>>>>         * However, that's a slippery slope as there are many other
languages with large groups of speakers in the US.  If such a list includes
Spanish, should it not include some of the others?  San Francisco requires
documents in 4 languages but provides telephone help for 200 languages.
Where's the line?
>>>> * Some languages happen in many places.  There are a disproportionate
# of Englishes in CLDR, however Chinese is also spoken in lots of the
countries that have English available in CLDR.  Yet CLDR doesn't provide
data for those.
>>>
>>> Could you elaborate a little bit on this?
>>>
>>>
>>>>
>>>> * Some language/region combinations could encounter geopolitical
issues.  Like "it's not legal for that language to be spoken in XX" (but it
happens).  Or "that language isn't YY country's language, it's ours!!!"
>>>
>>> We could add the documented number of speakers and tag it as "not
recognized"
>>>>
>>>>
>>>> * The requirement "where the language has been spoken traditionally"
is really, really subjective.  "Traditionally" the US is an English
speaking country.  However, "Traditionally", there are hundreds of
languages that have been spoken in the US.  What could be more
"traditional" than the native American languages?  Yet those often have low
numbers of speakers in the modern world, many are even dying languages.
There are also a number of "traditional" languages spoken by the original
settlers, which differ from the set of languages spoken by modern
immigrants.  So your data is going to be very skewed depending on the
person collecting the data's definition of "traditional".
>>>
>>> I agree "traditional" is not a good way to collect the data.
>>>
>>> Native American languages should of course be mapped to territories
despite having few speakers. The point is to map all languages.
>>>
>>> We could also map languages to years: English has then been spoken in
what is today the USA since 1607.
>>>
>>> Urdu has been spoken in what is today Norway since the 1970s.
>>>
>>>
>>>>
>>>>
>>>> Ethnologue has done a decent job of identifying languages and the
number of speakers in various areas, but it would be very difficult to draw
a line that selected "English and Spanish in the US" and was consistent
with similar real-life impacts across the other languages.  Do you pick the
top n languages for each country?  Languages with > x million speakers
(that would be very different in small and big countries).  Languages with
> y% of the speakers in the different countries?
>>>
>>>
>>>
>>> If Ethnologue has done it, I guess it should be possible for CLDR
too?
>>>
>>> However, they operate with a category "Immigrant Languages", and I'm not
sure what that means. For example Turkish, the second most spoken language
of Germany, is marked as an "Immigrant Language". I'm not sure how they
make that distinction.
>>>
>>>
>>>>
>>>>
>>>> And then you end up with each application having to figure out its
own bar.  Applications will have different market considerations and other
reasons to target different regions/languages.  That would skew any list
for their purposes.
>>>
>>>
>>>
>>> Okay, at least it could be possible to add the number of speakers for the
other "6,300 lesser-known living languages", or why do we cut the list to 675
languages?