Dataset for all ISO639 code sorted by country/territory?

Mats Blakstad mats.gbproject at gmail.com
Sun Nov 20 11:41:10 CST 2016


I think it would be good to be able to add years to the language data, so if
Tagalog were no longer official because it became too expensive for
California, we could say it was official until 2016.

I think this would also be helpful for the language population figures, as
these can be collected in different years, and it would make it easier to see
whether the numbers are badly outdated:
http://www.unicode.org/cldr/charts/latest/supplemental/language_territory_information.html

I opened two tickets in CLDR:
http://unicode.org/cldr/trac/ticket/9916
http://unicode.org/cldr/trac/ticket/9915
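
As a rough illustration of the year idea above, here is a minimal Python
sketch. Everything in it is an assumption made for illustration: the class
and field names are hypothetical and are not CLDR's actual schema, the
Tagalog/California record is just the hypothetical case mentioned above, and
the speaker count is a made-up placeholder.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LanguageStatus:
    """Status of a language in a territory over a span of years (hypothetical schema)."""
    language: str                 # ISO 639 code, e.g. "tl" for Tagalog
    territory: str                # territory code, e.g. "US-CA"
    status: str                   # e.g. "official", "recognized", "not recognized"
    since: Optional[int] = None   # first year the status applied; None = unknown
    until: Optional[int] = None   # last year the status applied; None = still applies

    def applies_in(self, year: int) -> bool:
        """Return True if the status was in effect during the given year."""
        started = self.since is None or self.since <= year
        not_ended = self.until is None or year <= self.until
        return started and not_ended


@dataclass(frozen=True)
class PopulationEstimate:
    """Number of speakers together with the year the figure was collected."""
    language: str
    territory: str
    speakers: int
    year: int

    def is_outdated(self, current_year: int, max_age: int = 10) -> bool:
        """Return True if the figure is more than max_age years old."""
        return current_year - self.year > max_age


# Hypothetical record: "official until 2016".
tagalog = LanguageStatus("tl", "US-CA", "official", until=2016)

# Placeholder figure, not real data.
estimate = PopulationEstimate("tl", "US-CA", speakers=1000, year=2001)
```

With records like these, a consumer could ask whether a status applied in a
given year (tagalog.applies_in(2017)) or flag old population figures
(estimate.is_outdated(2016)) rather than relying on undated values.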

On 16 November 2016 at 18:42, Hugh Paterson <hugh_paterson at sil.org> wrote:

> Also, after thinking about this some more: If as is the stated case with San
> Francisco,
> "San Francisco requires documents in 4 languages but provides telephone
> help for 200 languages.  Where's the line?"
>
> How would you propose that Unicode database maintainers de-list
> institutional support for languages when that institutional support
> ceases?
>
> I.e., let's say that San Francisco falls on some hard times financially,
> cannot afford to operate in 4 languages, and reduces its support to two
> languages. How is this to be reflected in this proposal?
>
> - Hugh Paterson III
>
> On Thu, Nov 10, 2016 at 2:54 PM, Mats Blakstad <mats.gbproject at gmail.com>
> wrote:
>
>> I'm continuing the discussion I started on unicode at unicode.org here:
>> http://unicode.org/pipermail/unicode/2016-September/003964.html
>> Sorry for posting to the wrong mailing list!
>>
>> On 10 November 2016 at 20:34, Shawn Steele <Shawn.Steele at microsoft.com>
>> wrote:
>>
>>> I didn't really say anything because this is kinda a hopeless task, but
>>> it seems like some realities are being overlooked.  I'm as curious about
>>> cataloguing everything as the next OCD guy, but a general solution doesn't
>>> seem practical.
>>>
>> Maybe in addition to the number of speakers we could give each language
>> different values for the different territories, like official / unofficial,
>> lingua franca / home language, recognized / not recognized, etc.
>> Maybe we could manage to work out some more objective categories?
>> Then the dataset could cover more needs: those who use it could extract
>> the list they want. For example, they could make a list of only the
>> official languages in the world sorted by country/territory, or maybe a
>> list of all non-recognized languages in different countries.
>>
>>
>>> * There are a *lot* of languages
>>>
>> Yes :) We would not get them all at the start, but if we could start
>> adding data for all the languages, it can be done little by little.
>> Personally, I have many contacts who I think could be interested in
>> helping to add information.
>>
>>
>>> * Many countries have speakers of several languages.
>>>         * In the US it's "obvious" that a list of languages for the US
>>> should include "English"
>>>
>> For sure! The number of speakers and the fact that it is the primary
>> language used speak for it.
>> Besides, isn't "US English" considered a variant of English?
>>
>>
>>>         * Spanish in the US is less obvious, however it is often
>>> considered important.
>>>
>> It is an interesting issue. Wasn't Spanish the primary language in the
>> southern US while it was a part of Mexico?
>> And aren't there a lot of Spanish newspapers/media in the US?
>>
>>
>>>         * However, that's a slippery slope as there are many other
>>> languages with large groups of speakers in the US.  If such a list includes
>>> Spanish, should it not include some of the others?  San Francisco requires
>>> documents in 4 languages but provides telephone help for 200 languages.
>>> Where's the line?
>>> * Some languages happen in many places.  There are a disproportionate #
>>> of Englishes in CLDR, however Chinese is also spoken in lots of the
>>> countries that have English available in CLDR.  Yet CLDR doesn't provide
>>> data for those.
>>>
>> Could you elaborate a little bit on this?
>>
>>
>>> * Some language/region combinations could encounter geopolitical
>>> issues.  Like "it's not legal for that language to be spoken in XX" (but it
>>> happens).  Or "that language isn't YY country's language, it's ours!!!"
>>>
>> We could add the documented number of speakers and tag it as "not recognized".
>>
>>>
>>> * The requirement "where the language has been spoken traditionally" is
>>> really, really subjective.  "Traditionally" the US is an English speaking
>>> country.  However, "Traditionally", there are hundreds of languages that
>>> have been spoken in the US.  What could be more "traditional" than the
>>> native American languages?  Yet those often have low numbers of speakers in
>>> the modern world, many are even dying languages.  There are also a number
>>> of "traditional" languages spoken by the original settlers, which differ
>>> from the set of languages spoken by modern immigrants.  So your data is
>>> going to be very skewed depending on the person collecting the data's
>>> definition of "traditional".
>>>
>> I agree "traditional" is not a good way to collect the data.
>> Native American languages should of course be mapped to territories
>> despite having few speakers. The point is to map all languages.
>> We could also map languages with years: English has then been spoken in
>> what is today the USA since 1607.
>> Urdu has been spoken in what is today Norway since the 1970s.
>>
>>
>>>
>>> Ethnologue has done a decent job of identifying languages and the number
>>> of speakers in various areas, but it would be very difficult to draw a line
>>> that selected "English and Spanish in the US" and was consistent with
>>> similar real-life impacts across the other languages.  Do you pick the top
>>> n languages for each country?  Languages with > x million speakers (that
>>> would be very different in small and big countries).  Languages with > y%
>>> of the speakers in the different countries?
>>>
>>
>> If Ethnologue has done it, I guess it should be possible for CLDR
>> too?
>> However, they operate with a category "Immigrant Languages", and I'm not
>> sure what that means. As an example, Turkish, the second most spoken
>> language of Germany, is marked as "Immigrant Language"; I'm not sure how
>> they make that distinction.
>>
>>
>>>
>>> And then you end up with each application having to figure out its own
>>> bar.  Applications will have different market considerations and other
>>> reasons to target different regions/languages.  That would skew any list
>>> for their purposes.
>>>
>>
>> Okay, at least it could be possible to add the number of speakers for the
>> other "6,300 lesser-known living languages", or why do we cut the list to
>> 675 languages?
>>
>>
>>
>> _______________________________________________
>> CLDR-Users mailing list
>> CLDR-Users at unicode.org
>> http://unicode.org/mailman/listinfo/cldr-users
>>
>>
>