Dataset for all ISO639 code sorted by country/territory?
Shawn.Steele at microsoft.com
Thu Nov 10 13:34:53 CST 2016
I didn't really say anything because this is kinda a hopeless task, but it seems like some realities are being overlooked. I'm as curious about cataloguing everything as the next OCD guy, but a general solution doesn't seem practical.
* There are a *lot* of languages.
* Many countries have speakers of several languages.
* In the US it's "obvious" that a list of languages for the US should include "English".
* Spanish in the US is less obvious, however it is often considered important.
* However, that's a slippery slope as there are many other languages with large groups of speakers in the US. If such a list includes Spanish, should it not include some of the others? San Francisco requires documents in 4 languages but provides telephone help for 200 languages. Where's the line?
* Some languages are spoken in many places. There are a disproportionate number of Englishes in CLDR, however Chinese is also spoken in lots of the countries that have English available in CLDR. Yet CLDR doesn't provide data for those.
* Some language/region combinations could encounter geopolitical issues. Like "it's not legal for that language to be spoken in XX" (but it happens). Or "that language isn't YY country's language, it's ours!!!"
* The requirement "where the language has been spoken traditionally" is really, really subjective. "Traditionally" the US is an English speaking country. However, "traditionally", there are hundreds of languages that have been spoken in the US. What could be more "traditional" than the Native American languages? Yet those often have low numbers of speakers in the modern world, and many are even dying languages. There are also a number of "traditional" languages spoken by the original settlers, which differ from the set of languages spoken by modern immigrants. So your data is going to be very skewed depending on how the person collecting the data defines "traditional".
Ethnologue has done a decent job of identifying languages and the number of speakers in various areas, but it would be very difficult to draw a line that selected "English and Spanish in the US" and was consistent with similar real-life impacts across the other languages. Do you pick the top n languages for each country? Languages with > x million speakers (that would play out very differently in small and big countries)? Languages with > y% of the speakers in the different countries?
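To make the three candidate cutoffs concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the country codes ("BIG", "SMALL"), the language names, and the speaker counts are invented placeholders, not Ethnologue or CLDR data, and "y%" is read here as a share of the summed speaker counts listed for that country, which is only one possible interpretation.

```python
from typing import Dict, List

# Placeholder data: SPEAKERS[country][language] = number of speakers.
# All codes, names, and counts are invented for illustration only.
SPEAKERS: Dict[str, Dict[str, int]] = {
    "BIG": {
        "lang_a": 250_000_000,
        "lang_b": 40_000_000,
        "lang_c": 3_000_000,
        "lang_d": 500_000,
    },
    "SMALL": {"lang_a": 50_000, "lang_e": 400_000, "lang_f": 30_000},
}


def ranked(counts: Dict[str, int]) -> List[str]:
    """Languages ordered from most to fewest speakers."""
    return sorted(counts, key=lambda lang: counts[lang], reverse=True)


def top_n(counts: Dict[str, int], n: int) -> List[str]:
    """Rule 1: keep the n most widely spoken languages."""
    return ranked(counts)[:n]


def min_speakers(counts: Dict[str, int], x: int) -> List[str]:
    """Rule 2: keep languages with more than x speakers."""
    return [lang for lang in ranked(counts) if counts[lang] > x]


def min_share(counts: Dict[str, int], y: float) -> List[str]:
    """Rule 3: keep languages above a fraction y of the listed speaker total.

    Assumption: the denominator is the sum of the listed counts, not the
    country's population (multilingual speakers are counted more than once).
    """
    total = sum(counts.values())
    return [lang for lang in ranked(counts) if counts[lang] / total > y]


if __name__ == "__main__":
    for country, counts in SPEAKERS.items():
        print(country, "top 2:", top_n(counts, 2))
        print(country, "> 1 million:", min_speakers(counts, 1_000_000))
        print(country, "> 10% share:", min_share(counts, 0.10))
```

With these made-up numbers, the absolute cutoff of one million speakers selects nothing at all for the small country, while the top-n and percentage rules each select a different set for both. Which list is "right" depends entirely on the bar you choose.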
And then you end up with each application having to figure out its own bar. Applications will have different market considerations and other reasons to target different regions/languages. That would skew any list for their purposes.