Dataset for all ISO639 code sorted by country/territory?

Doug Ewell doug at ewellic.org
Thu Nov 10 11:56:58 CST 2016


Mats Blakstad wrote:

> For myself I was not actually considering the number of speakers in
> each country, but to map languages with countries/territories where
> the language originated or has been spoken traditionally.

And that is where I think you'll have disagreement on the details.

> So I guess what matters is which language people mostly expect to find
> under the country/territory.

Yep, that's the challenge.

> Would it be possible to extend this dataset to all languages and start
> building an open source data set for language-territory mapping?
> http://www.unicode.org/cldr/charts/latest/supplemental/language_territory_information.html 

That's a good question for the CLDR folks, who have their own mailing
list.

Keep in mind that the CLDR table documents 675 of the world's best-known
languages, counting variants such as three different orthographies of
Uzbek. While anything is possible, extending this to "all languages,"
e.g. the other 6,300 lesser-known living languages, might require a bit
of time and money.

There is also a resource in the "UDHR in Unicode" project that might be
worth investigating, though it too is an imperfect match with what you
seem to be looking for.

--
Doug Ewell | Thornton, CO, US | ewellic.org



