adding all of iso639_3 to either en.xml or root.xml

Philippe Verdy verdy_p at wanadoo.fr
Tue Jul 15 09:39:37 CDT 2014


The problem with the ISO 639 registry is that these names are not really
reviewed to select the best one among multiple candidates, even just for
English. These names are partly descriptive and lack references for cases
where there are homonyms that need disambiguation. Sometimes names have
been chosen only to avoid homonyms, but are very uncommon, while some other
language names should have been fixed as well (but were not, keeping their
existing ambiguity).
The IANA subtag registry for BCP 47 adds another complexity because it
favors maintaining stability and backward compatibility (something that ISO
639 does not care much about). But it is still better for use within CLDR
data meant to be used in applications that already have normative links to
BCP 47 (notably web standards like HTML, XML, SVG, and protocols like HTTP
or MIME, or programming languages).
However, the scope of CLDR is not to replace the BCP 47 standard, but to
**use it** to build a common set of data based on it, according to best
interoperable practices, in order to provide the platform for
translation/localisation.
The IANA registry is already implicitly referenced by the CLDR "root" for
the encoding of its selection keys. However, the "root" locale should not
contain any real names for languages; it should only render them as their
codes.
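As a minimal sketch of that fallback (assuming ICU4J, which builds its
display-name lookups on this CLDR data), a subtag with a name in en.xml gets
that name, while one with no name anywhere comes back as the bare code:

    import com.ibm.icu.util.ULocale;

    public class RootFallbackDemo {
        public static void main(String[] args) {
            // "gsw" has an English name in en.xml, so a real name is returned.
            System.out.println(new ULocale("gsw").getDisplayLanguage(ULocale.ENGLISH));
            // "aab" is an ISO 639-3 code with no entry in en.xml at the time of
            // this thread, so the lookup falls back to the code itself, which is
            // all that "root" should ever provide.
            System.out.println(new ULocale("aab").getDisplayLanguage(ULocale.ENGLISH));
        }
    }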

The CLDR data can however import the English names from the IANA subtag
registry, but these names will still need vetting (and the result of this
vetting in English should be backported to the IANA registry). Basically, I
am convinced that the CLDR vetting for English names should not work the
same way as for other data, but should be coordinated with those who
maintain the IANA registry and the ISO 639 standard, in a joint
committee/working group. Those data should be marked as locked for normal
editing, accepting only comments sent via the CLDR forum or bug reports
(these comments will be coordinated and not decided by the CLDR TC alone).
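As a rough sketch of how such an import could be bootstrapped (this is not
the actual CLDR tooling; it only assumes a local copy of the IANA
language-subtag-registry file, whose records are separated by "%%" lines and
use "Field: value" syntax):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;

    public class SubtagBackfill {
        public static void main(String[] args) throws IOException {
            // Local copy of http://www.iana.org/assignments/language-subtag-registry
            List<String> lines = Files.readAllLines(Paths.get("language-subtag-registry"));
            List<String> record = new ArrayList<>();
            for (String line : lines) {
                if (line.equals("%%")) {   // "%%" separates registry records
                    emit(record);
                    record.clear();
                } else {
                    record.add(line);
                }
            }
            emit(record);                  // the last record has no trailing "%%"
        }

        // Print an LDML-style <language> element if the record has Type: language.
        static void emit(List<String> record) {
            String subtag = null, description = null;
            boolean isLanguage = false;
            for (String line : record) {
                if (line.equals("Type: language")) isLanguage = true;
                else if (line.startsWith("Subtag: ")) subtag = line.substring("Subtag: ".length());
                else if (line.startsWith("Description: ") && description == null)
                    description = line.substring("Description: ".length());
            }
            if (isLanguage && subtag != null && description != null) {
                System.out.printf("<language type=\"%s\">%s</language>%n", subtag, description);
            }
        }
    }

Only the first Description line of each record is kept here; choosing among
multiple Descriptions, or overriding them, is exactly the kind of decision
that would need the joint vetting described above.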

This does not prevent more names from being proposed as translations
(though only at the "comprehensive" coverage level). Most of them will
remain in "draft" status for a long time, but they will still be usable in
applications like Wikimedia sites. Note that these sites already receive a
lot of contributions and comments in their respective editions: 287
languages are open, and more are only at very early stages with very few
contributors, but at least all of these 287 languages should have an open
data set in CLDR; possibly more if there are other working groups, such as
Ubuntu translator groups, or any reasonably active linguistic group such as
university linguistics departments for their researchers, or national
libraries that also need and use translations for their bibliographic
classification.

The need to open a dataset for some locale in CLDR must be demonstrated by
the active desire of a sizeable community to interchange their data and
coordinate them. This means that they must accept to provide these data
under open and free licences that can be freely exchanged, without asking
for payments or nominative exclusive licences. Many of these groups are
already contributing such data via Wikimedia sites (because it is faster
and simpler than going through CLDR vetting or the very lengthy IANA and
ISO processes). This allows usages to be developed there and rapidly
stabilized: early disagreements are quickly resolved, the choices made
there spread rapidly to other sites and applications as "common practices",
and they are simple to discuss there, with decisions finally being taken,
for the localization of MediaWiki itself and then distribution to all
wikis, by the Wikimedia Language committee.

I am also convinced that for most minority languages, the Wikimedia way of
working is more efficient and costs less. It allows collecting many
references of use, testing choices, and detecting where there are
disagreements that require more investigation. The site platform is also
much better funded, and much larger, with excellent performance most of the
time (the technical platform of CLDR, including for discussions on this
list, is much more modest and very slow; it does not scale enough to
attract enough comments and vetters; the IANA platform is almost
nonexistent and not funded at all, and the ISO platform is both costly and
extremely slow/inefficient).

Let's be pragmatic and use the best tools. Even if you don't like Wikipedia
itself for its content (or the tone of its local discussions), that does
not mean everything is bad. I personally like this diversity, which permits
technical innovations to appear, and very bright things like Wiktionary,
which evolves along with the people around the world who use the languages
we would like to coordinate.

However, the Wikimedia content (and the MediaWiki localisation) is
published under a licence too restrictive to use its database directly.
Instead, small items are decided in isolation and can be coordinated with
another, more open database such as CLDR, in small incremental steps,
allowing other CLDR users to benefit from the best practices.

Things would be facilitated if Wikimedia took a collective decision to
allow republication of a limited part of the localization data of MediaWiki
under another licence (this would basically consist of the dataset
contained in a limited directory of its open source repository, over which
the Language committee has a decision role and can use it to ensure a
reasonable quality and work best with other standards).

In my opinion, the Wikimedia Language Committee, the CLDR TC, and the ISO
639 WG should have regular contacts to solve their interoperability
problems. A few more international entities may participate as well (e.g.
Ubuntu translators). They could also meet each other at events about I18N,
L10N and translation. This does not necessarily mean creating a new
administrative body, as long as each participant contacting the others is
already cooperating and reporting to its own local community through their
regular communication channels.

For now, even CLDR lacks the technical resources both to scale up and to
attract enough contributors (often no more than a handful per language,
even for major languages like English, French, Spanish, Portuguese, German,
Russian, Chinese, Indonesian, Hindi and Tamil).


2014-07-15 8:54 GMT+02:00 Mark Davis ☕️ <mark at macchiato.com>:

> I'm not sure it would be worth it. People can always pick up a copy of the
> language subtag registry and use it to back-fill.
>
> We do keep a copy of the registry in our tooling data directory, and
> that's what we do in our tooling, such as myCldrFile.getName(language).
>
>
> Mark <https://google.com/+MarkDavis>
>
>  *— The best is the enemy of the good —*
>
>
> On Tue, Jul 15, 2014 at 6:52 AM, Steven R. Loomis <srl at icu-project.org>
> wrote:
>
>> If anything, it should be in en and not root.
>>
>> Wonder if it could go into seed/en or something.
>>
>> It's not in en right now because of translation burden. But I'd think we
>> could set controls via coverage.
>>
>> en.xml is hand-curated now; that would be another distinction.
>>
>> Steven
>>
>> Sent from our iPhone.
>>
>> On Jul 14, 2014, at 9:47 PM, Martin Hosken <martin_hosken at sil.org>
>> wrote:
>>
>> Dear All,
>>
>> I notice that en.xml only contains localeDisplayNames/languages/language
>> entries for a subset of iso639-3. Is there a case for filling out the list
>> based on iso639-3 reference names so that people don't have to fall back to
>> data not in the CLDR? Or, given iso639 has these reference names, is there
>> a case for putting them into the root. I realise it's a bit odd to put what
>> amounts to English names into root.xml. OTOH these are the official
>> reference names and so act as fallback for all languages, so perhaps it
>> would be appropriate. I'm happy either way. But I think CLDR would benefit
>> from having the complete reference name mapping of iso639-3 in it.
>>
>> Yours,
>> Martin
>> _______________________________________________
>> CLDR-Users mailing list
>> CLDR-Users at unicode.org
>> http://unicode.org/mailman/listinfo/cldr-users
>>