From martin_hosken at sil.org Mon Jul 14 23:47:26 2014
From: martin_hosken at sil.org (Martin Hosken)
Date: Tue, 15 Jul 2014 11:47:26 +0700
Subject: adding all of iso639_3 to either en.xml or root.xml
Message-ID: <20140715114726.69b846a9@sil-mh6>

Dear All,

I notice that en.xml only contains localeDisplayNames/languages/language entries for a subset of iso639-3. Is there a case for filling out the list based on the iso639-3 reference names, so that people don't have to fall back to data that is not in CLDR? Or, given that iso639 has these reference names, is there a case for putting them into root?

I realise it's a bit odd to put what amounts to English names into root.xml. OTOH these are the official reference names and so act as a fallback for all languages, so perhaps it would be appropriate.

I'm happy either way. But I think CLDR would benefit from having the complete reference name mapping of iso639-3 in it.

Yours,
Martin

From srl at icu-project.org Mon Jul 14 23:52:52 2014
From: srl at icu-project.org (Steven R. Loomis)
Date: Mon, 14 Jul 2014 21:52:52 -0700
Subject: adding all of iso639_3 to either en.xml or root.xml
In-Reply-To: <20140715114726.69b846a9@sil-mh6>
References: <20140715114726.69b846a9@sil-mh6>
Message-ID: <3DA66F3F-0A87-4798-A6AA-BC9E1636630A@icu-project.org>

If anything, it should be in en and not root.

I wonder if it could go into seed/en or something.

It's not in en right now because of the translation burden. But I'd think we could set controls via coverage.

en.xml is hand-curated now, so that would be another distinction.

Steven

Sent from our iPhone.
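As a rough illustration of the localeDisplayNames/languages/language entries mentioned in the opening message, the sketch below reads language names from an LDML fragment and reports which codes from a wanted list have no entry. The sample XML is a tiny hand-written fragment (not a copy of en.xml), and the function names `language_names` and `missing_codes` are hypothetical helpers, not part of any CLDR tool.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment in the shape of an LDML locale file (assumption:
# a real en.xml has the same element nesting, with many more entries).
SAMPLE_LDML = """<ldml>
  <localeDisplayNames>
    <languages>
      <language type="aa">Afar</language>
      <language type="en">English</language>
      <language type="en_GB">British English</language>
      <language type="en_GB" alt="short">UK English</language>
    </languages>
  </localeDisplayNames>
</ldml>"""


def language_names(ldml_text):
    """Map each language code to its main display name."""
    root = ET.fromstring(ldml_text)
    names = {}
    for el in root.findall("./localeDisplayNames/languages/language"):
        # Entries carrying an alt attribute are alternate forms; keep only the main one.
        if el.get("alt") is None:
            names[el.get("type")] = el.text
    return names


def missing_codes(names, wanted):
    """Return the wanted codes that have no display name, sorted."""
    return sorted(code for code in wanted if code not in names)


names = language_names(SAMPLE_LDML)
print(missing_codes(names, ["aa", "aaa", "en"]))
```

Run against a real en.xml and a full list of ISO 639-3 codes, the same comparison would give the size of the gap the thread is discussing.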
From mark at macchiato.com Tue Jul 15 01:54:06 2014
From: mark at macchiato.com (Mark Davis ☕)
Date: Tue, 15 Jul 2014 08:54:06 +0200
Subject: adding all of iso639_3 to either en.xml or root.xml
In-Reply-To: <3DA66F3F-0A87-4798-A6AA-BC9E1636630A@icu-project.org>
References: <20140715114726.69b846a9@sil-mh6> <3DA66F3F-0A87-4798-A6AA-BC9E1636630A@icu-project.org>
Message-ID:

I'm not sure it would be worth it. People can always pick up a copy of the language subtag registry and use it to back-fill.

We do keep a copy of the registry in our tooling data directory, and that's what we do in our tooling, such as myCldrFile.getName(language).

Mark

*Il meglio è l'inimico del bene*
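To make the back-fill idea concrete, here is a minimal sketch of reading language names out of a file in the language subtag registry's record format (records separated by `%%` lines, `Field: value` pairs, indented continuation lines). The sample text is abbreviated and partly made up for illustration (the `Comments` field in particular), and `parse_registry` and `language_descriptions` are hypothetical names; this is not how the CLDR tooling itself does it.

```python
# Abbreviated, illustrative sample in the registry's record format.
SAMPLE_REGISTRY = """\
%%
Type: language
Subtag: aa
Description: Afar
Added: 2005-10-16
Comments: made-up note that wraps
  onto a second line
%%
Type: language
Subtag: es
Description: Spanish
Description: Castilian
Added: 2005-10-16
Suppress-Script: Latn
%%
Type: script
Subtag: Latn
Description: Latin
Added: 2005-10-16
"""


def parse_registry(text):
    """Split the text into records, each a list of (field, value) pairs."""
    records, current = [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.strip() == "%%":
            # Record separator: store what has been collected so far.
            if current:
                records.append(current)
            current = []
        elif line[0] in " \t" and current:
            # Indented line: continuation of the previous field's value.
            name, value = current[-1]
            current[-1] = (name, value + " " + line.strip())
        else:
            name, _, value = line.partition(":")
            current.append((name.strip(), value.strip()))
    if current:
        records.append(current)
    return records


def language_descriptions(records):
    """Map each language subtag to its list of Description values."""
    names = {}
    for fields in records:
        if [v for k, v in fields if k == "Type"] != ["language"]:
            continue
        subtag = next(v for k, v in fields if k == "Subtag")
        names[subtag] = [v for k, v in fields if k == "Description"]
    return names


print(language_descriptions(parse_registry(SAMPLE_REGISTRY)))
```

A subtag can carry several Description fields, so a consumer still has to pick one per code; the sketch keeps them all and leaves that choice to the caller.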
From srl at icu-project.org Tue Jul 15 09:33:47 2014
From: srl at icu-project.org (Steven R. Loomis)
Date: Tue, 15 Jul 2014 07:33:47 -0700
Subject: adding all of iso639_3 to either en.xml or root.xml
In-Reply-To:
References: <20140715114726.69b846a9@sil-mh6> <3DA66F3F-0A87-4798-A6AA-BC9E1636630A@icu-project.org>
Message-ID:

Maybe a cldrmodify pass that fills in en could work; then someone could fill out any missing translations before doing other processing. But CLDR wouldn't come that way out of the box. Then this is about LDML (interchange) and not CLDR (common data).

Sent from our iPhone.
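A fill-in pass of the kind suggested here could look roughly like the sketch below: curated names always win, and names taken from the registry are emitted with a draft status so they remain distinguishable from hand-curated data. This is only an assumption about what such a pass might do, not a description of the actual CLDR modification tooling; `backfill` and `to_ldml` are hypothetical names and the input dictionaries are sample data.

```python
from xml.sax.saxutils import escape, quoteattr


def backfill(curated, registry):
    """Return (code, name, source) rows; curated names take precedence."""
    rows = []
    for code in sorted(set(curated) | set(registry)):
        if code in curated:
            rows.append((code, curated[code], "curated"))
        else:
            # Assumption: use the first registry description as the name.
            rows.append((code, registry[code][0], "registry"))
    return rows


def to_ldml(rows):
    """Render rows as LDML <language> elements, one per line."""
    lines = []
    for code, name, source in rows:
        # Back-filled names are flagged as unconfirmed drafts.
        draft = "" if source == "curated" else ' draft="unconfirmed"'
        lines.append(f"<language type={quoteattr(code)}{draft}>{escape(name)}</language>")
    return "\n".join(lines)


curated = {"aa": "Afar"}                         # sample of hand-curated names
registry = {"aa": ["Afar"], "aaa": ["Ghotuo"]}   # sample of registry names
print(to_ldml(backfill(curated, registry)))
```

Keeping the source of each name in the row makes it easy to restrict later steps, such as translation requests, to the curated subset.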
From mark at macchiato.com Tue Jul 15 09:35:27 2014
From: mark at macchiato.com (Mark Davis ☕)
Date: Tue, 15 Jul 2014 16:35:27 +0200
Subject: adding all of iso639_3 to either en.xml or root.xml
In-Reply-To:
References: <20140715114726.69b846a9@sil-mh6> <3DA66F3F-0A87-4798-A6AA-BC9E1636630A@icu-project.org>
Message-ID:

We can talk about this at the meeting, if you want to put it on the menu. However, I had strong doubts about it, because it means adding about 24,000 lines to en.xml, when it is not that hard to process the registry document.

Mark

*Il meglio è l'inimico del bene*
From verdy_p at wanadoo.fr Tue Jul 15 09:39:37 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Tue, 15 Jul 2014 16:39:37 +0200
Subject: adding all of iso639_3 to either en.xml or root.xml
In-Reply-To:
References: <20140715114726.69b846a9@sil-mh6> <3DA66F3F-0A87-4798-A6AA-BC9E1636630A@icu-project.org>
Message-ID:

The problem with the ISO 639 registry is that these names are not really reviewed as the best ones among multiple candidates, even if just for English. These names are partly descriptive and lack references for cases where there are homonyms that need disambiguation. Sometimes names have been chosen only to avoid homonyms, but are very uncommon, and some other language names should have been fixed as well (but were not, keeping their existing ambiguity).
The IANA subtag registry for BCP 47 adds another complexity because it favors maintaining stability and backward compatibility (something that ISO 639 does not care much about). But it is still better for use within CLDR data meant to be used in applications that already have normative links to BCP 47 (notably web standards like HTML, XML and SVG, protocols like HTTP or MIME, and programming languages).

However, the scope of CLDR is not to replace the BCP 47 standard but to **use it** to build a common set of data based on it, according to best interoperable practices, in order to provide the platform for translation/localisation. The IANA registry is already implicitly referenced by the CLDR "root" for the encoding of its selection keys. However, the "root" locale should not contain any real names for languages; it should only render them as their codes.

The CLDR data can, however, import the English names from the IANA subtags, but these names will still need vetting (and the result of this vetting in English should be backported to the IANA registry).

Basically, I am convinced that CLDR vetting for English names should not work the same way as for other data, but should be coordinated with those who maintain the IANA registry and the ISO 639 standard, in a joint committee/working group. Those data should be marked as locked for normal editing, accepting only comments sent via the CLDR forum or bug reports (these comments would be coordinated and not decided by the CLDR TC alone).
This does not mean that more names cannot be proposed for translation (only at the "comprehensive" level), and most of them will remain in "draft" status for a long time, but they will still be usable in applications like the Wikimedia sites. Note that these sites already receive a lot of contributions and comments in their respective editions: 287 languages are open, and more are only in very early draft stages with very few contributors. At least all these 287 languages should have an open data set in CLDR, possibly more if there are other working groups, such as Ubuntu translator groups, or any reasonably active linguistic group such as university linguistics departments (for their researchers) or national libraries that also need and use translations for their bibliographic classification.

The need to open a dataset for some locale in CLDR must be demonstrated by the active desire of a sizeable community to interchange their data and coordinate them. This means that they must agree to provide these data under open and free licences, so that they can be freely exchanged without payment or nominative exclusive licences.

Many of these groups are already contributing these data via Wikimedia sites (because it is faster and simpler than CLDR vetting or the very lengthy IANA and ISO processes). This allows usages to be developed there and rapidly stabilized: early disagreements are quickly resolved, the choices made there spread rapidly to other sites and applications as "common practices", and they are simple to discuss there, with decisions finally being taken by the Wikimedia Language Committee for the localization of MediaWiki itself and then distributed to all wikis.

I am also convinced that for most minority languages, the Wikimedia way of working is more efficient and costs less. It allows collecting many references of use, testing choices, and detecting where there are disagreements that require more investigation.
The site platform is also much better funded and much larger, with excellent performance most of the time. The technical platform of CLDR, including for discussions on this list, is much more modest and very slow; it does not scale enough to attract enough comments and vetters. The IANA platform is almost nonexistent and not funded at all, and the ISO platform is both costly and extremely slow and inefficient.

Let's be pragmatic and use the best tools. Even if you don't like Wikipedia itself for its content (or the tone of its local discussions), that does not mean everything is bad. I personally like this diversity, which permits technical innovations to appear, and very bright things like Wiktionary, which evolves at the same time as the people in the world using the languages we would like to coordinate.

However, the Wikimedia content (or MediaWiki localisation) is published under a licence too restrictive for using its database directly. Instead, small items are decided in isolation and can be coordinated with another, more open database such as CLDR, in small incremental steps, allowing other CLDR users to benefit from the best practices. Things would be easier if Wikimedia took a collective decision to allow a limited part of the MediaWiki localization data to be republished and interchanged under another licence. This would basically consist of the dataset contained in a limited directory of its open source repository, over which the Language Committee has a decision role and which it can use to ensure reasonable quality and work best with other standards.

In my opinion, the Wikimedia Language Committee, the CLDR TC and the ISO 639 WG should have regular contacts to solve their interoperability problems. A few more international entities may participate as well (e.g. Ubuntu translators). They could also meet each other at events about I18N, L10N and translation.
This does not necessarily mean creating a new administrative body, as long as the participants contacting each other are already cooperating and reporting to their own local communities through their regular communication channels.

For now, even the CLDR lacks both the technical resources to scale up and enough contributors (often no more than a handful per language, even for major languages like English, French, Spanish, Portuguese, German, Russian, Chinese, Indonesian, Hindi and Tamil).
From verdy_p at wanadoo.fr Tue Jul 15 09:43:44 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Tue, 15 Jul 2014 16:43:44 +0200
Subject: adding all of iso639_3 to either en.xml or root.xml
In-Reply-To:
References: <20140715114726.69b846a9@sil-mh6> <3DA66F3F-0A87-4798-A6AA-BC9E1636630A@icu-project.org>
Message-ID:

Maybe CLDR could itself prepare the content of the IANA registry as a supplemental file in LDML format (to be used as an optional supplementary fallback file for the English locale, before the root). Or it could convince the IANA database maintainers to publish it in LDML format instead of the legacy format described in the existing RFCs.
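The lookup order being proposed (the user's locale, then English, then an optional supplemental file built from the registry, and finally the bare code as the "root" rendering) can be sketched with a simple layered lookup. This is an application-level illustration only, not CLDR's own inheritance model, in which locales inherit from root rather than from en; `make_lookup` and the sample dictionaries are hypothetical.

```python
from collections import ChainMap


def make_lookup(*layers):
    """Build a display-name function that tries each layer in order."""
    chain = ChainMap(*layers)

    def display_name(code):
        # Last resort: render the code itself, as a "root" fallback would.
        return chain.get(code, code)

    return display_name


french = {"en": "anglais"}               # sample names in the user's locale
english = {"en": "English", "aa": "Afar"}  # sample curated English names
supplemental = {"aaa": "Ghotuo"}         # sample back-fill from the registry

lookup = make_lookup(french, english, supplemental)
for code in ("en", "aa", "aaa", "qaa"):
    print(code, "->", lookup(code))
```

Because the supplemental layer sits below English, adding it never changes a name that is already curated; it only replaces bare codes with something more readable.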
From mimckenna at paypal.com Tue Jul 15 02:06:05 2014
From: mimckenna at paypal.com (Mckenna, Mike)
Date: Tue, 15 Jul 2014 07:06:05 +0000
Subject: adding all of iso639_3 to either en.xml or root.xml
In-Reply-To:
References: <20140715114726.69b846a9@sil-mh6> <3DA66F3F-0A87-4798-A6AA-BC9E1636630A@icu-project.org>,
Message-ID: <57D9C161-3037-44F2-8877-45B083787628@paypal.com>

I know we at PayPal would certainly be fans of getting all of iso639-3 in CLDR. We are currently cobbling lists together in English and then translating to target languages. I would have no problem with having the English names in root since these and the French are the official ISO entries.

We use the lists for pull-downs on postal address entry forms and need to present them in user language for selection, local language for domestic delivery and English for international postal mail.

Thanks,
Mike

Sent from my iPhone
From emmo at us.ibm.com Tue Jul 15 11:16:35 2014
From: emmo at us.ibm.com (John Emmons)
Date: Tue, 15 Jul 2014 11:16:35 -0500
Subject: adding all of iso639_3 to either en.xml or root.xml
In-Reply-To: <57D9C161-3037-44F2-8877-45B083787628@paypal.com>
References: <20140715114726.69b846a9@sil-mh6> <3DA66F3F-0A87-4798-A6AA-BC9E1636630A@icu-project.org>, <57D9C161-3037-44F2-8877-45B083787628@paypal.com>
Message-ID:

Another potential problem here is that en.xml and iso639-3 don't always agree 100% on the names. Maybe in root, but I think it is definitely going to be hard to maintain.

I put it on the agenda for the next TC meeting.

Regards,

John C. Emmons
Globalization Architect & Unicode CLDR TC Chairman
IBM Software Group
Internet: emmo at us.ibm.com
From petercon at microsoft.com Tue Jul 15 11:50:39 2014
From: petercon at microsoft.com (Peter Constable)
Date: Tue, 15 Jul 2014 16:50:39 +0000
Subject: adding all of iso639_3 to either en.xml or root.xml
In-Reply-To:
References: <20140715114726.69b846a9@sil-mh6> <3DA66F3F-0A87-4798-A6AA-BC9E1636630A@icu-project.org>, <57D9C161-3037-44F2-8877-45B083787628@paypal.com>
Message-ID: <1a4ccf7d598b4a18b510c7a3f0fd6106@CY1PR0301MB0698.namprd03.prod.outlook.com>

One thing to note: names in ISO 639 are intended primarily for purposes of identifying the concepts that are encoded, not providing recommended display names. Especially in 639-3, the reference names are not claimed to be English names for the languages; they are merely names that have been used in literature documenting languages. It was simply not feasible to get a review of 7000+ language concepts to determine names that are specifically English or French, or specifically autonyms, or to review for suitability as display names in UI (in whatever language).

If CLDR TC or any of its participants can provide vetted English / French / whatever names for any of the entries in 639-3, then I expect the RA would be open to considering changes. The RA is not constrained to providing only reference names of undetermined language origin; they were only required to provide unique and unambiguous reference names.

Peter
El jul 14, 2014, a las 9:47 PM, Martin Hosken > escribi?: Dear All, I notice that en.xml only contains localeDisplayNames/languages/language entries for a subset of iso639-3. Is there a case for filling out the list based on iso639-3 reference names so that people don't have to fallback to data not in the CLDR? Or, given iso639 has these reference names, is there a case for putting them into the root. I realise it's a bit odd to put what amounts to English names into root.xml. OTOH these are the official reference names and so act as fallback for all languages, so perhaps it would be appropriate. I'm happy either way. But I think CLDR would benefit from having the complete reference name mapping of iso639-3 in it. Yours, Martin _______________________________________________ CLDR-Users mailing list CLDR-Users at unicode.org http://unicode.org/mailman/listinfo/cldr-users _______________________________________________ CLDR-Users mailing list CLDR-Users at unicode.org http://unicode.org/mailman/listinfo/cldr-users _______________________________________________ CLDR-Users mailing list CLDR-Users at unicode.org http://unicode.org/mailman/listinfo/cldr-users_______________________________________________ CLDR-Users mailing list CLDR-Users at unicode.org http://unicode.org/mailman/listinfo/cldr-users -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 105 bytes Desc: image001.gif URL: From verdy_p at wanadoo.fr Tue Jul 15 12:20:06 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 15 Jul 2014 19:20:06 +0200 Subject: adding all of iso639_3 to either en.xml or root.xml In-Reply-To: References: <20140715114726.69b846a9@sil-mh6> <3DA66F3F-0A87-4798-A6AA-BC9E1636630A@icu-project.org> <57D9C161-3037-44F2-8877-45B083787628@paypal.com> Message-ID: ISO639-3 does not matter for our goal. 
What we need is the names in the IANA subtag registry (which does not necessarily agree with ISO 639-3 either). CLDR focuses on "locales", not really "languages" under the ISO 639 definition, so CLDR (like almost all computing and networking protocols and languages) is based on BCP47.

Leave ISO 639 for bibliographic classification only. It is not stable enough for our goals, and it does not evolve with enough backward compatibility, with clear paths for data migration (where possible), or with a way to handle the ambiguities that remain across epochs and evolutions of languages and their so-called "dialects". It is not usable for localisation or for preserving data tagging in archived documents (even many bibliophiles do not like ISO 639, as it requires too much maintenance from them).

Sometimes it's hard to let people know that ISO639 is not important. BCP47 is less known because it has too often been referenced by its evolving RFC numbers. ISO639 is well known for its complete lack of interoperability and its own contradictions and instability; it's best to forget it completely here for the CLDR project (notably because, like BCP47, we will ignore many incoherent parts of ISO639).

-------------- next part -------------- A non-text attachment was scrubbed...
Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL:

From verdy_p at wanadoo.fr Tue Jul 15 12:34:07 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 15 Jul 2014 19:34:07 +0200 Subject: adding all of iso639_3 to either en.xml or root.xml In-Reply-To: <1a4ccf7d598b4a18b510c7a3f0fd6106@CY1PR0301MB0698.namprd03.prod.outlook.com> References: <20140715114726.69b846a9@sil-mh6> <3DA66F3F-0A87-4798-A6AA-BC9E1636630A@icu-project.org> <57D9C161-3037-44F2-8877-45B083787628@paypal.com> <1a4ccf7d598b4a18b510c7a3f0fd6106@CY1PR0301MB0698.namprd03.prod.outlook.com> Message-ID:

Same problem, same solution: use the following fallback sources, in this order:

- local application data where available (as long as it needs to remain differentiated from CLDR, or if applications need some extra private codes or conflicting codes, like "nrm" in Wikipedia used for "Norman" instead of the unrelated language encoded in BCP47 and ISO639-3, both of them still having no code for "Norman" itself)
- local application fallbacks (if needed to handle legacy data for an interim migration period)
- CLDR data for the effective locale, if available
- CLDR data for the fallbacks documented in CLDR
- CLDR "en" or "fr" vetted data
- CLDR "en" or "fr" draft data (optional)
- BCP47 English names (for locales using English as a fallback)
- ISO639 names (if they contradict between ISO 639 parts, use -1, then -2, then -3), if a matching ISO639 code can reasonably be mapped from a locale code (I doubt this will find anything useful after the BCP47 step)
- the locale code itself (possibly with some application-specific surrounding "decoration", like punctuation or rendering styles)

So a CLDR application would just require two supplementary LDML data files (but many CLDR applications won't have them).
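
The fallback order above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the table names, the labels and the toy entries are all hypothetical stand-ins for whatever an application actually loads, and the ISO 639 table is assumed to have been merged in -1, -2, -3 order beforehand.

```python
# Hypothetical toy tables standing in for the real data sources.
APP_DATA = {"nrm": "Norman"}                       # local application data
APP_LEGACY = {}                                    # interim migration fallbacks
CLDR_LOCALE = {"de": "allemand"}                   # CLDR, effective locale ("fr" here)
CLDR_FALLBACK = {}                                 # CLDR, documented fallback locales
CLDR_VETTED = {"de": "German", "fr": "French"}     # CLDR "en" vetted data
CLDR_DRAFT = {}                                    # CLDR "en" draft data (optional)
BCP47_NAMES = {"nrm": "Narom", "es": "Spanish"}    # subtag registry descriptions
ISO639_NAMES = {"nrm": "Narom"}                    # assumed merged as -1, -2, -3

# Order matters: earlier sources win over later ones.
SOURCES = [
    ("application", APP_DATA),
    ("application-legacy", APP_LEGACY),
    ("cldr-locale", CLDR_LOCALE),
    ("cldr-fallback", CLDR_FALLBACK),
    ("cldr-vetted", CLDR_VETTED),
    ("cldr-draft", CLDR_DRAFT),
    ("bcp47", BCP47_NAMES),
    ("iso639", ISO639_NAMES),
]


def display_name(code, sources=SOURCES):
    """Return (name, source label) from the first source that knows `code`."""
    for label, table in sources:
        name = table.get(code)
        if name:
            return name, label
    # Last resort: show the code itself with some decoration.
    return f"[{code}]", "code"


for code in ("nrm", "de", "es", "xyz"):
    print(code, display_name(code))
```

With these toy tables the application override wins for "nrm", the effective locale wins for "de", "es" is only found at the BCP47 step, and an unknown code is shown as the decorated code itself.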
This will save us from vetting CLDR data for many languages for which we have still not seen any need for interoperability between sources using names other than those from BCP47 or ISO 639, or from their own local community and applications.

-------------- next part -------------- A non-text attachment was scrubbed...
Name: image001.gif Type: image/gif Size: 105 bytes Desc: not available URL:

From srloomis at us.ibm.com Tue Jul 15 11:40:51 2014 From: srloomis at us.ibm.com (Steven R Loomis) Date: Tue, 15 Jul 2014 09:40:51 -0700 Subject: adding all of iso639_3 to either en.xml or root.xml In-Reply-To: References: <20140715114726.69b846a9@sil-mh6> <3DA66F3F-0A87-4798-A6AA-BC9E1636630A@icu-project.org>, <57D9C161-3037-44F2-8877-45B083787628@paypal.com> Message-ID:

Right, there are lots of structural and data issues. I filed a ticket to provide a tool (could be just sample code) to generate en.xml from ISO639, but not as part of CLDR's process. It's easy to do (as Mark noted).

http://unicode.org/cldr/trac/ticket/7698

-s

From doug at ewellic.org Tue Jul 15 13:56:01 2014 From: doug at ewellic.org (Doug Ewell) Date: Tue, 15 Jul 2014 11:56:01 -0700 Subject: adding all of iso639_3 to either en.xml or root.xml Message-ID: <20140715115601.665a7a7059d7ee80bb4d670165c8327d.024e652be1.wbe@email03.secureserver.net>

For what it's worth, BCP 47 says explicitly (Section 3.1.5) that "'Description' fields don't necessarily represent the actual native name of the item in the record, nor are any of the descriptions guaranteed to be in any particular language (such as English or French, for example)."

Indeed, there are many Description fields in the Language Subtag Registry, mostly for sign languages but some others as well, that are very clearly not in English.
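
As a rough illustration of what back-filling from the registry involves, here is a minimal Python sketch that collects every Description per language subtag. The sample text is a trimmed, illustrative excerpt in the registry's record-jar layout (records separated by "%%", folded lines starting with whitespace), not the real file, and the function names are made up for this example. A subtag may carry several Description fields, and nothing in the record says which language each one is in.

```python
# Trimmed, illustrative excerpt in the registry's record-jar layout.
SAMPLE_REGISTRY = """\
%%
Type: language
Subtag: es
Description: Spanish
Description: Castilian
%%
Type: language
Subtag: nrm
Description: Narom
%%
Type: script
Subtag: Latn
Description: Latin
"""


def parse_registry(text):
    """Return a list of records; each maps a field name to a list of values."""
    records, record, last_key = [], {}, None
    for line in text.splitlines():
        if line.strip() == "%%":
            # Record separator: store the record collected so far.
            if record:
                records.append(record)
            record, last_key = {}, None
        elif line[:1] in (" ", "\t") and last_key:
            # Folded continuation line: append to the previous value.
            record[last_key][-1] += " " + line.strip()
        elif ":" in line:
            key, _, value = line.partition(":")
            last_key = key.strip()
            record.setdefault(last_key, []).append(value.strip())
    if record:
        records.append(record)
    return records


def language_descriptions(text):
    """Map each language subtag to all of its Description values."""
    names = {}
    for record in parse_registry(text):
        if record.get("Type") == ["language"]:
            names[record["Subtag"][0]] = record.get("Description", [])
    return names


print(language_descriptions(SAMPLE_REGISTRY))
```

Any tool that picks "the" name from this data has to decide which Description to use, and cannot assume that it is English.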
-- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell

From doug at ewellic.org Tue Jul 15 14:50:29 2014 From: doug at ewellic.org (Doug Ewell) Date: Tue, 15 Jul 2014 12:50:29 -0700 Subject: adding all of iso639_3 to either en.xml or root.xml Message-ID: <20140715125029.665a7a7059d7ee80bb4d670165c8327d.db38cf8317.wbe@email03.secureserver.net>

Philippe Verdy wrote:

> Let's be pragmatic and use the best tools. Even if you don't like
> Wikipedia itself for its content (or the tone of its local
> discussions), it does not mean everything is bad, I personally like this
> diversity which permits technical innovations to appear, and very
> bright things like Wiktionary, that evolves at the same time as
> people in the world using the languages we would like to coordinate.

Thank goodness CLDR doesn't apply the Wikipedia model of inventing new "Standard X" code elements that step on the reserved code space of Standard X, ignoring any private-use mechanism built into Standard X, as Wikipedia does with language codes and ISO 639-3 and BCP 47.

The hijacking of 'nrm' by Wikipedia for Norman, described by Philippe earlier, is a perfect example of this. In ISO 639-3 and in BCP 47, 'nrm' is the code element for Narom, spoken in Malaysia.

I'm otherwise a fan of Wikipedia, but this example of "Wikipedia exceptionalism" is just about the worst possible approach for either stability or interoperability.
-- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell

From verdy_p at wanadoo.fr Tue Jul 15 16:27:24 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 15 Jul 2014 23:27:24 +0200 Subject: adding all of iso639_3 to either en.xml or root.xml In-Reply-To: <20140715125029.665a7a7059d7ee80bb4d670165c8327d.db38cf8317.wbe@email03.secureserver.net> References: <20140715125029.665a7a7059d7ee80bb4d670165c8327d.db38cf8317.wbe@email03.secureserver.net> Message-ID:

And this problem is being progressively solved, code by code. It should not have taken so long to fix this for that language, with a private-use code or a code that does not conflict with standard BCP code formats (codes starting with subtags longer than 3 characters are reserved, bu Wikimedia). But this is not just a question of fixing a language code: there's a need to maintain domain names for a while, to fix all the other wikis so that they resolve the new code, and to fix templates and pages in lots of places. And this wiki code has spread further than expected, outside Wikimedia (e.g. in OSM databases, and in other projects translated on translatewiki.net).

As with all major sites that have histories to preserve, this is a slow process; everyone has to manage their own legacy usages. That's why there's now a Language committee to approve new codes, and why admins no longer accept new project codes at the first request without reviewing them and looking for comments.

Wikimedia is not alone: most users of ISO 639 invented their own local usage (including for bibliographic purposes) before ISO 639-3 was published and BCP47 was revised with stricter rules for extensions. For example, "be-x-old" is still used instead of the newer "be-tarask", but at least it conforms to BCP47 and causes few problems; the same goes for "zh-classical", even if "lzh" is preferred, or "de-formal" instead of "de-x-formal", or "simple" instead of "en-x-simple", which cause few problems but are still used in domain names; "zh-yue" is still used as a domain name, but the preferred "yue" code is also recognized as an alias, and both are valid, so the problem is solved.

I'm not pragmatic to the point of proposing to adopt "nrm" as used in Wikimedia and translatewiki.net. But at least Wikimedia admins know the problem and have to solve it progressively with the community. This is the only severe conflict remaining (if it has still not been solved, it's because the language is still not encoded in the standards, and admins don't want to migrate the sites twice).

I've asked them to request an allocation for Norman, but they could not get decisive opinions about its dialects (notably with Jersiais being official as a language in Jersey). Maybe a separate code should be requested for Jersiais itself, even if Norman gets its own code as a macrolanguage encompassing Jersiais, Guernésiais and Continental Norman.

The other problem is that Continental Norman is still considered a variant of French (unlike Picard, which includes its Ch'timi variant in French Flanders and has a close relation with Walloon in Belgium). Linguists have different points of view. But Picard is also considered a variant of French by the same people who think Norman is French and who associate Jersiais directly with French. Maybe "fr" should be considered a macrolanguage too, to encompass its regional or historic variants (including "frc" = "French Cajun" spoken in Louisiana, USA), and "standard modern Parisian French in France" would then have its own new code within that macrolanguage.

From srl at icu-project.org Tue Jul 15 17:28:44 2014 From: srl at icu-project.org (Steven R. Loomis) Date: Tue, 15 Jul 2014 15:28:44 -0700 Subject: adding all of iso639_3 to either en.xml or root.xml In-Reply-To: References: <20140715125029.665a7a7059d7ee80bb4d670165c8327d.db38cf8317.wbe@email03.secureserver.net> Message-ID: <53C5AB1C.3040008@icu-project.org>

Re Wikipedia, ISO639 and BCP47, this bug was just filed: http://unicode.org/cldr/trac/ticket/7699

Would be great to have that mapping table. CC'ing some DBPedia folks and Shervin.

-- IBMer but all opinions are mine.
https://www.ohloh.net/accounts/srl295 // fingerprint @ https://ssl.icu-project.org/trac/wiki/Srl

From roozbeh at unicode.org Tue Jul 15 18:33:05 2014 From: roozbeh at unicode.org (Roozbeh Pournader) Date: Tue, 15 Jul 2014 16:33:05 -0700 Subject: Noto adds CJK, plus new user-facing website Message-ID:

Please excuse the spam, but I think it would be interesting for people here to know that the Noto open source project now supports CJK, which brings it very close to the goal of supporting every major script (and several minor and historical ones).

Here is the CJK announcement: http://googledevelopers.blogspot.com/2014/07/noto-cjk-font-that-is-complete.html

Here is the new user-oriented Noto website: http://www.google.com/get/noto/

The data on the website is from the CLDR project, and the sample images are rendered using HarfBuzz and Pango.

And more will be coming. (Of all the scripts used for CLDR languages, only three have not been released yet.)

From verdy_p at wanadoo.fr Tue Jul 15 20:21:12 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 16 Jul 2014 03:21:12 +0200 Subject: Noto adds CJK, plus new user-facing website In-Reply-To: References: Message-ID:

Thanks for making this known.

The Noto collection is probably the best drop-in replacement for Android smartphones and tablets, and it will be useful to many websites. It will also fit very well with Linux distributions.

Apple could feature the Adobe collection for Mac OS X. Will Microsoft follow with a comparable collection for Windows?

For languages like Burmese and the languages of Africa this is a great announcement. The Tibetan script still lacks complete support (and Divehi as well, even though it is much simpler than Arabic, but really ugly in existing fonts).

Next step: building monospaced variants of these fonts for use in programming languages and coding. Or maybe just integrate a feature in these fonts to support monospaced rendering (using one or several fixed-width cells in a row for each cluster), or to facilitate data input with easier placement of input carets and easier text selection. (The alternative is to use simplified glyphs and simpler joiners for cursive scripts, at least temporarily for the word under focus, or an input tool showing the simplified rendering in a small window that works like a magnifier when hovering over scripts with complex layouts; that tool could also work with IMEs. That alternative would deprecate monospace styles for the many scripts where they are really ugly and not very easy to read fast; glyphs would be rendered with more natural sizes and positioning and more regular stroke weights.)

After that, it will be the turn of a comprehensive font for maths formulas and pictograms for technical diagrams, and a font for pictograms (meteorology, astrology, games, cartographic symbols, arrows, clocks showing time, UI symbols, agendas, musical notations, emojis).

And some others for old historic scripts (Linear A or B, old runic scripts), and experiments with new experimental scripts developed in the last half-century or just since the appearance of personal computers in the early 1980s (which coincides with radical changes in how books, papers and other media showing text are produced, and with radical changes in orthographies for the remaining minority languages). The general public is just starting to rediscover the beauty of the historic scripts and how they could also be useful to complement their native alphabets, which have suffered a lot since the advent of ASCII and early 8-bit charsets in computers everywhere, the early development of Unicode, and incompatible charsets showing unreadable random results or just tofu (even today for modern languages like Burmese, or with "optional" diacritics rendered on the wrong letters in Russian with most commonly installed fonts).

Another for SignWriting, with specific features (if it is possible to design it to work with a stable orthographic convention for the layout; otherwise develop a standard layout UI control, or a simple schema for use in basic HTML or UI, rendering it with a subset of SVG using a set of component glyphs from a common font and a standard mapping).

Let's just hope that OSes will support all these new scripts (Windows has always left users behind if they did not use the latest version, whose linguistic support was frozen at least 2 years before the last release, with few extensions in OS or Office service packs, notably for the OpenType, GDI, 3D API or .Net renderers and in the i18n support APIs).

-------------- next part -------------- An HTML attachment was scrubbed...
URL:

From roozbeh at unicode.org Tue Jul 15 20:45:14 2014
From: roozbeh at unicode.org (Roozbeh Pournader)
Date: Tue, 15 Jul 2014 18:45:14 -0700
Subject: Noto adds CJK, plus new user-facing website
In-Reply-To:
References:
Message-ID:

The Noto Sans Symbols font already supports a lot of the symbol classes you mentioned. Linear B and Runic are also supported by Noto. Same with some of the newer experimental scripts (Osmanya, Deseret, Shavian, etc.)

HarfBuzz has been trying its best to support every character in Unicode as soon as possible. It is included in Android, ChromeOS, and I believe all modern Linux distributions.

And last but not least, both HarfBuzz and Noto would appreciate any help in finding and fixing issues they may have. The bug fix turnaround is usually very quick, especially with HarfBuzz.

From verdy_p at wanadoo.fr Wed Jul 16 06:03:43 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 16 Jul 2014 13:03:43 +0200
Subject: Noto adds CJK, plus new user-facing website
In-Reply-To:
References:
Message-ID:

What does "Noto" mean? Is it an abbreviation of "no (more) tofu"?

2014-07-16 3:45 GMT+02:00 Roozbeh Pournader :

> The Noto Sans Symbols font already supports a lot of the symbol classes
> you mentioned. Linear B and Runic are also supported by Noto. Same with
> some of the newer experimental scripts (Osmanya, Deseret, Shavian, etc.)

From markus.icu at gmail.com Tue Jul 22 11:21:11 2014
From: markus.icu at gmail.com (Markus Scherer)
Date: Tue, 22 Jul 2014 09:21:11 -0700
Subject: Estonian collation: v ≠ w ?
Message-ID:

CLDR Estonian collation treats v and w as the same letter, with only a second-level difference (as between s and ſ [long s]).

http://unicode.org/cldr/trac/ticket/6701 says they should be different. The Estonian-language sources and http://en.wikipedia.org/wiki/Estonian_orthography seem to agree.

W is not a letter of the Estonian alphabet. It is used to write foreign words and place names, like "Würzburg".

This seems to be like in Finnish, where the default collation has changed away from treating v and w as the same letter.

Anyone opposed?

markus

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mark at macchiato.com Tue Jul 22 11:39:06 2014
From: mark at macchiato.com (Mark Davis ☕)
Date: Tue, 22 Jul 2014 18:39:06 +0200
Subject: Re: Estonian collation: v ≠ w ?
In-Reply-To:
References:
Message-ID:

It would be better to get confirmation from an Estonian speaker. Sasan should be able to help with that.

Mark

*Il meglio è l'inimico del bene*

From srl at icu-project.org Wed Jul 23 18:18:20 2014
From: srl at icu-project.org (Steven R. Loomis)
Date: Wed, 23 Jul 2014 16:18:20 -0700
Subject: Parsers for the UnicodeSet notation?
In-Reply-To:
References: <53D035F2.5070101@adobe.com>
Message-ID: <53D042BC.40103@icu-project.org>

On 07/23/2014 03:28 PM, Roozbeh Pournader wrote:
> On Wed, Jul 23, 2014 at 3:23 PM, Eric Muller wrote:
>
> > I would like to work with the exemplarCharacters data in the CLDR.
> > That uses the UnicodeSet notation. Is there somewhere a parser for
> > that notation, that would return me just the list of characters in
> > the set?
>
> Note that it's a set of strings, not characters.
>
> > I suspect that the exemplarCharacters use a restricted form of the
> > UnicodeSet notation (e.g. do not use property values). Is that
> > correct, and if so, what's the subset?
>
> I have an Apache-licensed parser in Python here:
> https://code.google.com/p/noto/source/browse/nototools/generate_website_data.py#180

Nice, you should get those CLDR folks to add a link!

I'm cross-posting this to cldr-users, which may be more appropriate.

Eric, to answer your second question: the TR35 spec does not say that exemplars use a restricted form of UnicodeSet, as per http://unicode.org/repos/cldr/trunk/specs/ldml/tr35-general.html#ExemplarSyntax. In practice a restricted form is used and ranges are expanded, but the spec does not guarantee this.

-s

--
IBMer but all opinions are mine.
https://www.ohloh.net/accounts/srl295 // fingerprint @ https://ssl.icu-project.org/trac/wiki/Srl

From emuller at adobe.com Thu Jul 24 01:51:15 2014
From: emuller at adobe.com (Eric Muller)
Date: Wed, 23 Jul 2014 23:51:15 -0700
Subject: Parsers for the UnicodeSet notation?
In-Reply-To: <53D035F2.5070101@adobe.com>
References: <53D035F2.5070101@adobe.com>
Message-ID: <53D0ACE3.5090304@adobe.com>

Thanks for the answers. I take it from Steve's answer that Roozbeh's parser may work today but may break tomorrow.

A couple of suggestions:

- a full "parser" of UnicodeSet is non-trivial, since it involves having access to property values. That does not seem really necessary for exemplars, so maybe it would be good to restrict the UnicodeSet there.

- alternatively, since the extent of a UnicodeSet can involve property values, it means that the extent can depend on the Unicode version from which those values come.
Which means that there ought to be a Unicode version number in the CLDR data; it would be nice for that number to be present in the data files (I don't see one in he.xml).

> Incidentally, I copy/pasted the punctuation exemplar characters for
> he.xml into the utility, and it reported that the set contains 8,130
> code points, including the ASCII letters. Somehow, that seems
> incorrect. What did I do wrong?

Sorry, I took the UnicodeSet straight out of he/characters.json, without handling the JSON serialization (or rather deserialization) of strings. Taking it straight out of he.xml (where there is no serialization effect) gives a much more reasonable set of twenty strings. XML wins again ;-)

Eric.

From mark at macchiato.com Thu Jul 24 02:10:01 2014
From: mark at macchiato.com (Mark Davis ☕)
Date: Thu, 24 Jul 2014 09:10:01 +0200
Subject: Parsers for the UnicodeSet notation?
In-Reply-To: <53D0ACE3.5090304@adobe.com>
References: <53D035F2.5070101@adobe.com> <53D0ACE3.5090304@adobe.com>
Message-ID:

On Thu, Jul 24, 2014 at 8:51 AM, Eric Muller wrote:

> - a full "parser" of UnicodeSet is non-trivial, since it involves having
> access to property values. That does not seem really necessary for
> exemplars, so maybe it would be good to restrict the UnicodeSet there.
>
> - alternatively, since the extent of a UnicodeSet can involve property
> values, it means that the extent can depend on the Unicode version from
> which those values come. Which means that there ought to be a Unicode
> version number in the CLDR data; it would be nice for that number to be
> present in the data files (I don't see one in he.xml)

Can you file a CLDR ticket on this?

Mark

*Il meglio è l'inimico del bene*

From srl at icu-project.org Thu Jul 24 09:54:14 2014
From: srl at icu-project.org (Steven R. Loomis)
Date: Thu, 24 Jul 2014 07:54:14 -0700
Subject: Parsers for the UnicodeSet notation?
In-Reply-To:
References: <53D035F2.5070101@adobe.com> <53D0ACE3.5090304@adobe.com>
Message-ID: <53D11E16.9030009@icu-project.org>

On 07/24/2014 12:10 AM, Mark Davis ☕ wrote:
> Can you file a CLDR ticket on this?

Sounds like two tickets.

--
IBMer but all opinions are mine.
https://www.ohloh.net/accounts/srl295 // fingerprint @ https://ssl.icu-project.org/trac/wiki/Srl

From emuller at adobe.com Thu Jul 24 10:30:29 2014
From: emuller at adobe.com (Eric Muller)
Date: Thu, 24 Jul 2014 08:30:29 -0700
Subject: Parsers for the UnicodeSet notation?
In-Reply-To: <53D11E16.9030009@icu-project.org>
References: <53D035F2.5070101@adobe.com> <53D0ACE3.5090304@adobe.com> <53D11E16.9030009@icu-project.org>
Message-ID: <53D12695.8040900@adobe.com>

> Sounds like two tickets.

7730, 7731.

Eric.
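
The restricted form discussed in this thread (literal characters, a-z style ranges, {multi-character strings} and backslash escapes, with no property values) can be expanded without any Unicode property data. What follows is only a minimal Python sketch under that assumption. It is neither Roozbeh's parser nor the ICU implementation; the function names are made up for illustration, and it makes two simplifications: any escape other than \uXXXX and \UXXXXXXXX is taken as a literal character, and a '-' is only accepted between two single characters. Anything outside the subset raises ValueError rather than being guessed at.

```python
def _read_escape(body, i):
    """Decode the backslash escape at body[i]; return (char, next_index)."""
    if i + 1 >= len(body):
        raise ValueError("dangling backslash")
    kind = body[i + 1]
    if kind in "pPN":
        raise ValueError("property and name escapes are outside this subset")
    width = {"u": 4, "U": 8}.get(kind)
    if width is None:
        # Simplification (an assumption of this sketch): \- \[ \{ and so on
        # stand for the escaped character itself.
        return kind, i + 2
    digits = body[i + 2:i + 2 + width]
    if len(digits) != width:
        raise ValueError("truncated hex escape")
    return chr(int(digits, 16)), i + 2 + width


def parse_simple_unicode_set(pattern):
    """Expand a restricted UnicodeSet pattern into a sorted list of strings."""
    text = pattern.strip()
    if len(text) < 2 or text[0] != "[" or text[-1] != "]":
        raise ValueError("expected a pattern of the form [...]")
    body = text[1:-1]
    items = set()
    last = None        # most recent single character, a possible range start
    in_range = False   # True once an unescaped '-' has followed `last`
    i = 0
    while i < len(body):
        ch = body[i]
        if ch.isspace():            # unescaped white space is not significant
            i += 1
            continue
        if ch == "{":               # {abc} is one multi-character string
            if in_range:
                raise ValueError("a range cannot end in a string")
            chars = []
            i += 1
            while i < len(body) and body[i] != "}":
                if body[i] == "\\":
                    decoded, i = _read_escape(body, i)
                    chars.append(decoded)
                else:
                    if not body[i].isspace():
                        chars.append(body[i])
                    i += 1
            if i >= len(body):
                raise ValueError("unterminated {...}")
            i += 1                  # skip the closing brace
            items.add("".join(chars))
            last = None
            continue
        if ch == "-":
            if last is None or in_range:
                raise ValueError("misplaced '-'")
            in_range = True
            i += 1
            continue
        if ch in "[]^&$":
            raise ValueError(f"{ch!r} is outside this subset")
        if ch == "\\":
            ch, i = _read_escape(body, i)
        else:
            i += 1
        if in_range:                # close the range last-ch, inclusive
            if ord(ch) < ord(last):
                raise ValueError("range end precedes range start")
            items.update(chr(cp) for cp in range(ord(last), ord(ch) + 1))
            last, in_range = None, False
        else:
            items.add(ch)
            last = ch
    if in_range:
        raise ValueError("range has no end")
    return sorted(items)


if __name__ == "__main__":
    print(parse_simple_unicode_set(r"[a-c {ch} \u05D0 \-]"))
```

Note that the result is a list of strings, not of characters, which matches Roozbeh's remark above. A pattern that uses property values, such as [[:L:]], is rejected instead of being silently misread, which is the failure mode Eric ran into with the JSON copy of he.xml.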

From rick at unicode.org Sat Jul 26 14:47:08 2014
From: rick at unicode.org (Rick McGowan)
Date: Sat, 26 Jul 2014 12:47:08 -0700
Subject: [cldr-dev] CLDR - bug submission turned off
In-Reply-To: <53D3F966.40702@unicode.org>
References: <27609F2B-FE2A-4C4E-9F5F-89D76E99C313@icu-project.org> <53D3D6DE.2050607@unicode.org> <0302438A-2262-4391-BD61-8E31B7AEC691@icu-project.org> <53D3F966.40702@unicode.org>
Message-ID: <53D405BC.6040705@unicode.org>

Just FYI: There has been a wild upturn in spam tickets coming into the CLDR Trac system, so Steven and I have shut down anonymous ticket submissions for a bit until we find a better solution. If you're a registered developer with a Trac account, you can still file new tickets.

Rick

From srl at icu-project.org Wed Jul 30 00:53:25 2014
From: srl at icu-project.org (Steven R. Loomis)
Date: Tue, 29 Jul 2014 22:53:25 -0700
Subject: Proposal: "compat" keywords
Message-ID: <53D88855.9060904@icu-project.org>

New keyword "compat"

Example:

ar-u-co-compat

Meaning: "A compatibility collation tailoring, intended to use a previous version of the collation rules."

In ticket http://unicode.org/cldr/trac/ticket/4207 a major change to Arabic collation is proposed. So that the previous tailoring can still be retrieved, the "compat" keyword is proposed.

Please reply to this list if you have comments or concerns.

The ticket # referenced above will be used to implement this keyword.

--
IBMer but all opinions are mine.
https://www.ohloh.net/accounts/srl295 // fingerprint @ https://ssl.icu-project.org/trac/wiki/Srl

From srl at icu-project.org Wed Jul 30 00:56:05 2014
From: srl at icu-project.org (Steven R. Loomis)
Date: Tue, 29 Jul 2014 22:56:05 -0700
Subject: Proposal: "compat" keywords (BCP47)
In-Reply-To: <53D88855.9060904@icu-project.org>
References: <53D88855.9060904@icu-project.org>
Message-ID: <53D888F5.9000208@icu-project.org>

On 07/29/2014 10:53 PM, Steven R. Loomis wrote:
> New keyword "compat"
>
> Example:
>
> ar-u-co-compat
>
> Meaning: "A compatibility collation tailoring, intended to use a
> previous version of the collation rules."
>
> In ticket http://unicode.org/cldr/trac/ticket/4207 a major change to
> Arabic collation is proposed.
> So that the previous tailoring can be retrieved, the "compat" keyword is
> proposed.
>
> Please reply to this list if you have comments or concerns.
>
> The ticket # referenced above will be used to implement this keyword.

This is a BCP47 keyword.

--
IBMer but all opinions are mine.
https://www.ohloh.net/accounts/srl295 // fingerprint @ https://ssl.icu-project.org/trac/wiki/Srl

From verdy_p at wanadoo.fr Wed Jul 30 01:12:35 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 30 Jul 2014 08:12:35 +0200
Subject: Proposal: "compat" keywords (BCP47)
In-Reply-To: <53D888F5.9000208@icu-project.org>
References: <53D88855.9060904@icu-project.org> <53D888F5.9000208@icu-project.org>
Message-ID:

Compatibility with what? The specifier is ambiguous if it's not followed by another qualifier (and possibly a version number or identifier). Then it would require another registry for those qualifiers.

For me it's simpler to use data from a base URL which is identifiable and versionable, containing in some root element the use of a small identifier used privately within the domain of the URL (in a way similar to XML namespaces, allowing free substitution of the small identifier but keeping its definition tied to its defining URL).

So I would just prefer a form like ar-u-co-xmlns-xyz, where "xmlns" means XML namespace, and "xyz" is resolved as a namespace within the scope of the base URL of the data containing it (possibly with the addition of a namespace definition element in the LDML document, or some similar mechanism for other database formats providing a mapping of the identifier to a more complete definition giving details such as a name, description, version, authors, licences, possibly in standard RDF format...).

From srl at icu-project.org Wed Jul 30 01:44:23 2014
From: srl at icu-project.org (Steven R. Loomis)
Date: Tue, 29 Jul 2014 23:44:23 -0700
Subject: Proposal: "compat" keywords (BCP47)
In-Reply-To:
References: <53D88855.9060904@icu-project.org> <53D888F5.9000208@icu-project.org>
Message-ID: <53D89447.3070901@icu-project.org>

Philippe,

It is indeed ambiguous. But so are other relative terms such as "traditional" or "alternate".

Versioning could be really great. We have not had consensus around a locale-based versioning scheme, though.

It is not the intent of this proposal to supply versioning. The results of a "compat" collator are completely version and implementation dependent.

Thanks,
-s

--
IBMer but all opinions are mine.
https://www.ohloh.net/accounts/srl295 // fingerprint @ https://ssl.icu-project.org/trac/wiki/Srl

From markus.icu at gmail.com Wed Jul 30 10:40:51 2014
From: markus.icu at gmail.com (Markus Scherer)
Date: Wed, 30 Jul 2014 08:40:51 -0700
Subject: Proposal: "compat" keywords (BCP47)
In-Reply-To: <53D888F5.9000208@icu-project.org>
References: <53D88855.9060904@icu-project.org> <53D888F5.9000208@icu-project.org>
Message-ID:

On Tue, Jul 29, 2014 at 10:56 PM, Steven R. Loomis wrote:
> On 07/29/2014 10:53 PM, Steven R. Loomis wrote:
> > New keyword "compat"
> >
> > Example:
> >
> > ar-u-co-compat
> >
> > Meaning: "A compatibility collation tailoring, intended to use a
> > previous version of the collation rules."
>
> This is a BCP47 keyword.

More specifically, as shown in the example, this is a keyword proposed for addition to http://unicode.org/cldr/trac/browser/trunk/common/bcp47/collation.xml as a new type value for

The committee felt that we should make the current Arabic tailoring available for at least this release, and that none of the current keyword values seem appropriate.

Best regards,
markus

From srl at icu-project.org Wed Jul 30 13:34:36 2014
From: srl at icu-project.org (Steven R. Loomis)
Date: Wed, 30 Jul 2014 11:34:36 -0700
Subject: Proposal: "compat" keywords (BCP47)
In-Reply-To:
References: <53D88855.9060904@icu-project.org> <53D888F5.9000208@icu-project.org>
Message-ID: <53D93ABC.5080005@icu-project.org>

On 07/30/2014 08:40 AM, Markus Scherer wrote:
> More specifically, as shown in the example, this is a keyword proposed
> for addition to
> http://unicode.org/cldr/trac/browser/trunk/common/bcp47/collation.xml
> as a new type value for
>
> The committee felt that we should make the current Arabic tailoring
> available for at least this release, and that none of the current
> keyword values seem appropriate.
>
> Best regards,
> markus

Thanks for the context, Markus.

Perhaps it is worth having some wording such as, "this keyword may only have meaning relative to specific versions of specific implementations." I.e., it doesn't reference a specific standard, nor a method such as "pinyin", nor a category of standards such as "phonebook".

Steven
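
For readers unfamiliar with where the proposed value sits in a language tag, here is a small Python sketch that pulls the key/type pairs out of the -u- extension of a tag such as ar-u-co-compat. It is only an illustration under simplifying assumptions: the function name is made up, the tag is assumed to be well formed, any attributes before the first key are skipped, and private-use (-x-) sequences that might themselves contain a "u" subtag are not handled. It says nothing about what a collator does with "compat", which, as noted above, is version and implementation dependent.

```python
def unicode_extension_keywords(tag):
    """Return the key/type pairs of the -u- extension of a language tag.

    Hypothetical helper for illustration; assumes a well-formed tag.
    """
    subtags = tag.lower().replace("_", "-").split("-")
    try:
        start = subtags.index("u") + 1      # first subtag after the 'u' singleton
    except ValueError:
        return {}                           # the tag has no -u- extension
    keywords = {}
    key = None
    for subtag in subtags[start:]:
        if len(subtag) == 1:                # another singleton ends the extension
            break
        if len(subtag) == 2:                # two-character subtags are keys
            key = subtag
            keywords[key] = []
        elif key is not None:               # longer subtags belong to the current key
            keywords[key].append(subtag)
        # longer subtags before the first key are attributes; skipped here
    return {k: "-".join(v) for k, v in keywords.items()}


if __name__ == "__main__":
    print(unicode_extension_keywords("ar-u-co-compat"))
    print(unicode_extension_keywords("de-u-co-phonebk-nu-latn"))
```

In the first call the "co" (collation) key carries the proposed "compat" type; the second call shows an existing type, "phonebk", alongside a numbering-system key for comparison.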