From cameron at lumoslabs.com Thu Jan 1 13:02:39 2015
From: cameron at lumoslabs.com (Cameron Dutro)
Date: Thu, 1 Jan 2015 11:02:39 -0800
Subject: Unicode Regex Question
In-Reply-To:
References:

Message-ID:

Thanks very much Mark for that additional documentation, and thanks Nick for filing the ticket :)

-Cameron

On Wed, Dec 31, 2014 at 11:18 AM, Steven R. Loomis wrote:

> Philippe, Mark:
> Transliterators seem to be in ICU 1.8, so 1999, 15 and almost 16 years ago.
>
> S
>
> Sent from our iPhone.
>
> On Dec 31, 2014, at 2:51 AM, Mark Davis wrote:
>
> > No the way it is written is really a literal $ or a or b or a Greek character.
>
> Philippe, you are once again not listening.
>
> The $ in CLDR transforms is NOT the same as $ in regex. I do know what I'm talking about here: Alan Liu and I designed this (though years ago).
>
> Now, there is a defect in the LDML documentation, in that the $ is not described fully. For that, people can look at the ICU documentation (from which LDML gets the transform syntax):
>
> http://userguide.icu-project.org/transforms/general/rules#TOC-ther
>
> Cameron, would you mind filing a CLDR ticket to update and expand the documentation?
>
> Mark
>
> *Il meglio è l'inimico del bene*
>
> On Wed, Dec 31, 2014 at 11:02 AM, Philippe Verdy wrote:
>
>> No the way it is written is really a literal $ or a or b or a Greek character.
>> And yes you used a notation embedding two character classes within another character class to create a union. However $ (if it means an end of string) cannot be part of that union and cannot even be part of a character class, as it is then not a character itself but a boundary condition.
>>
>> So yes your extension is very confusing (in addition to being incoherent and not general enough to handle various boundary conditions).
>>
>> TL;DR: it was another proposal making a BETTER use of the $ for something else more productive, and about how regexps can be embedded into a special syntax allowing one to define any custom boundary conditions, including ends of strings or other boundaries (and also not limited to properties defined in the UCD). It is a generalisation of the concept, which will be used everywhere Unicode properties are not sufficient, and without necessarily needing the addition of new properties to handle specific locales (for example these boundaries could be used in CLDR data instead of the UCD, or in specific locales not supported by CLDR).
>>
>> 2014-12-31 10:27 GMT+01:00 Mark Davis <mark at macchiato.com>:
>>
>>> On Wed, Dec 31, 2014 at 1:40 AM, Philippe Verdy wrote:
>>>
>>>> Your example with "[[a$b][:script=greek:]]" does not make any sense if that $ means an "end of string" and where it is embedded in a character class itself in another embedding character-class.
>>>
>>> That is incorrect. The way the transform works, any reference to a character position outside the bounds of a string matches $. So what I wrote matches the start or end of a string, or a, or b, or any greek-script character.
>>>
>>> However, if you look at the transform data files, you'll see real cases where $ is used, rather than the artificial one I used.
>>>
>>> As to the rest of your post, tl;dr.
>>>
>>> Mark
>>>
>>> *Il meglio è l'inimico del bene*
>
> _______________________________________________
> CLDR-Users mailing list
> CLDR-Users at unicode.org
> http://unicode.org/mailman/listinfo/cldr-users
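Mark's description of how `$` behaves inside a set in a transform rule (a position outside the string matches `$`) can be illustrated with a small Python sketch. This is not ICU's implementation: `is_greek` and `set_matches` are hypothetical helper names, and the Greek test is only a rough stand-in for `[:script=greek:]` based on Unicode character names.

```python
import unicodedata


def is_greek(ch):
    # Rough approximation of [:script=greek:] (assumption): the real
    # property comes from the UCD Script data, not from character names.
    return unicodedata.name(ch, "").startswith("GREEK")


def set_matches(text, index):
    """Emulate the set [[a$b][:script=greek:]] at position `index`."""
    if index < 0 or index >= len(text):
        # Any position outside the bounds of the string matches $.
        return True
    ch = text[index]
    return ch in "ab" or is_greek(ch)


# Positions -1 and 2 lie outside "xa", so they match through $.
print([set_matches("xa", i) for i in range(-1, 3)])  # [True, False, True, True]
```

The point of the sketch is that `$` is handled as one more member of the set, checked against the position rather than against a character, which is why it can sit next to `a` and `b`.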
From martin_hosken at sil.org Mon Jan 5 00:12:17 2015
From: martin_hosken at sil.org (Martin Hosken)
Date: Mon, 5 Jan 2015 13:12:17 +0700
Subject: listPatterns/listPattern/@type meaning
Message-ID: <20150105131217.6d2db458@sil-mh7>

Dear All,

What is the semantics behind listPatterns/listPattern/@type? I may have missed it in the documentation.

TIA,
Yours,
Martin

From mark at macchiato.com Wed Jan 7 03:48:07 2015
From: mark at macchiato.com (Mark Davis ☕️)
Date: Wed, 7 Jan 2015 10:48:07 +0100
Subject: listPatterns/listPattern/@type meaning
In-Reply-To: <20150105131217.6d2db458@sil-mh7>
References: <20150105131217.6d2db458@sil-mh7>
Message-ID:

Because the spec is split in parts, I often use a search like the following to get me to the right part.

https://google.com/search?q=site%3Aunicode.org%2Freports%2Ftr35%2F+listPattern

That gets you to two points:

http://www.unicode.org/reports/tr35/tr35-general.html#Unit_Sequences
http://www.unicode.org/reports/tr35/tr35-general.html#ListPatterns

Does that help?

This also makes me wonder if we should have a "search box" that does that, maybe below http://www.unicode.org/reports/tr35/tr35.html#Parts in each section. Do you think that would be useful?

Mark

*Il meglio è l'inimico del bene*

On Mon, Jan 5, 2015 at 7:12 AM, Martin Hosken wrote:

> Dear All,
>
> What is the semantics behind listPatterns/listPattern/@type? I may have missed it in the documentation.
>
> TIA,
> Yours,
> Martin
From emmo at us.ibm.com Tue Jan 13 20:19:59 2015
From: emmo at us.ibm.com (John Emmons)
Date: Tue, 13 Jan 2015 20:19:59 -0600
Subject: JSON packaging proposal for CLDR
Message-ID:

Hi everyone,

I have been working on a design proposal for packaging of CLDR's JSON data, for the 27 release and following. Please review and comment:

https://sites.google.com/site/cldr/development/development-process/design-proposals/json-packaging

at your convenience. You should be able to put your comments directly into the document.

I would like to bring this proposal for approval to the CLDR TC at NEXT week's meeting, on 2015-01-21.

Regards,

John C. Emmons
Globalization Architect & Unicode CLDR TC Chairman
IBM Software Group
Internet: emmo at us.ibm.com

From rxaviers at gmail.com Tue Jan 13 20:30:12 2015
From: rxaviers at gmail.com (Rafael Xavier)
Date: Wed, 14 Jan 2015 00:30:12 -0200
Subject: JSON packaging proposal for CLDR
In-Reply-To:
References:
Message-ID:

Hi John,

Please, could you include the size of each package? I'm interested to know how the functionality breakdown will balance the byte weight.

Thanks

On Wed, Jan 14, 2015 at 12:19 AM, John Emmons wrote:

> Hi everyone,
>
> I have been working on a design proposal for packaging of CLDR's JSON data, for the 27 release and following. Please review and comment:
>
> https://sites.google.com/site/cldr/development/development-process/design-proposals/json-packaging
>
> at your convenience. You should be able to put your comments directly into the document.
>
> I would like to bring this proposal for approval to the CLDR TC at NEXT week's meeting, on 2015-01-21.
>
> Regards,
>
> John C. Emmons
> Globalization Architect & Unicode CLDR TC Chairman
> IBM Software Group
> Internet: emmo at us.ibm.com

--
+55 (16) 98138-1582, +1 (415) 568-5854, skype: rxaviers
http://rafael.xavier.blog.br

From rxaviers at gmail.com Tue Jan 13 20:54:32 2015
From: rxaviers at gmail.com (Rafael Xavier)
Date: Wed, 14 Jan 2015 00:54:32 -0200
Subject: JSON packaging proposal for CLDR
In-Reply-To:
References:
Message-ID:

About locale coverage (just to make sure I understood correctly): currently I download one single file, full.zip, in order to download all available coverage for all functionalities. After this change, I'll need to download tier-1, tier-2, modern and all of a given functionality in order to download all available coverage of such functionality. Is that correct?

PS: Initially, I was expecting an incremental coverage: tier-1 ⊂ tier-2 ⊂ modern ⊂ all. If the sets are not incremental, should "all" be renamed "leftover"? :)

On Wed, Jan 14, 2015 at 12:30 AM, Rafael Xavier wrote:

> Hi John,
>
> Please, could you include the size of each package? I'm interested to know how the functionality breakdown will balance the byte weight.
>
> Thanks
>
> On Wed, Jan 14, 2015 at 12:19 AM, John Emmons wrote:
>
>> Hi everyone,
>>
>> I have been working on a design proposal for packaging of CLDR's JSON data, for the 27 release and following. Please review and comment:
>>
>> https://sites.google.com/site/cldr/development/development-process/design-proposals/json-packaging
>>
>> at your convenience. You should be able to put your comments directly into the document.
>>
>> I would like to bring this proposal for approval to the CLDR TC at NEXT week's meeting, on 2015-01-21.
>>
>> Regards,
>>
>> John C. Emmons
>> Globalization Architect & Unicode CLDR TC Chairman
>> IBM Software Group
>> Internet: emmo at us.ibm.com
>
> --
> +55 (16) 98138-1582, +1 (415) 568-5854, skype: rxaviers
> http://rafael.xavier.blog.br

--
+55 (16) 98138-1582, +1 (415) 568-5854, skype: rxaviers
http://rafael.xavier.blog.br

From emmo at us.ibm.com Tue Jan 13 22:04:23 2015
From: emmo at us.ibm.com (John Emmons)
Date: Tue, 13 Jan 2015 22:04:23 -0600
Subject: JSON packaging proposal for CLDR
In-Reply-To:
References:
Message-ID:

Yes, you are correct. I wanted to keep the size of each package to a minimum, so I thought to include in each package only the data that extends beyond the previous one.

Regards,

John C. Emmons
Globalization Architect & Unicode CLDR TC Chairman
IBM Software Group
Internet: emmo at us.ibm.com

From: Rafael Xavier
To: John Emmons/Austin/IBM at IBMUS
Cc: Cldr dev, "cldr-users at unicode.org"
Date: 01/13/2015 08:54 PM
Subject: Re: JSON packaging proposal for CLDR

About locale coverage (just to make sure I understood correctly): currently I download one single file, full.zip, in order to download all available coverage for all functionalities. After this change, I'll need to download tier-1, tier-2, modern and all of a given functionality in order to download all available coverage of such functionality. Is that correct?

PS: Initially, I was expecting an incremental coverage: tier-1 ⊂ tier-2 ⊂ modern ⊂ all. If the sets are not incremental, should "all" be renamed "leftover"? :)

On Wed, Jan 14, 2015 at 12:30 AM, Rafael Xavier wrote:

Hi John,

Please, could you include the size of each package? I'm interested to know how the functionality breakdown will balance the byte weight.
Thanks

On Wed, Jan 14, 2015 at 12:19 AM, John Emmons wrote:

Hi everyone,

I have been working on a design proposal for packaging of CLDR's JSON data, for the 27 release and following. Please review and comment:

https://sites.google.com/site/cldr/development/development-process/design-proposals/json-packaging

at your convenience. You should be able to put your comments directly into the document.

I would like to bring this proposal for approval to the CLDR TC at NEXT week's meeting, on 2015-01-21.

Regards,

John C. Emmons
Globalization Architect & Unicode CLDR TC Chairman
IBM Software Group
Internet: emmo at us.ibm.com

--
+55 (16) 98138-1582, +1 (415) 568-5854, skype: rxaviers
http://rafael.xavier.blog.br

From emmo at us.ibm.com Tue Jan 13 22:08:01 2015
From: emmo at us.ibm.com (John Emmons)
Date: Tue, 13 Jan 2015 22:08:01 -0600
Subject: JSON packaging proposal for CLDR
In-Reply-To:
References:
Message-ID:

Yes, you are correct. The thought was to keep the size of each individual package to a minimum; thus each package just builds on top of its requisites.

Regards,

John C. Emmons
Globalization Architect & Unicode CLDR TC Chairman
IBM Software Group
Internet: emmo at us.ibm.com

From: Rafael Xavier
To: John Emmons/Austin/IBM at IBMUS
Cc: Cldr dev, "cldr-users at unicode.org"
Date: 01/13/2015 08:58 PM
Subject: Re: JSON packaging proposal for CLDR
Sent by: "CLDR-Users"

About locale coverage (just to make sure I understood correctly): currently I download one single file, full.zip, in order to download all available coverage for all functionalities. After this change, I'll need to download tier-1, tier-2, modern and all of a given functionality in order to download all available coverage of such functionality. Is that correct?

PS: Initially, I was expecting an incremental coverage: tier-1 ⊂ tier-2 ⊂ modern ⊂ all. If the sets are not incremental, should "all" be renamed "leftover"? :)

On Wed, Jan 14, 2015 at 12:30 AM, Rafael Xavier wrote:

Hi John,

Please, could you include the size of each package? I'm interested to know how the functionality breakdown will balance the byte weight.

Thanks

On Wed, Jan 14, 2015 at 12:19 AM, John Emmons wrote:

Hi everyone,

I have been working on a design proposal for packaging of CLDR's JSON data, for the 27 release and following. Please review and comment:

https://sites.google.com/site/cldr/development/development-process/design-proposals/json-packaging

at your convenience. You should be able to put your comments directly into the document.

I would like to bring this proposal for approval to the CLDR TC at NEXT week's meeting, on 2015-01-21.

Regards,

John C. Emmons
Globalization Architect & Unicode CLDR TC Chairman
IBM Software Group
Internet: emmo at us.ibm.com

--
+55 (16) 98138-1582, +1 (415) 568-5854, skype: rxaviers
http://rafael.xavier.blog.br

From rxaviers at gmail.com Wed Jan 14 10:04:04 2015
From: rxaviers at gmail.com (Rafael Xavier)
Date: Wed, 14 Jan 2015 14:04:04 -0200
Subject: Bundle Lookup
In-Reply-To:
References:

<548A47BE.5080900@gmail.com>

Message-ID:

Hello everyone,

It's clear to me there are docs improvements coming up. But, given that I'm still digging into it (testing LanguageMatching as suggested for the bundle lookup matcher), I would like to share my findings with you.

It's still a draft, but it has a suggestion that I think would suit bundle lookup matching better than LanguageMatching.

https://docs.google.com/document/d/1qLbuz659VvCVhgyd08KRP0SMuqCvK9bSS3-0W-kMuuw/edit?usp=sharing

On Fri, Dec 12, 2014 at 7:31 PM, Rafael Xavier wrote:

> Looking forward to hearing how that shall work.
>
> Thank you very much so far.
>
> On Fri, Dec 12, 2014 at 6:27 PM, Mark Davis <mark at macchiato.com> wrote:
>
>> Mark
>>
>> *Il meglio è l'inimico del bene*
>>
>> On Fri, Dec 12, 2014 at 7:48 PM, Rafael Xavier wrote:
>>
>>> Mark,
>>>
>>> Given an arbitrary locale ID, the recommended and only process to deduce its respective bundle (reliably) is through Language Matching.
>>>
>>> Is that true?
>>
>> As I said: "That being said, often people don't understand language matching, and so we are in the process of adding more information so that there is a direct mapping between locale IDs that are always considered to be "identical" on a deep level, like en-GB and en-Latn-GB."
>>
>>> Considering all bundles are always present, isn't there any less expensive algorithm that could be recommended?
>>>
>>> Thank you.
>>>
>>> PS: My use case is a little different. I have *n* distributions of my application, each embedded with a different locale. So I don't need the full power of Language Matching with regard to matching an arbitrary list of desired locales against an arbitrary list of available locales. Anyway, I do want my application to look up the right bundle given a locale (e.g., `zh-Hant-TW` when given `zh-TW`).
>>>
>>> On Fri, Dec 12, 2014 at 2:50 PM, Mark Davis <mark at macchiato.com> wrote:
>>>>
>>>> I also want to be clear that there are two closely-related but very different tasks.
>>>>
>>>> 1. *Inherited item lookup.* Given that you have a CLDR resource bundle, with inheritance, where do I go to get inherited items?
>>>>
>>>> That is specified by CLDR by means of the parentLocale + truncation algorithm, plus the alias element. (There are a few cases where we have "Lateral Inheritance" where the specification is in the text of LDML, such as when looking for an alt variant.)
>>>>
>>>> So back to Rafael's original question:
>>>>
>>>>    1. en-Latn-GB and zh-TW are not CLDR bundles, so this doesn't apply to them.
>>>>    2. en-US-u-nu-usd: the u-nu-usd doesn't select within a bundle, but rather customizes a service that uses information in the bundle. The item lookup (used by the currency formatting service) would be en-US => en => root.
>>>>
>>>> 2. *Bundle lookup.* Given a locale ID, where do I get the best matching CLDR bundle?
>>>>
>>>> My application has a set of supported locales, and the user comes in with a set of desired locales. What is the best bundle for that user?
>>>>
>>>> Here we are not as clear as we should be. The recommended process is in http://www.unicode.org/reports/tr35/#LanguageMatching
>>>>
>>>> So back to Rafael's original question:
>>>>
>>>>    1. en-Latn-GB and zh-TW. When these are looked up with Language Matching, assuming that all the CLDR locales are available, they would return, respectively, en-GB and zh-Hant-TW.
>>>>
>>>> That being said, often people don't understand language matching, and so we are in the process of adding more information so that there is a direct mapping between locale IDs that are always considered to be "identical" on a deep level, like en-GB and en-Latn-GB.
>>>>
>>>> Mark
>>>>
>>>> *Il meglio è l'inimico del bene*
>>>>
>>>> On Fri, Dec 12, 2014 at 5:04 PM, John Emmons wrote:
>>>>
>>>>> Yes, Edwin, there is a very good reason we don't want zh-Hant to inherit from zh. Simply put, in situations where you have locale resources that aren't 100% populated, allowing zh-Hant to inherit from zh produces a mixture of simplified and traditional Chinese, which is acceptable to no one. This is what we call "cross script inheritance" in CLDR. While it might be acceptable to some in the case of Chinese, it is certainly a bigger problem in languages like Serbian, where you have both Latin and Cyrillic scripts in use, and you certainly don't ever want a mixture of Latin and Cyrillic scripts.
>>>>>
>>>>> These relationships are documented in CLDR's supplemental data, where you have specified:
>>>>>
>>>>>
>>>>> Regards,
>>>>>
>>>>> John C. Emmons
>>>>> Globalization Architect & Unicode CLDR TC Chairman
>>>>> IBM Software Group
>>>>> Internet: emmo at us.ibm.com
>>>>>
>>>>> From: Edwin Hoogerbeets
>>>>> To: John Emmons/Austin/IBM at IBMUS, Rafael Xavier
>>>>> Cc: Jörn Zaefferer, "cldr-users at unicode.org"
>>>>> Date: 12/11/2014 07:41 PM
>>>>> Subject: Re: Bundle Lookup
>>>>> ------------------------------
>>>>>
>>>>> Rafael, also take a look at common/supplemental/likelySubtags.xml. If the caller has passed you an incompletely specified locale, you can use those mappings to see if you can get to a locale for which you do have a string bundle. I think that is the source for the "language aliases" to which John was referring.
>>>>>
>>>>> John, for the last part of your example zh-TW inheritance chain, wouldn't you just truncate "zh-Hant" again to "zh" like in the en-GB example before inheriting from the root? If not, what is the reasoning there? Is there already a document that specifies the inheritance rules in CLDR?
>>>>>
>>>>> For efficiency, I can imagine you would put the common translations in "zh" where there is no difference between traditional and simplified, and other translations in "zh-Hant" or "zh-Hans" where there is. That would save some disk space and you could leverage linguistic bug fixes at the "zh" level. For other locales like "sr-Latn" and "sr-Cyrl" there would be nothing in common, so the string bundle at the "sr" level would be essentially empty, but it should still appear in the inheritance chain just in case.
>>>>>
>>>>> Edwin
>>>>>
>>>>> On 12/11/2014 02:53 PM, John Emmons wrote:
>>>>>
>>>>> #3 is currently a problem, which we are working on. Basically, "Latn" needs to be stripped out because it isn't necessary. Then follow the normal inheritance:
>>>>>
>>>>> en-GB: en-GB → (parentLocale) en-001 → (truncation) en → root
>>>>>
>>>>> #4 - Any Unicode locale extensions are meant to identify particular behaviors that are desired in the context of a given locale. Think of them like "options". They are not meant to be used in the context of bundle lookups.
>>>>>
>>>>> #5 - zh_TW - Now that proper language aliases are in place (see http://unicode.org/cldr/trac/ticket/5949):
>>>>>
>>>>> zh-TW: zh-TW → (languageAlias) zh-Hant-TW → (truncation) zh-Hant → (parentLocale) root
>>>>>
>>>>> Regards,
>>>>>
>>>>> John C. Emmons
>>>>> Globalization Architect & Unicode CLDR TC Chairman
>>>>> IBM Software Group
>>>>> Internet: emmo at us.ibm.com
>>>>>
>>>>> From: Rafael Xavier
>>>>> To: "cldr-users at unicode.org"
>>>>> Cc: Jörn Zaefferer
>>>>> Date: 12/11/2014 01:02 PM
>>>>> Subject: Bundle Lookup
>>>>> Sent by: "CLDR-Users"
>>>>> ------------------------------
>>>>>
>>>>> Friends,
>>>>>
>>>>> This is a very basic question. See below. There is lots of documentation about locale inheritance and matching, but it fails in some cases for me.
>>>>>
>>>>> *Given a locale, what's the procedure to find the **bundle** lookup chain?*
>>>>>
>>>>> 1. en-US: en-US → (truncation) en → root
>>>>>
>>>>> This one is dead simple. No problem.
>>>>>
>>>>> 2. en-GB: en-GB → (parentLocale) en-001 → (truncation) en → root
>>>>>
>>>>> This one is also dead simple. However, the documentation says en-GB → en. Is it outdated or am I doing something wrong?
>>>>>
>>>>> Anyway, the ones I'm interested in knowing are:
>>>>>
>>>>> 3. en-Latn-GB
>>>>> 4. en-US-u-nu-usd
>>>>> 5. zh-TW
>>>>>
>>>>> Please, could someone show me what's the chain of these locales (and obviously explain the steps)?
>>>>>
>>>>> Thanks!
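The chains John gives above (languageAlias first, then parentLocale where one is defined, otherwise truncation, with `-u-` extensions ignored) can be sketched in Python roughly as follows. This is only an illustration under assumptions: `PARENT_LOCALES`, `LANGUAGE_ALIASES` and `lookup_chain` are hypothetical names, the two tables are tiny hand-copied excerpts based on the examples in this thread rather than the real supplemental data, and stripping a redundant script (the en-Latn-GB case) is deliberately left out, since John describes it as still being worked on.

```python
# Illustrative excerpts only (assumption): the real values live in
# CLDR's supplemental data and are far larger.
PARENT_LOCALES = {"en-GB": "en-001", "zh-Hant": "root"}
LANGUAGE_ALIASES = {"zh-TW": "zh-Hant-TW"}


def lookup_chain(locale):
    """Return the inheritance chain for a locale ID, ending at root."""
    # Unicode locale extensions act like options, not bundle selectors.
    locale = locale.split("-u-")[0]
    # Replace aliased IDs before walking the chain.
    locale = LANGUAGE_ALIASES.get(locale, locale)

    chain = []
    while locale != "root":
        chain.append(locale)
        if locale in PARENT_LOCALES:
            # An explicit parentLocale overrides plain truncation.
            locale = PARENT_LOCALES[locale]
        elif "-" in locale:
            # Truncation: drop the last subtag.
            locale = locale.rsplit("-", 1)[0]
        else:
            locale = "root"
    chain.append("root")
    return chain


for tag in ("en-US", "en-GB", "en-US-u-nu-usd", "zh-TW"):
    print(tag, "->", lookup_chain(tag))
```

With these tables, zh-Hant goes straight to root instead of truncating to zh, which is the behavior John defends as avoiding cross-script inheritance.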
>>>>>
>>>>> --
>>>>> +55 (16) 98138-1582, +1 (415) 568-5854, skype: rxaviers
>>>>> http://rafael.xavier.blog.br

--
+55 (16) 98138-1582, +1 (415) 568-5854, skype: rxaviers
http://rafael.xavier.blog.br

From mimckenna at paypal.com Wed Jan 14 12:06:36 2015
From: mimckenna at paypal.com (Mckenna, Mike)
Date: Wed, 14 Jan 2015 18:06:36 +0000
Subject: JSON packaging proposal for CLDR
In-Reply-To:
References:

Message-ID:

This will change the way we process CLDR, since today we are just pulling the full file in for each locale before processing in house. But smaller is better. This proposal will require that we put the sparse inheritance logic into our process on this side. One advantage of your proposal is that it will make it much easier for us to determine the diffs between locales and between releases.

Thanks,

Mike McKenna
Sr Manager of Internationalization Technology
+1-408-967-3631 (desk), +1-510-332-7820 (mobile)
PayPal
2211 N. First Street, San Jose CA 95131
Ask-i18n at paypal.com

From: John Emmons
Date: Tuesday, January 13, 2015 at 8:08 PM
To: Rafael Xavier
Cc: Cldr dev, "cldr-users at unicode.org"
Subject: Re: JSON packaging proposal for CLDR

Yes, you are correct. The thought was to keep the size of each individual package to a minimum; thus each package just builds on top of its requisites.

Regards,

John C. Emmons
Globalization Architect & Unicode CLDR TC Chairman
IBM Software Group
Internet: emmo at us.ibm.com

From: Rafael Xavier
To: John Emmons/Austin/IBM at IBMUS
Cc: Cldr dev, "cldr-users at unicode.org"
Date: 01/13/2015 08:58 PM
Subject: Re: JSON packaging proposal for CLDR
Sent by: "CLDR-Users"
________________________________

About locale coverage (just to make sure I understood correctly): currently I download one single file, full.zip, in order to download all available coverage for all functionalities. After this change, I'll need to download tier-1, tier-2, modern and all of a given functionality in order to download all available coverage of such functionality. Is that correct?

PS: Initially, I was expecting an incremental coverage: tier-1 ⊂ tier-2 ⊂ modern ⊂ all.
If the sets are not incremental, should "all" be renamed "leftover"? :)

On Wed, Jan 14, 2015 at 12:30 AM, Rafael Xavier wrote:

Hi John,

Please, could you include the size of each package? I'm interested to know how the functionality breakdown will balance the byte weight.

Thanks

On Wed, Jan 14, 2015 at 12:19 AM, John Emmons wrote:

Hi everyone,

I have been working on a design proposal for packaging of CLDR's JSON data, for the 27 release and following. Please review and comment:

https://sites.google.com/site/cldr/development/development-process/design-proposals/json-packaging

at your convenience. You should be able to put your comments directly into the document.

I would like to bring this proposal for approval to the CLDR TC at NEXT week's meeting, on 2015-01-21.

Regards,

John C. Emmons
Globalization Architect & Unicode CLDR TC Chairman
IBM Software Group
Internet: emmo at us.ibm.com

--
+55 (16) 98138-1582, +1 (415) 568-5854, skype: rxaviers
http://rafael.xavier.blog.br

From berenger at enselme.com Wed Jan 14 13:43:17 2015
From: berenger at enselme.com (Bérenger Enselme)
Date: Wed, 14 Jan 2015 14:43:17 -0500
Subject: Add Likely Subtags first step
Message-ID:

Hello,

In http://www.unicode.org/reports/tr35/#Likely_Subtags the first step is described as canonicalization.
The 3rd substep says to return a tag as is if it is in the from the supplemental data. As far as I can tell this never happens since such tags have already been replaced in the 2nd substep. Thinking about it more, I don't think any of the grandfathered tags would actually make it to the second substep since they wouldn't pass the first substep. Thanks for any help on this, B?ranger -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Fri Jan 23 10:35:42 2015 From: doug at ewellic.org (Doug Ewell) Date: Fri, 23 Jan 2015 09:35:42 -0700 Subject: Add Likely Subtags first step Message-ID: <20150123093542.665a7a7059d7ee80bb4d670165c8327d.ddddc3c6bc.wbe@email03.secureserver.net> B?renger Enselme wrote: > In http://www.unicode.org/reports/tr35/#Likely_Subtags the first step > is described as canonicalization. > > The 3rd substep says to return a tag as is if it is in the id="$grandfathered" type="choice"> from the supplemental data. > > As far as I can tell this never happens since such tags have already > been replaced in the 2nd substep. > > Thinking about it more, I don't think any of the grandfathered tags > would actually make it to the second substep since they wouldn't pass > the first substep. Not all grandfathered tags have a Preferred-Value. Canonicalization leaves such tags unchanged. Examples include "cel-gaulish", "en-GB-oed", and "i-mingo". -- Doug Ewell | Thornton, CO, USA | http://ewellic.org From kent.karlsson14 at telia.com Fri Jan 23 12:50:00 2015 From: kent.karlsson14 at telia.com (Kent Karlsson) Date: Fri, 23 Jan 2015 19:50:00 +0100 Subject: Add Likely Subtags first step In-Reply-To: <20150123093542.665a7a7059d7ee80bb4d670165c8327d.ddddc3c6bc.wbe@email03.secureserver.net> Message-ID: Den 2015-01-23 17:35, skrev "Doug Ewell" (on the CLDR-users list): > Not all grandfathered tags have a Preferred-Value. Canonicalization leaves > such tags unchanged. 
> Examples include "cel-gaulish", "en-GB-oed", and
> "i-mingo".

Speaking of grandfathered tags...

%%
Type: grandfathered
Tag: cel-gaulish
Description: Gaulish
Added: 2001-05-25

According to Wikipedia, the Gaulish languages are (or rather were):

xtg - Transalpine Gaulish
xcg - Cisalpine Gaulish
xlp - Lepontic
xga - Galatian

So I think cel-gaulish should be deprecated, with a comment like:
"Comment: See: xtg for Transalpine Gaulish, xcg for Cisalpine Gaulish, xlp for Lepontic, xga for Galatian."

---------------------------------------

%%
Type: grandfathered
Tag: zh-min
Description: Min, Fuzhou, Hokkien, Amoy, or Taiwanese
Added: 1999-12-18
Deprecated: 2009-07-29

According to Ethnologue (https://www.ethnologue.com/language/mnp):
"The Chinese now divide Chinese Min into 5 major varieties: Min Nan [nan], Min Bei [mnp], Min Dong [cdo], Min Zhong [czo], and Pu-Xian [cpx]. Others say there are at least 9 varieties which are inherently mutually unintelligible."

So I think a comment like:
"Comment: See: nan for Min Nan, mnp for Min Bei, cdo for Min Dong, czo for Min Zhong, cpx for Pu-Xian."
would be appropriate for this registry entry, while deprecating this tag.

---------------------------------------

%%
Type: grandfathered
Tag: i-mingo
Description: Mingo
Added: 1997-09-19

According to Wikipedia, Mingo is a dialect of Seneca (language subtag: 'see').

So I think this one should be deprecated, with a reference to 'see' for Seneca, or possibly to (Preferred-Value) 'see-mingo' where 'mingo' is a new variant subtag. At the very least there should be a reference to Seneca, even if not deprecating this tag.

/Kent Karlsson
From verdy_p at wanadoo.fr Fri Jan 23 13:00:37 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 23 Jan 2015 20:00:37 +0100
Subject: Add Likely Subtags first step
In-Reply-To: <20150123093542.665a7a7059d7ee80bb4d670165c8327d.ddddc3c6bc.wbe@email03.secureserver.net>
References: <20150123093542.665a7a7059d7ee80bb4d670165c8327d.ddddc3c6bc.wbe@email03.secureserver.net>
Message-ID:

For "cel-gaulish", it should have been encoded in ISO 639-3; however, this is more probably a family of dialects, and the language is mostly reconstructed, as it was not always written in the early stages (when it was written, it was under the influence of the Roman Empire and mixed with Latin, Greek, other regional languages, or the languages of other past invaders, and it is difficult to determine whether this was an effective vernacular dialect or just the language of some rulers or merchants). But even before the Roman invasion (and the brutal massacre by the armies of Consul Julius Caesar before he became emperor, a massacre highly criticized even in the Roman Senate), there was this influence, and Gaulish people were already present in many places in Europe, including Rome.

The grandfathered "oed" variant for "en-GB" is encodable as a standard variant. I wonder why it was not done; but it can be kept as is.

2015-01-23 17:35 GMT+01:00 Doug Ewell:

> Bérenger Enselme wrote:
>
> > In http://www.unicode.org/reports/tr35/#Likely_Subtags the first step
> > is described as canonicalization.
> >
> > The 3rd substep says to return a tag as is if it is in the
> > <variable id="$grandfathered" type="choice"> from the supplemental data.
> >
> > As far as I can tell this never happens since such tags have already
> > been replaced in the 2nd substep.
> >
> > Thinking about it more, I don't think any of the grandfathered tags
> > would actually make it to the second substep since they wouldn't pass
> > the first substep.
>
> Not all grandfathered tags have a Preferred-Value. Canonicalization
> leaves such tags unchanged.
> Examples include "cel-gaulish", "en-GB-oed",
> and "i-mingo".
>
> --
> Doug Ewell | Thornton, CO, USA | http://ewellic.org

From doug at ewellic.org Fri Jan 23 13:30:43 2015
From: doug at ewellic.org (Doug Ewell)
Date: Fri, 23 Jan 2015 12:30:43 -0700
Subject: Add Likely Subtags first step
Message-ID: <20150123123043.665a7a7059d7ee80bb4d670165c8327d.cf60f88d11.wbe@email03.secureserver.net>

Philippe Verdy wrote:

> The grandfathered "oed" variant for "en-GB" is encodable as a standard
> variant.

Not unless you squint (or drink) hard enough that "oed" looks like at least five letters, the minimum for a well-formed variant that starts with a letter.

"oxford" or similar would be syntactically allowable, but "oed" was chosen to show clearly that the variant applies to the spelling used in the dictionary, not usage in the city of Oxford.

> I wonder why it was not done;

Probably because little would be gained from doing so. The variant would make no sense with other languages, and parsers would still have to recognize the older form.

--
Doug Ewell | Thornton, CO, USA | http://ewellic.org

From verdy_p at wanadoo.fr Sat Jan 24 12:47:24 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sat, 24 Jan 2015 19:47:24 +0100
Subject: Add Likely Subtags first step
In-Reply-To: <20150123123043.665a7a7059d7ee80bb4d670165c8327d.cf60f88d11.wbe@email03.secureserver.net>
References: <20150123123043.665a7a7059d7ee80bb4d670165c8327d.cf60f88d11.wbe@email03.secureserver.net>
Message-ID:

I said "encodable" but **with a standard subtag** (i.e. effectively with letters). It would make no sense to say "encode oed" given that it is **already** encoded (but as a grandfathered tag, which by itself remains part of the standard but is not decomposable into subtags).
The main reason is that there is no real benefit in doing it if the standard is followed exactly: grandfathered **tags** (not subtags) are also part of the standard. They remain as they are, even if there is no replacement. It would still be useful for avoiding complications in fallback resolution, because here, as part of a whole tag only, it is normally not decomposable (though fallback resolvers will likely decompose it anyway, to fall back from "en-GB-oed" to "en-GB" and then "en"). The standard still says nothing about such fallback mechanisms for grandfathered tags, even if what to do here seems evident.

Note also that standard variant subtags are directly linked to the specific parent subtags in which they are valid. It would be enough to accept "oed" as a variant subtag, but with grandfathered status, valid only for the "en-GB" combination or maybe also "en-Latn-GB" (which is most probably what it refers to: the Latin script only, with the "most likely" script assignment inferred from the Oxford Dictionary, which only uses that script)...

But for now it is still impossible to define the correct replacement tag, unless en-Latn-GB-oed is also accepted and the IANA database contains not only suggested "replacements" but also a few minimum fallbacks to standard tags (decomposable into subtags) for grandfathered tags. This is not a heresy: after all, the "likely" properties have also been added to the IANA registry, just as replacement properties have been added (initially only for deprecated language subtags like "jw" or "iw").

For ambiguous tags that currently have no clear replacement, the possible candidate fallbacks could also be listed (e.g. for i-mingo) in no preferred order (applications are free to choose one or the other according to their own criteria or needs). This would also apply to a few old language tags (which were initially encoded as isolated tags, from the ISO 639-1 and -2 language codes, but were later considered to be language families).
The problem is that the list of encoded languages (including macrolanguages) that are mapped to a family is still not defined (unlike the standardized lists of isolated languages that are mapped to a macrolanguage).

For i-mingo, it is very difficult to see a correct mapping by defining a "min" language code as a macrolanguage; it would be just a family. But the list of isolated languages or macrolanguages that are encoded with standard language subtags is known, and their mapping to the "Min" family seems clear. Fallbacks could still work: after all, if we can fall back from Mandarin Chinese to English, we can just as well fall back from one Min language to another before trying Mandarin (cmn, or just zh, because Mandarin/cmn is the most likely for Chinese/zh) or Cantonese (yue).

But is there really a lot of data using these grandfathered codes? Users of these databases are just instructed that their data is ambiguous and that they should be more precise. The same could be said about Quechua, which is hardly a true macrolanguage but more likely a family: it maps to a likely language only when Quechua is made more precise with a country subtag such as Peru, Colombia or Bolivia. (It would be more difficult for the minorities remaining in Mexico, where their Quechua varieties have been heavily creolized with each other or with the Spanish lingua franca.)

2015-01-23 20:30 GMT+01:00 Doug Ewell:

> Philippe Verdy wrote:
>
> > The grandfathered "oed" variant for "en-GB" is encodable as a standard
> > variant.
>
> Not unless you squint (or drink) hard enough that "oed" looks like at
> least five letters, the minimum for a well-formed variant that starts
> with a letter.
>
> "oxford" or similar would be syntactically allowable, but "oed" was
> chosen to show clearly that the variant applies to the spelling used in
> the dictionary, not usage in the city of Oxford.
>
> > I wonder why it was not done;
>
> Probably because little would be gained from doing so.
> The variant would
> make no sense with other languages, and parsers would still have to
> recognize the older form.
>
> --
> Doug Ewell | Thornton, CO, USA | http://ewellic.org
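
For readers following the exchange about whether "oed" could be a standard variant, the length rule Doug Ewell refers to can be sketched as below. This is only an illustration, not code from anyone in the thread: the function name is hypothetical, and the pattern assumes the variant syntax of RFC 5646 (5 to 8 alphanumerics, or exactly 4 characters when the first is a digit).

```python
import re

# Assumed variant subtag syntax (RFC 5646 ABNF):
#   variant = 5*8alphanum / (DIGIT 3alphanum)
VARIANT_RE = re.compile(r"(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3})")


def is_well_formed_variant(subtag: str) -> bool:
    """Hypothetical helper: check only the syntax of a single variant subtag.

    It does not consult the IANA registry, so a True result means
    "syntactically allowable", not "registered".
    """
    return VARIANT_RE.fullmatch(subtag) is not None


if __name__ == "__main__":
    for candidate in ("oed", "oxford", "1996"):
        print(candidate, is_well_formed_variant(candidate))
```

Under these assumptions "oed" fails the check because it has only three letters, while "oxford" passes, which matches the point made in the thread. Whether a syntactically valid variant is actually registered is a separate question that this sketch does not address.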