From rxaviers at gmail.com Mon Dec 1 06:52:53 2014
From: rxaviers at gmail.com (Rafael Xavier)
Date: Mon, 1 Dec 2014 10:52:53 -0200
Subject: Formatting currencies
In-Reply-To:
References: <81E51250-6C5E-40CC-BBC4-C8A6D8843DD1@icu-project.org>
Message-ID:

One more question.

3.2 Special Pattern Characters

¤ (U+00A4) | Prefix or suffix | No | Currency sign, replaced by currency symbol. If doubled, replaced by international currency symbol. If tripled, uses the long form of the decimal symbol. If present in a pattern, the monetary decimal separator and grouping separators (if available) are used instead of the numeric ones.

The doubled and tripled forms are simply not mentioned in the 4 Currencies section. There is also nothing defining what the "long form" is. Instead, there is displayName, and there is an extensive algorithm explaining how displayNames should be implemented. It is also important to note that unitPattern potentially varies depending on the plural count form. The tripled form doesn't support such variation. Therefore, they are not equivalent.

Are the doubled and tripled forms defined somewhere else? Have they been deprecated and the above table not updated?

On Sat, Nov 29, 2014 at 7:03 PM, Rafael Xavier wrote:
> Thank you very much so far Steven.
>
> On Friday, November 28, 2014, Steven R. Loomis wrote:
>
>> Too much turkey I guess. Sorry, I was responding for "normal" currency format, not plural name.
>>
>> For currency name format it does look like it should be better specified. I'd expect "3 dollars" not "3.00 dollars". Anyways, I'll check on this next week.
>>
>> Sent from our iPhone.
>>
>> On Nov 28, 2014, at 2:08 PM, Rafael Xavier wrote:
>>
>> On Fri, Nov 28, 2014 at 7:46 PM, Steven R. Loomis wrote:
>>
>>> On Nov 28, 2014, at 12:56 PM, Rafael Xavier wrote:
>>>
>>> Hello friends, hope you had a blessed Thanksgiving (if you happen to celebrate it).
>>>
>>> Here are a couple of questions I had interpreting 4 Currencies, for which I'd very much appreciate your replies.
>>>
>>> *name currency formatting* (displayName)
>>>
>>>> To format a particular currency value "ZWD" for a particular numeric value *n*:
>>>> ...
>>>> 5. The numeric value, formatted according to the locale with the number of decimals appropriate for the currency, is substituted for {0} in the unitPattern, while the currency display name is substituted for the {1}.
>>>
>>> What does "formatted according to the locale" mean? To use the locale's decimal standard pattern (for example, #,##0.### --- "69,900 US dollars" in *en*)? Any other pattern instead?
>>>
>>> No, the currency pattern.
>>
>> What to do with the symbol from the currency pattern? Ignore/drop it? Otherwise we'd have "¤69,900 US dollars".
>>
>>> What does "with the number of decimals appropriate for the currency" mean? To use the supplemental currency data `digits` and `rounding` values to override the above pattern (for example, "69,900.00 US dollars" in *en*)?
>>>
>>> *supplemental currency data*
>>>
>>>> *digits:* the number of decimal digits normally formatted. The default is 2.
>>>
>>> Is "number of decimal digits" the minimum fraction digits or the maximum fraction digits? I'd assume the minimum.
>>>
>>> Min and max. So USD and EUR=2, so 0.99, 1.00, 1.01, etc
>>
>> Actually, I wasn't sure if it was:
>> - Min and max (e.g., f(1) = "$1.00", f(1.123) = "$1.12"), or
>> - Max only (e.g., f(1) = "$1", f(1.123) = "$1.12").
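
The two readings of `digits` in the list above can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: `format_fraction` is a hypothetical helper, not a CLDR or ICU API; it rounds half-even because the thread does not name a rounding mode; and it ignores grouping separators and the currency symbol.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def format_fraction(value, min_digits, max_digits):
    """Round to max_digits, then keep at least min_digits fraction digits.

    Hypothetical helper for illustration only; not part of CLDR or ICU.
    """
    quantum = Decimal(1).scaleb(-max_digits)  # e.g. 0.01 when max_digits is 2
    rounded = Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_EVEN)
    text = format(rounded, "f")
    whole, _, frac = text.partition(".")
    # Drop trailing zeros, but never go below the minimum fraction digits.
    while len(frac) > min_digits and frac.endswith("0"):
        frac = frac[:-1]
    return whole + ("." + frac if frac else "")


# "Min and max" reading: digits=2 sets both bounds.
print("$" + format_fraction(1, 2, 2))      # $1.00
print("$" + format_fraction(1.123, 2, 2))  # $1.12

# "Max only" reading: digits=2 is only an upper bound.
print("$" + format_fraction(1, 0, 2))      # $1
print("$" + format_fraction(1.123, 0, 2))  # $1.12
```

Both readings agree on 1.123 and differ only on values with fewer fraction digits than the maximum, which is exactly the ambiguity the question is about.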
>>>
>>> Thanks,
>>> Rafael Xavier
>>>
>>> _______________________________________________
>>> CLDR-Users mailing list
>>> CLDR-Users at unicode.org
>>> http://unicode.org/mailman/listinfo/cldr-users

--
+55 (16) 98138-1582, +1 (415) 568-5854, skype: rxaviers
http://rafael.xavier.blog.br

From rxaviers at gmail.com Tue Dec 2 14:37:09 2014
From: rxaviers at gmail.com (Rafael Xavier)
Date: Tue, 2 Dec 2014 18:37:09 -0200
Subject: Formatting currencies
In-Reply-To:
References: <81E51250-6C5E-40CC-BBC4-C8A6D8843DD1@icu-project.org>
Message-ID:

The above open questions became trac tickets:
- http://unicode.org/cldr/trac/ticket/8053
- http://unicode.org/cldr/trac/ticket/8054
- http://unicode.org/cldr/trac/ticket/8055

On Mon, Dec 1, 2014 at 10:52 AM, Rafael Xavier wrote:
> One more question.
>
> 3.2 Special Pattern Characters
>
> ¤ (U+00A4) | Prefix or suffix | No | Currency sign, replaced by currency symbol. If doubled, replaced by international currency symbol. If tripled, uses the long form of the decimal symbol. If present in a pattern, the monetary decimal separator and grouping separators (if available) are used instead of the numeric ones.
>
> The doubled and tripled forms are simply not mentioned in the 4 Currencies section. There is also nothing defining what the "long form" is. Instead, there is displayName, and there is an extensive algorithm explaining how displayNames should be implemented. It is also important to note that unitPattern potentially varies depending on the plural count form. The tripled form doesn't support such variation. Therefore, they are not equivalent.
>
> Are the doubled and tripled forms defined somewhere else? Have they been deprecated and the above table not updated?

From cameron at lumoslabs.com Tue Dec 9 16:56:37 2014
From: cameron at lumoslabs.com (Cameron Dutro)
Date: Tue, 9 Dec 2014 14:56:37 -0800
Subject: Unit Intervals
Message-ID:

Hey CLDR Users,

I took a look through CLDR data this afternoon looking for a way to consistently format what I call "unit intervals" in multiple languages.
It would be useful for my current use case if CLDR contained the data to format phrases like "every year" and "every two years", as well as "every month" and "every two months". Does CLDR currently contain, or are there plans to add, formatting data for such a use case?

Thanks!

-Cameron

From yury.tarasievich at gmail.com Wed Dec 10 00:43:29 2014
From: yury.tarasievich at gmail.com (Yury Tarasievich)
Date: Wed, 10 Dec 2014 09:43:29 +0300
Subject: Unit Intervals
In-Reply-To:
References:
Message-ID: <5487EB91.10405@gmail.com>

Dealing with a similar problem right now, I'd note that "interval" would primarily mean a pair "startvalue, endvalue" with some formatting to it. That formatting isn't even a "wide" cultural tradition, but a "narrow" typographic convention, with a possibly quite extensive definition, subject to change. E.g., for number intervals in Russian-language typography, there are "..." and "--" (U+2013) and "---" (U+2014); of course, the "-" (dash) is commonly used; formerly, U+00F7 was prescribed; in maths-related text you'd meet ":" and ", ... ,;"; in bastardised "computer spelling" -- ".." (two dots). And it is context related, too (U+2013 for dates, U+2014 or ellipsis for numbers).

How to formalise all this into CLDR? Or I may have misunderstood you completely :)

Yury

On 12/10/2014 01:56 AM, Cameron Dutro wrote:
> I took a look through CLDR data this afternoon
> looking for a way to consistently format what I
> call "unit intervals" in multiple languages. It
...

From yury.tarasievich at gmail.com Wed Dec 10 05:15:00 2014
From: yury.tarasievich at gmail.com (Yury Tarasievich)
Date: Wed, 10 Dec 2014 14:15:00 +0300
Subject: Unit Intervals
In-Reply-To:
References: <5487EB91.10405@gmail.com>
Message-ID: <54882B34.5030008@gmail.com>

Then it's a natural language problem, and as such, (almost) completely un-algorithmisable, at least in such a context.

yury

On 12/10/2014 01:43 PM, Philippe Verdy wrote:
> The term "interval" is badly chosen in what he
> describes; Cameron actually wants to express
> periodicity (without any implied start or end;
> i.e. only the frequency).
...

From jkorpela at cs.tut.fi Wed Dec 10 06:16:59 2014
From: jkorpela at cs.tut.fi (Jukka K. Korpela)
Date: Wed, 10 Dec 2014 14:16:59 +0200
Subject: Unit Intervals
In-Reply-To: <54882B34.5030008@gmail.com>
References: <5487EB91.10405@gmail.com> <54882B34.5030008@gmail.com>
Message-ID: <548839BB.9090804@cs.tut.fi>

2014-12-10, 13:15, Yury Tarasievich wrote:
> Then it's a natural language problem, and as such, (almost) completely
> un-algorithmisable, at least in such a context.

It is “algorithmisable” in an essential way, though it is debatable whether this justifies inclusion into CLDR and how important this is relative to all kinds of things that might be included there.

For example, in questionnaires and reports, phrases like “every 2 months” are common. And the amount of time might be something that is determined “dynamically”, i.e. during program execution, and should be presented without any contribution from a human being, i.e. automatically.

An obvious problem here is that different languages use different expressions, so that this is not just a matter of using a pattern consisting of a word like “every” and a designation of amount of time. Some languages use expressions that would be something like “each 2nd month” if applied in English. There are probably other approaches too, so some pre-study would be needed to find out the basic overall patterns.
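
The pattern-based approach discussed above can be sketched in a few lines of Python. This is a minimal sketch under loud assumptions: the `PERIODICITY_PATTERNS` table is invented for illustration and is not CLDR data, `format_periodicity` is a hypothetical helper, and only the English cardinal plural rule for whole numbers is implemented.

```python
# Hypothetical pattern table, keyed by locale, unit and plural category.
# None of this is CLDR data; it only mirrors the kind of phrases in the thread.
PERIODICITY_PATTERNS = {
    "en": {
        "month": {"one": "every month", "other": "every {0} months"},
        "year": {"one": "every year", "other": "every {0} years"},
    },
}


def plural_category_en(n):
    """English cardinal rule for whole numbers: 1 is 'one', all else 'other'."""
    return "one" if n == 1 else "other"


# Each locale needs its own plural rule; only English is sketched here.
PLURAL_RULES = {"en": plural_category_en}


def format_periodicity(n, unit, locale="en"):
    """Pick the pattern for n's plural category and substitute the number."""
    category = PLURAL_RULES[locale](n)
    pattern = PERIODICITY_PATTERNS[locale][unit][category]
    return pattern.replace("{0}", str(n))


print(format_periodicity(1, "month"))  # every month
print(format_periodicity(2, "month"))  # every 2 months
print(format_periodicity(2, "year"))   # every 2 years
```

A language that phrases this as “each 2nd month” would need a different pattern plus ordinal formatting of the number, which is why a survey of the basic patterns would have to come before fixing any data structure.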

Yucca

From rxaviers at gmail.com Wed Dec 10 14:54:12 2014
From: rxaviers at gmail.com (Rafael Xavier)
Date: Wed, 10 Dec 2014 18:54:12 -0200
Subject: Bundle Lookup
Message-ID:

Friends,

This is a very basic question. See below.

There is a lot of documentation about locale inheritance and matching. But it fails in some cases for me.

*Given a locale, what's the procedure to find the bundle lookup chain?*

1. en-US: en-US → (truncation) en → root
   This one is dead simple. No problem.

2. en-GB: en-GB → (parentLocale) en-001 → (truncation) en → root
   This one is also dead simple. However, the documentation says en-GB → en. Is it outdated or am I doing something wrong?

Anyway, the ones I'm interested in knowing are:

3. en-Latn-GB
4. en-US-u-nu-usd
5. zh-TW

Please, could someone show me what the chain is for these locales (and obviously explain the steps)?

Thanks!

From verdy_p at wanadoo.fr Wed Dec 10 04:43:38 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 10 Dec 2014 11:43:38 +0100
Subject: Unit Intervals
In-Reply-To: <5487EB91.10405@gmail.com>
References: <5487EB91.10405@gmail.com>
Message-ID:

The term "interval" is badly chosen in what he describes; Cameron actually wants to express periodicity (without any implied start or end; i.e. only the frequency).

However, his description is limited to multiples of a base time unit, and he forgets submultiples: e.g. "twice a week" vs. "every two weeks".

The intended usage could include displaying an interface for controlling a task scheduler (such as the shell command "at"), or adding a repeated event in a personal calendar.

I suggest he look into existing interfaces for personal calendars (in smartphones for example) or help pages for programmed schedulers ("at", "cron", Task Manager in Windows; etc.).

2014-12-10 7:43 GMT+01:00 Yury Tarasievich :
> Dealing with similar problem right now, I'd note that "interval" would primarily mean a pair "startvalue, endvalue" with some formatting to it. That formatting isn't even "widely" cultural tradition, but "narrow" typographic convention, with possibly quite extensive definition, subject to change. E.g., for numbers intervals in Russian language typography, there are "..." and "--" (U+2013) and "---" (U+2014); of course, the "-" (dash) is commonly used; formerly, the U+00F7 was prescribed; in maths related text you'd meet ":" and ", ... ,;"; in bastardised "computer spelling" -- ".." (two dots). And it is context related, too (U+2013 for dates, U+2014 or ellipsis for numbers).
>
> How to formalise all this into CLDR? Or I may have misunderstood you completely :)
>
> Yury

From cameron at lumoslabs.com Thu Dec 11 13:21:13 2014
From: cameron at lumoslabs.com (Cameron Dutro)
Date: Thu, 11 Dec 2014 11:21:13 -0800
Subject: Unit Intervals
In-Reply-To:
References: <5487EB91.10405@gmail.com>
Message-ID:

Hey Philippe,

Yes, you're absolutely right, periodicity is a much better description of what I'm after. How would looking at the interfaces for personal calendars or help pages for schedulers like cron be helpful? I'm looking for translations for said units - I'm not interested in actually scheduling events.

-Cameron

On Wed, Dec 10, 2014 at 2:43 AM, Philippe Verdy wrote:
> The term "interval" is badly chosen in what he describes; Cameron actually wants to express periodicity (without any implied start or end; i.e. only the frequency).
> However his description is limited to multiples of a base time unit; but he forgets submultiples: e.g. "twice a week" vs. "every two weeks"
> The intended usage could include displaying an interface for controlling a task scheduler (such as the shell command "at"); or adding a repeated event in a personal calendar.
> I suggest he looks into existing interfaces for personal calendars (in smartphones for example) or help pages for programmed schedulers ("at", "cron", Task Manager in Windows; etc.).

From mark at macchiato.com Wed Dec 10 05:58:26 2014
From: mark at macchiato.com (Mark Davis ☕️)
Date: Wed, 10 Dec 2014 12:58:26 +0100
Subject: Unit Intervals
In-Reply-To: <5487EB91.10405@gmail.com>
References: <5487EB91.10405@gmail.com>
Message-ID:

On Wed, Dec 10, 2014 at 7:43 AM, Yury Tarasievich <yury.tarasievich at gmail.com> wrote:
> Dealing with similar problem right now, I'd note that "interval" would primarily mean a pair "startvalue, endvalue" with some formatting to it. That formatting isn't even "widely" cultural tradition, but "narrow" typographic convention, with possibly quite extensive definition, subject to change. E.g., for numbers intervals in Russian language typography, there are "..." and "--" (U+2013) and "---" (U+2014); of course, the "-" (dash) is commonly used; formerly, the U+00F7 was prescribed; in maths related text you'd meet ":" and ", ... ,;"; in bastardised "computer spelling" -- ".." (two dots). And it is context related, too (U+2013 for dates, U+2014 or ellipsis for numbers).

The first message was about recurring dates, like "every Tuesday" or "Monday and Wednesdays, the 3rd week of each month". We have thought about adding those (there are some bugs about them), but haven't yet.

The second message is about intervals / ranges. We support locale-specific date intervals, and ranges of other elements (typically numbers), and elision (when intervening elements are removed, as in "A very … long message").

We don't support multiple choices for any particular interval/range.
If you have suggestions for improvements...

Mark

*Il meglio è l'inimico del bene*

From cameron at lumoslabs.com Thu Dec 11 13:52:48 2014
From: cameron at lumoslabs.com (Cameron Dutro)
Date: Thu, 11 Dec 2014 11:52:48 -0800
Subject: Unit Intervals
In-Reply-To:
References: <5487EB91.10405@gmail.com>
Message-ID:

Hey Mark,

Where would I find the locale-specific date intervals you mentioned? Are you referring to phrases like "In 2 weeks" and the like?

-Cameron

On Wed, Dec 10, 2014 at 3:58 AM, Mark Davis ☕️ wrote:
> The first message was about recurring dates, like "every Tuesday" or "Monday and Wednesdays, the 3rd week of each month". We have thought about adding those (there are some bugs about them), but haven't yet.
>
> The second message is about intervals / ranges. We support locale-specific date intervals, and ranges of other elements (typically numbers), and elision (when intervening elements are removed, as in "A very … long message").
>
> We don't support multiple choices for any particular interval/range. If you have suggestions for improvements...
>
> Mark

From cameron at lumoslabs.com Wed Dec 10 15:44:24 2014
From: cameron at lumoslabs.com (Cameron Dutro)
Date: Wed, 10 Dec 2014 13:44:24 -0800
Subject: Unit Intervals
In-Reply-To: <548839BB.9090804@cs.tut.fi>
References: <5487EB91.10405@gmail.com> <54882B34.5030008@gmail.com> <548839BB.9090804@cs.tut.fi>
Message-ID:

Hey Yury and Jukka,

This is great info, thank you both for explaining the inherent issues in collecting such data. I imagine we could come up with several different contexts that would serve as the main use cases for such data, then add additional contexts as they become necessary. This would be similar to how CLDR currently stores data for full, narrow, and abbreviated date and unit formats.

I'm envisioning something in the XML like:

every year
every {0} years
each year
every {0} years

(I just made all this up, it's not scientifically derived in any way.)

Would having multiple contexts be appropriate for Russian and other languages?

-Cameron

On Wed, Dec 10, 2014 at 4:16 AM, Jukka K. Korpela wrote:
> It is “algorithmisable” in an essential way, though it is debatable whether this justifies inclusion into CLDR and how important this is relative to all kinds of things that might be included there.
>
> For example, in questionnaires and reports, phrases like “every 2 months” are common. And the amount of time might be something that is determined “dynamically”, i.e. during program execution, and should be presented without any contribution from a human being, i.e. automatically.
>
> An obvious problem here is that different languages use different expressions, so that this is not just a matter of using a pattern consisting of a word like “every” and a designation of amount of time. Some languages use expressions that would be something like “each 2nd month” if applied in English. There are probably other approaches too, so some pre-study would be needed to find out the basic overall patterns.
>
> Yucca

From verdy_p at wanadoo.fr Thu Dec 11 14:03:15 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 11 Dec 2014 21:03:15 +0100
Subject: Unit Intervals
In-Reply-To:
References: <5487EB91.10405@gmail.com>
Message-ID:

"Periodicity" matches your description with multiples of time units; for submultiples the term is "frequency", its inverse. There are SI-derived units for them, but only for non-calendar units: seconds, minutes and hours are OK; days and weeks may be OK if we ignore UTC adjustments; months and years are definitely not OK unless we just set an average, but the intended usage is generally aligned with a calendar (generally the Gregorian one).

However, you don't just want to express the duration of the period, but also the fact that this is a period ("every..."). For submultiples, the concept of frequency also has derived units (hertz, rpm...), but here also it is more complicated for expressions like "twice a year", as the base unit of time (year) is not fixed unless you take an average year length.

There are expressions combining time units and prefixes as adverbs or adjectives (e.g. "biweekly", "quarterly"). These are common contractions, but generally an interface will be built using some generic static text with selectors for the base time unit, a counter, and possibly selectors for days of the week or days of the month (this last option implying fixing a relative start date) when the base unit of time is calendar-based (week, month, year). It will also have separate settings for the scheduled time of day, and for when the repeated event can be allowed or restarted in case of failure to start in time or in case of other conditions (computer or network load, synchronization with an external event, power conditions; state of some devices such as screen on/off, case closure; battery level; monitored temperature sensors and fan controls; some maximum file size, or free space on disk...).

All these are specific to the kind of scheduler you have. For managing personal or business calendars, other conditions could include availability of some contacts; completion states or delays of some other works, pricings; location of user, security levels...

Things will be expressed differently if you speak about publications and productions, or make an application for logistics and transportation purposes. And it's difficult to map all distinct concepts across cultures/countries/languages/jurisdictions.

2014-12-11 20:21 GMT+01:00 Cameron Dutro :
> Hey Philippe,
>
> Yes, you're absolutely right, periodicity is a much better description of what I'm after. How would looking at the interfaces for personal calendars or help pages for schedulers like cron be helpful? I'm looking for translations for said units - I'm not interested in actually scheduling events.
>
> -Cameron

From verdy_p at wanadoo.fr Thu Dec 11 14:08:06 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 11 Dec 2014 21:08:06 +0100
Subject: Unit Intervals
In-Reply-To:
References: <5487EB91.10405@gmail.com>
Message-ID:

No. The "date intervals" currently described in CLDR are in fact about *ranges* of dates or timestamps (e.g. from 1 January 2005 to 31 December 2014) and their shorthand numeric notations, possibly abbreviating some common elements, such as when the start and end dates fall in the same year or in the same month of the same year.

2014-12-11 20:52 GMT+01:00 Cameron Dutro :
> Hey Mark,
>
> Where would I find the locale-specific date intervals you mentioned? Are you referring to phrases like "In 2 weeks" and the like?
>
> -Cameron

From cameron at lumoslabs.com Thu Dec 11 15:00:33 2014
From: cameron at lumoslabs.com (Cameron Dutro)
Date: Thu, 11 Dec 2014 13:00:33 -0800
Subject: Unit Intervals
In-Reply-To:
References: <5487EB91.10405@gmail.com>
Message-ID:

Philippe,

Did you see the example XML markup in one of my previous emails? It allows for different contexts and could be expanded to service any of the contexts you mentioned, e.g. frequency, completion of state, etc. We don't have to accommodate absolutely every use case here - I was hoping to start with a basic periodical format that would contain "every year", "every 2 years", etc. I don't know enough about every language to know how difficult this data would be to gather; however, it sounds like there are quite a few contexts we can ignore in v1.

-Cameron

On Thu, Dec 11, 2014 at 12:08 PM, Philippe Verdy wrote:
> No.
the "date intervals" currently described in CLDR are in fact about > *ranges* of dates or timestamps (e.g. from 1 January 2005 to 31 December > 2014) and the shorthand numeric notations, possibly abbreviating some > common elements such as when the start and end dates fall on the same year > or the same month in the same year. > > 2014-12-11 20:52 GMT+01:00 Cameron Dutro : > >> Hey Mark, >> >> Where would I find the locale-specific date intervals you mentioned? Are >> you referring to phrases like "In 2 weeks" and the like? >> >> -Cameron >> >> On Wed, Dec 10, 2014 at 3:58 AM, Mark Davis ?? >> wrote: >> >>> >>> On Wed, Dec 10, 2014 at 7:43 AM, Yury Tarasievich < >>> yury.tarasievich at gmail.com> wrote: >>> >>>> Dealing with similar problem right now, I'd note that "interval" would >>>> primarily mean a pair "startvalue, endvalue" with some formatting to it. >>>> That formatting isn't even "widely" cultural tradition, but "narrow" >>>> typographic convention, with possibly quite extensive definition, subject >>>> to change. E.g., for numbers intervals in Russian language typography, >>>> there are "..." and "--" (U+2013) and "---" (U+2014); of course, the "-" >>>> (dash) is commonly used; formerly, the U+00F7 was prescribed; in maths >>>> related text you'd meet ":" and ", ... ,;"; in bastardised "computer >>>> spelling" -- ".." (two dots). And it is context related, too (U+2013 for >>>> dates, U+2014 or ellipsis for numbers). >>>> >>> >>> The first message was about recurring dates, like "every Tuesday" or >>> "Monday and Wednesdays, the 3rd week of each month". ?We have thought about >>> adding those (there are some bugs about them), but haven't yet. >>> >>> The second message is about intervals / ranges. We support >>> locale-specific date intervals, and ranges of other elements (typically >>> numbers), and elision (when intervening elements are removed, as in "A very >>> ? long message"). 
>>> >>> We don't support multiple choices for any particular interval/range. If >>> you have suggestions for improvements... >>> >>> Mark >>> >>> *? Il meglio ? l?inimico del bene ?* >>> >>> _______________________________________________ >>> CLDR-Users mailing list >>> CLDR-Users at unicode.org >>> http://unicode.org/mailman/listinfo/cldr-users >>> >>> >> >> _______________________________________________ >> CLDR-Users mailing list >> CLDR-Users at unicode.org >> http://unicode.org/mailman/listinfo/cldr-users >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From emmo at us.ibm.com Thu Dec 11 16:53:59 2014 From: emmo at us.ibm.com (John Emmons) Date: Thu, 11 Dec 2014 16:53:59 -0600 Subject: Bundle Lookup In-Reply-To: References: Message-ID: #3 is currently a problem, which we are working on. Basically, "Latn" needs to be stripped out because it isn't necessary. Then follow the normal inheritance: en-GB: en-GB ? (parentLocale) en-001 ? (truncation) en ? root #4 - Any unicode locale extensions are meant to identify particular behaviors that are desired in the context of a given locale. Think of them like "options". They are not meant to be used in the context of bundle lookups. #5 - zh_TW - Now that proper language aliases are in place ( See http://unicode.org/cldr/trac/ticket/5949 ) zh-TW: zh-TW ? (languageAlias) zh-Hant-TW ? (truncation) zh-Hant (parentLocale) ? root Regards, John C. Emmons Globalization Architect & Unicode CLDR TC Chairman IBM Software Group Internet: emmo at us.ibm.com From: Rafael Xavier To: "cldr-users at unicode.org" Cc: J?rn Zaefferer Date: 12/11/2014 01:02 PM Subject: Bundle Lookup Sent by: "CLDR-Users" Friends, This is a very basic question. See below. There are lots of documentation about locale inheritance and matching. But, it fails in same cases to me. Giving a locale, what's the procedure to find the bundle lookup chain? 1. en-US: en-US ? (truncation) en ? root This one is dead simple. No problem. 
2. en-GB: en-GB ? (parentLocale) en-001 ? (truncation) en ? root This one is also dead simple. Although, documentation says en-GB ? en. Is it outdated or am I doing something wrong? Anyway, the ones I'm interested in knowing are: 3. en-Latn-GB 4. en-US-u-nu-usd 5. zh-TW Please, could someone show me what's the chain of these locales (and obviously explain the steps)? Thanks! -- +55 (16) 98138-1582, +1 (415) 568-5854, skype: rxaviers http://rafael.xavier.blog.br_______________________________________________ CLDR-Users mailing list CLDR-Users at unicode.org http://unicode.org/mailman/listinfo/cldr-users -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From ehoogerbeets at gmail.com Thu Dec 11 19:41:18 2014 From: ehoogerbeets at gmail.com (Edwin Hoogerbeets) Date: Thu, 11 Dec 2014 17:41:18 -0800 Subject: Bundle Lookup In-Reply-To: References: Message-ID: <548A47BE.5080900@gmail.com> Rafael, also take a look at common/supplemental/likelySubtags.xml. If the caller has passed you an incompletely specified locale, you can use those mappings to see if you can get to a locale for which you do have a string bundle. I think that is the source for the "language aliases" to which John was referring. John, for the last part of your example zh-TW inheritance chain, wouldn't you just truncate "zh-Hant" again to "zh" like in the en-GB example before inheriting from the root? If not, what is the reasoning there? Is there already a document that specifies the inheritance rules in CLDR? For efficiency, I can imagine you would put the common translations in "zh" where there is no difference between traditional and simplified, and other translations in "zh-Hant" or "zh-Hans" where there is. That would save some disk space and you could leverage linguistic bug fixes at the "zh" level. 
For other locales like "sr-Latn" and "sr-Cyrl" there would be nothing in common so the string bundle at the "sr" level would be essentially empty, but it should still appear in the inheritance chain just in case. Edwin On 12/11/2014 02:53 PM, John Emmons wrote: > > #3 is currently a problem, which we are working on. Basically, "Latn" > needs to be stripped out because it isn't necessary. Then follow the > normal inheritance: > > en-GB: en-GB ? (parentLocale) en-001 ? (truncation) en ? root > > #4 - Any unicode locale extensions are meant to identify particular > behaviors that are desired in the context of a given locale. Think of > them like "options". They are not meant to be used in the context of > bundle lookups. > > #5 - zh_TW - Now that proper language aliases are in place ( See > http://unicode.org/cldr/trac/ticket/5949 ) > > zh-TW: zh-TW ? (languageAlias) zh-Hant-TW ? > (truncation) zh-Hant (parentLocale) ? root > > Regards, > > John C. Emmons > Globalization Architect & Unicode CLDR TC Chairman > IBM Software Group > Internet: emmo at us.ibm.com > > > Inactive hide details for Rafael Xavier ---12/11/2014 01:02:57 > PM---Friends, This is a very basic question. See below. There arRafael > Xavier ---12/11/2014 01:02:57 PM---Friends, This is a very basic > question. See below. There are lots of documentation > > From: Rafael Xavier > To: "cldr-users at unicode.org" > Cc: J?rn Zaefferer > Date: 12/11/2014 01:02 PM > Subject: Bundle Lookup > Sent by: "CLDR-Users" > > ------------------------------------------------------------------------ > > > > Friends, > > This is a very basic question. See below. There are lots of > documentation about locale inheritance and matching. But, it fails in > same cases to me. > / > Giving a locale, what's the procedure to find the /*/bundle/*/ lookup > chain?/ > > 1. en-US: en-US ? (truncation) en ? root > > This one is dead simple. No problem. > > 2. en-GB: en-GB ? (parentLocale) en-001 ? (truncation) en ? 
root > > This one is also dead simple. Although, documentation says en-GB ? en. > Is it outdated or am I doing something wrong? > > Anyway, the ones I'm interested in knowing are: > > 3. en-Latn-GB > 4. en-US-u-nu-usd > 5. zh-TW > > Please, could someone show me what's the chain of these locales (and > obviously explain the steps)? > > Thanks! > > -- > _+55 (16) 98138-1582_ , _+1 (415) > 568-5854_ , skype: rxaviers_ > __http://rafael.xavier.blog.br_ > _______________________________________________ > CLDR-Users mailing list > CLDR-Users at unicode.org > http://unicode.org/mailman/listinfo/cldr-users > > > > _______________________________________________ > CLDR-Users mailing list > CLDR-Users at unicode.org > http://unicode.org/mailman/listinfo/cldr-users -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 105 bytes Desc: not available URL: From yury.tarasievich at gmail.com Thu Dec 11 22:27:16 2014 From: yury.tarasievich at gmail.com (Yury Tarasievich) Date: Fri, 12 Dec 2014 07:27:16 +0300 Subject: Unit Intervals In-Reply-To: References: <5487EB91.10405@gmail.com> <54882B34.5030008@gmail.com> <548839BB.9090804@cs.tut.fi> Message-ID: <548A6EA4.8060705@gmail.com> Only I don't quite understand, why there has to be this keyword (which'd need to be maintained and processed with a natural language in mind), and not something like "lead to inverse time units" (in fact, word(s) denoting the fraction line between numerator and denominator): For English it could be "in" (once "in" a year, 1 "in" 24 hours). For Russian, "?" (??? "?" ???, 1 "?" 24 ????) etc. Yury On 12/11/2014 12:44 AM, Cameron Dutro wrote: ... > date and unit formats. I'm envisioning something > in the XML like: > > ... 
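A rough sketch of the "connector word" idea from the message above, in Python. The table and the function name are hypothetical and are not CLDR data; the Russian connector is an assumption based on the usual phrasing. It only shows that a per-locale word standing in for the fraction line could join an already-localized count and period.

```python
# Hypothetical per-locale connector words; this table is NOT part of CLDR.
RATE_CONNECTORS = {
    "en": "in",
    "ru": "в",  # assumed Russian connector
}


def format_rate(locale: str, count_text: str, period_text: str) -> str:
    """Join an already-localized count and period with the locale's connector."""
    connector = RATE_CONNECTORS[locale]
    return f"{count_text} {connector} {period_text}"


if __name__ == "__main__":
    print(format_rate("en", "once", "a year"))
    print(format_rate("en", "1", "24 hours"))
    print(format_rate("ru", "1", "24 часа"))
```

A single connector per locale ignores plural and case agreement, which is one reason a pattern per plural form may still be needed.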
From doug at ewellic.org Thu Dec 11 22:42:15 2014
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 11 Dec 2014 21:42:15 -0700
Subject: Bundle Lookup
In-Reply-To: 
References: 
Message-ID: <32520D15CE4D4CF286D6EBB917CBCC0C@DougEwell>

John Emmons wrote:

> #3 is currently a problem, which we are working on. Basically, "Latn"
> needs to be stripped out because it isn't necessary. Then follow the
> normal inheritance:
>
> en-GB: en-GB → (parentLocale) en-001 → (truncation) en → root

Alternatively, you can say that 'Latn' needs to be stripped out because
it's the BCP 47 Suppress-Script for 'en'. That sounds less like a
personal opinion than "because it isn't necessary."

--
Doug Ewell | Thornton, CO, USA | http://ewellic.org

From emmo at us.ibm.com Fri Dec 12 10:04:51 2014
From: emmo at us.ibm.com (John Emmons)
Date: Fri, 12 Dec 2014 10:04:51 -0600
Subject: Bundle Lookup
In-Reply-To: <548A47BE.5080900@gmail.com>
References: <548A47BE.5080900@gmail.com>
Message-ID: 

Yes, Edwin, there is a very good reason we don't want zh-Hant to inherit
from zh. Simply put, in situations where you have locale resources that
aren't 100% populated, allowing zh-Hant to inherit from zh produces a
mixture of simplified and traditional Chinese, which is acceptable to no
one. This is what we call "cross script inheritance" in CLDR. While it
might be acceptable to some in the case of Chinese, it is certainly a
bigger problem in languages like Serbian, where you have both Latin and
Cyrillic scripts in use, and you certainly don't ever want a mixture of
Latin and Cyrillic scripts.

These relationships are documented in CLDR's supplemental data, where you
have specified:

Regards,

John C. Emmons
Globalization Architect & Unicode CLDR TC Chairman
IBM Software Group
Internet: emmo at us.ibm.com
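The inheritance chains quoted in this thread (explicit parent first, otherwise truncation) can be sketched roughly as follows. This is only an illustration: the `PARENT_LOCALES` table is a small hand-typed excerpt assumed for the example, and the function names are made up here; the authoritative parent data is in CLDR's supplemental data.

```python
from typing import Dict, List

# Assumed, hand-typed excerpt of explicit parents (illustration only).
PARENT_LOCALES: Dict[str, str] = {
    "en-GB": "en-001",
    "zh-Hant": "root",  # avoids cross-script inheritance from "zh"
    "sr-Latn": "root",
}


def parent_of(locale: str) -> str:
    """Return the explicit parent if one is listed, else truncate the last subtag."""
    if locale in PARENT_LOCALES:
        return PARENT_LOCALES[locale]
    if "-" in locale:
        return locale.rsplit("-", 1)[0]
    return "root"


def inheritance_chain(locale: str) -> List[str]:
    """Walk from the given bundle ID up to root."""
    chain = [locale]
    while chain[-1] != "root":
        chain.append(parent_of(chain[-1]))
    return chain


if __name__ == "__main__":
    print(inheritance_chain("en-US"))       # ['en-US', 'en', 'root']
    print(inheritance_chain("en-GB"))       # ['en-GB', 'en-001', 'en', 'root']
    print(inheritance_chain("zh-Hant-TW"))  # ['zh-Hant-TW', 'zh-Hant', 'root']
```

Because "zh-Hant" maps straight to root in the table, a partially populated zh-Hant bundle never picks up simplified Chinese strings from "zh".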
From mark at macchiato.com Fri Dec 12 10:50:44 2014
From: mark at macchiato.com (Mark Davis ☕️)
Date: Fri, 12 Dec 2014 17:50:44 +0100
Subject: Bundle Lookup
In-Reply-To: 
References: <548A47BE.5080900@gmail.com>
Message-ID: 

I also want to be clear that there are two closely related but very
different tasks.

1. *Inherited item lookup.* Given that you have a CLDR resource bundle,
   with inheritance, where do I go to get inherited items?

   That is specified by CLDR by means of the parentLocale + truncation
   algorithm, plus the alias element. (There are a few cases where we have
   "Lateral Inheritance" where the specification is in the text of LDML,
   such as when looking for an alt variant.)

   So back to Rafael's original question:

   1. en-Latn-GB and zh-TW are not CLDR bundles, so this doesn't apply
      to them.
   2. en-US-u-nu-usd: the u-nu-usd doesn't select within a bundle, but
      rather customizes a service that uses information in the bundle.
      The item lookup (used by the currency formatting service) would be
      en-US => en => root.

2. *Bundle lookup.* Given a locale ID, where do I get the best matching
   CLDR bundle?

   My application has a set of supported locales, and the user comes in
   with a set of desired locales. What is the best bundle for that user?

   Here we are not as clear as we should be. The recommended process is in
   http://www.unicode.org/reports/tr35/#LanguageMatching

   So back to Rafael's original question:

   1. en-Latn-GB and zh-TW. When these are looked up with Language
      Matching, assuming that all the CLDR locales are available, they
      would return, respectively, en-GB and zh-Hant-TW.

That being said, often people don't understand language matching, and so
we are in the process of adding more information so that there is a direct
mapping between locale IDs that are always considered to be "identical" on
a deep level, like en-GB and en-Latn-GB.

Mark

*Il meglio è l'inimico del bene*
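For the second task above (finding a bundle for an arbitrary locale ID), a much-simplified sketch is shown below. It is NOT the Language Matching algorithm from TR35; it only illustrates treating IDs such as en-GB and en-Latn-GB as equivalent. The `LIKELY_SCRIPT` table is a tiny hand-typed excerpt assumed for the example (the real mappings are in common/supplemental/likelySubtags.xml), the function names are invented here, and variants and other extensions are ignored.

```python
from typing import Dict, List, Optional, Set, Tuple

# Assumed, hand-typed excerpt: (language, region) -> likely script.
LIKELY_SCRIPT: Dict[Tuple[str, str], str] = {
    ("en", "GB"): "Latn",
    ("en", "US"): "Latn",
    ("zh", "TW"): "Hant",
}


def strip_unicode_extension(tag: str) -> str:
    """Drop a "-u-..." extension; it configures services, not bundle choice."""
    return tag.partition("-u-")[0]


def equivalent_ids(tag: str) -> List[str]:
    """List spellings of the tag with and without its likely script."""
    parts = strip_unicode_extension(tag).split("-")
    language, rest = parts[0], parts[1:]
    script = next((p for p in rest if len(p) == 4 and p.isalpha()), None)
    region = next((p for p in rest if len(p) == 2 and p.isalpha()), None)
    if region is None:
        return ["-".join(parts)]
    likely = LIKELY_SCRIPT.get((language, region))
    if script is None:
        script = likely
    ids = []
    if script is not None:
        ids.append(f"{language}-{script}-{region}")
    if script is None or script == likely:
        ids.append(f"{language}-{region}")
    return ids


def find_bundle(tag: str, available: Set[str]) -> Optional[str]:
    """Return the first equivalent ID for which a bundle exists, if any."""
    for candidate in equivalent_ids(tag):
        if candidate in available:
            return candidate
    return None


if __name__ == "__main__":
    bundles = {"en-US", "en-GB", "zh-Hant-TW"}
    for requested in ("en-Latn-GB", "zh-TW", "en-US-u-nu-usd"):
        print(requested, "->", find_bundle(requested, bundles))
```

Once a bundle ID is found this way, the inherited item lookup (parentLocale + truncation) proceeds from that ID.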
Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From rxaviers at gmail.com Fri Dec 12 12:48:08 2014 From: rxaviers at gmail.com (Rafael Xavier) Date: Fri, 12 Dec 2014 16:48:08 -0200 Subject: Bundle Lookup In-Reply-To: References: <548A47BE.5080900@gmail.com> Message-ID: Mark, Giving an arbitrary locale ID, the recommended and only process to deduce its respective bundle (reliably) is through Language Matching. Is that true? Considering all bundles are always present, isn't there any less expensive algorithm that could be recommended? Thank you. PS: My use case is a little different. I have *n* distributions of my application. On each distribution, it's embedded with a different locale. So, I don't need the full power of Language Matching on what's regard having an arbitrary list of desired locales vs an aribtrary list of available locales. Anyway, I do want my application to look up for the right bundle given a locale (e.g., `zh-Hans-TW` when given `zh-TW`). On Fri, Dec 12, 2014 at 2:50 PM, Mark Davis ?? wrote: > > I also want to be clear that there are two closely-related but very > different tasks. > > 1. *Inherited item lookup. *Given that you have a CLDR resource bundle, > with inheritance, where do I go to get inherited items? > > That is specified by CLDR by means of the parentLocale + truncation > algorithm, plus the alias element. (There are a few cases where we have > "Lateral Inheritance" where the specification is in the text of LDML, > such as when looking for an alt variant.) > > So back to Rafael's original question: > > 1. en-Latn-GB, and zh-TW are not CLDR bundles, so this doesn't apply > to them. > 2. en-US-u-nu-usd: the u-nu-usd doesn't select within a bundle, but > rather customizes a service that uses information in the bundle. The item > lookup (using by the currency formatting service) would be en-US => en > => root. > > > 2. *Bundle lookup. *Given a locale ID, where do I get the best matching > CLDR bundle? 
> > My application has a set of supported locales, and the user comes in with > a set of desired locales. What is the best bundle for that user? > > Here we are not as clear as we should be. The recommended process is in > http://www.unicode.org/reports/tr35/#LanguageMatching > > So back to Rafael's original question: > > 1. en-Latn-GB, and zh-TW. When these are looked up with Language > Matching, assuming that all the CLDR locales are available, they would > return, respectively, en-GB and zh-Hant-TW. > > That being said, often people don't understand language matching, and so > we are in the process of adding more information so that there is a direct > mapping from between locale IDs that are always considered to be > "identical" on a deep level, like en-GB and en-Latn-GB. > > > > Mark > > *? Il meglio ? l?inimico del bene ?* > > On Fri, Dec 12, 2014 at 5:04 PM, John Emmons wrote: > >> Yes, Edward, there is a very good reason we don't want zh-Hant to inherit >> from zh. Simply put, in situations where you have locale resources that >> aren't 100% populated, allowing zh-Hant to inherit from zh produces a >> mixture of simplified and traditional Chinese, which is acceptable to no >> one. This is what we call "cross script inheritance" in CLDR. While it >> might be acceptable to some in the case of Chinese, it is certainly a >> bigger problem in languages like Serbian, where you have both Latin and >> Cyrillic scripts in use, and you certainly don't ever want a mixture of >> Latin and Cyrillic scripts >> >> These relationships are documented in CLDR's supplemental data, where you >> have specified: >> >> >> >> >> Regards, >> >> John C. 
Emmons
>> Globalization Architect & Unicode CLDR TC Chairman
>> IBM Software Group
>> Internet: emmo at us.ibm.com
>>
>>
>> From: Edwin Hoogerbeets
>> To: John Emmons/Austin/IBM at IBMUS, Rafael Xavier
>> Cc: Jörn Zaefferer, "cldr-users at unicode.org"
>> Date: 12/11/2014 07:41 PM
>> Subject: Re: Bundle Lookup
>> ------------------------------
>>
>> Rafael, also take a look at common/supplemental/likelySubtags.xml. If the
>> caller has passed you an incompletely specified locale, you can use those
>> mappings to see if you can get to a locale for which you do have a string
>> bundle. I think that is the source for the "language aliases" to which John
>> was referring.
>>
>> John, for the last part of your example zh-TW inheritance chain, wouldn't
>> you just truncate "zh-Hant" again to "zh" like in the en-GB example before
>> inheriting from the root? If not, what is the reasoning there? Is there
>> already a document that specifies the inheritance rules in CLDR?
>>
>> For efficiency, I can imagine you would put the common translations in
>> "zh" where there is no difference between traditional and simplified, and
>> other translations in "zh-Hant" or "zh-Hans" where there is. That would
>> save some disk space and you could leverage linguistic bug fixes at the
>> "zh" level. For other locales like "sr-Latn" and "sr-Cyrl" there would be
>> nothing in common so the string bundle at the "sr" level would be
>> essentially empty, but it should still appear in the inheritance chain just
>> in case.
>>
>> Edwin
>>
>>
>> On 12/11/2014 02:53 PM, John Emmons wrote:
>>
>> #3 is currently a problem, which we are working on. Basically,
>> "Latn" needs to be stripped out because it isn't necessary. Then follow
>> the normal inheritance:
>>
>> en-GB: en-GB → (parentLocale) en-001 → (truncation) en → root
>>
>> #4 - Any Unicode locale extensions are meant to identify particular
>> behaviors that are desired in the context of a given locale. Think of them
>> like "options". They are not meant to be used in the context of bundle
>> lookups.
>>
>> #5 - zh_TW - Now that proper language aliases are in place (see
>> http://unicode.org/cldr/trac/ticket/5949):
>>
>> zh-TW: zh-TW → (languageAlias) zh-Hant-TW → (truncation) zh-Hant →
>> (parentLocale) root
>>
>> Regards,
>>
>> John C. Emmons
>> Globalization Architect & Unicode CLDR TC Chairman
>> IBM Software Group
>> Internet: emmo at us.ibm.com
>>
>>
>> From: Rafael Xavier
>> To: "cldr-users at unicode.org"
>> Cc: Jörn Zaefferer
>> Date: 12/11/2014 01:02 PM
>> Subject: Bundle Lookup
>> Sent by: "CLDR-Users"
>> ------------------------------
>>
>> Friends,
>>
>> This is a very basic question. See below. There is a lot of
>> documentation about locale inheritance and matching, but it fails me in
>> some cases.
>>
>> *Given a locale, what's the procedure to find the bundle lookup chain?*
>>
>> 1. en-US: en-US → (truncation) en → root
>>
>> This one is dead simple. No problem.
>>
>> 2. en-GB: en-GB → (parentLocale) en-001 → (truncation) en → root
>>
>> This one is also dead simple, although the documentation says
>> en-GB → en. Is it outdated or am I doing something wrong?
>>
>> Anyway, the ones I'm interested in knowing are:
>>
>> 3. en-Latn-GB
>> 4. en-US-u-nu-usd
>> 5. zh-TW
>>
>> Please, could someone show me the chain for each of these locales (and
>> explain the steps)?
>>
>> Thanks!
>>
>> --
>> +55 (16) 98138-1582, +1 (415) 568-5854, skype: rxaviers
>> http://rafael.xavier.blog.br
>> _______________________________________________
>> CLDR-Users mailing list
>> CLDR-Users at unicode.org
>> http://unicode.org/mailman/listinfo/cldr-users
>>
>

--
+55 (16) 98138-1582, +1 (415) 568-5854, skype: rxaviers
http://rafael.xavier.blog.br


From mark at macchiato.com Fri Dec 12 14:27:09 2014
From: mark at macchiato.com (Mark Davis ☕️)
Date: Fri, 12 Dec 2014 21:27:09 +0100
Subject: Bundle Lookup
In-Reply-To:
References: <548A47BE.5080900@gmail.com>
Message-ID:

Mark

*— Il meglio è l’inimico del bene —*

On Fri, Dec 12, 2014 at 7:48 PM, Rafael Xavier wrote:

> Mark,
>
> Given an arbitrary locale ID, the recommended and only process to deduce
> its respective bundle (reliably) is through Language Matching.
>
> Is that true?
>

As I said: "That being said, often people don't understand language
matching, and so we are in the process of adding more information so that
there is a direct mapping between locale IDs that are always considered to
be "identical" on a deep level, like en-GB and en-Latn-GB."

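A minimal sketch of the item-lookup chains quoted in this thread (parentLocale first, then truncation, ending at root). It assumes a tiny hand-written parent table; `PARENT_LOCALES` and `inheritance_chain` are illustrative names, not CLDR or ICU APIs, and the table holds only the two entries mentioned in the thread rather than the real supplemental data. It also assumes the input is already the ID of an existing bundle: IDs such as en-Latn-GB or zh-TW would first have to be resolved to a bundle (for example through Language Matching), and extensions such as -u-... are not handled.

```python
# Hypothetical, hand-copied subset of the parentLocale relationships named
# in this thread; the real table lives in CLDR's supplemental data.
PARENT_LOCALES = {
    "en-GB": "en-001",
    "zh-Hant": "root",
}


def inheritance_chain(bundle_id):
    """Return the lookup chain for a bundle ID, ending at "root".

    An explicit parentLocale entry wins; otherwise the last subtag is
    truncated; a bare language falls back to "root".
    """
    chain = [bundle_id]
    current = bundle_id
    while current != "root":
        if current in PARENT_LOCALES:
            current = PARENT_LOCALES[current]      # explicit parent
        elif "-" in current:
            current = current.rsplit("-", 1)[0]    # truncation
        else:
            current = "root"                       # bare language
        chain.append(current)
    return chain


if __name__ == "__main__":
    for bundle in ("en-US", "en-GB", "zh-Hant-TW"):
        print(" -> ".join(inheritance_chain(bundle)))
```

With this table, zh-Hant stops at root instead of truncating to zh, which is the cross-script case discussed in the thread.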
>
> Considering all bundles are always present, isn't there any less expensive
> algorithm that could be recommended?
>
> Thank you.
>
>
> PS: My use case is a little different. I have *n* distributions of my
> application. Each distribution is embedded with a different locale.
> So, I don't need the full power of Language Matching with regard to
> matching an arbitrary list of desired locales against an arbitrary list of
> available locales. Anyway, I do want my application to look up the
> right bundle given a locale (e.g., `zh-Hant-TW` when given `zh-TW`).
>
> On Fri, Dec 12, 2014 at 2:50 PM, Mark Davis ☕️ wrote:
>>
>> I also want to be clear that there are two closely-related but very
>> different tasks.
>>
>> 1. *Inherited item lookup.* Given that you have a CLDR resource bundle,
>> with inheritance, where do I go to get inherited items?
>>
>> That is specified by CLDR by means of the parentLocale + truncation
>> algorithm, plus the alias element. (There are a few cases where we have
>> "Lateral Inheritance" where the specification is in the text of LDML,
>> such as when looking for an alt variant.)
>>
>> So back to Rafael's original question:
>>
>>    1. en-Latn-GB and zh-TW are not CLDR bundles, so this doesn't apply
>>    to them.
>>    2. en-US-u-nu-usd: the u-nu-usd doesn't select within a bundle, but
>>    rather customizes a service that uses information in the bundle. The
>>    item lookup (used by the currency formatting service) would be
>>    en-US => en => root.
>>
>>
>> 2. *Bundle lookup.* Given a locale ID, where do I get the best matching
>> CLDR bundle?
>>
>> My application has a set of supported locales, and the user comes in with
>> a set of desired locales. What is the best bundle for that user?
>>
>> Here we are not as clear as we should be. The recommended process is in
>> http://www.unicode.org/reports/tr35/#LanguageMatching
>>
>> So back to Rafael's original question:
>>
>>    1. en-Latn-GB and zh-TW. When these are looked up with Language
>>    Matching, assuming that all the CLDR locales are available, they would
>>    return, respectively, en-GB and zh-Hant-TW.
>>
>> That being said, often people don't understand language matching, and so
>> we are in the process of adding more information so that there is a direct
>> mapping between locale IDs that are always considered to be
>> "identical" on a deep level, like en-GB and en-Latn-GB.
>>
>>
>> Mark
>>
>> *— Il meglio è l’inimico del bene —*
>>
>> On Fri, Dec 12, 2014 at 5:04 PM, John Emmons wrote:
>>
>>> Yes, Edwin, there is a very good reason we don't want zh-Hant to
>>> inherit from zh. Simply put, in situations where you have locale resources
>>> that aren't 100% populated, allowing zh-Hant to inherit from zh produces a
>>> mixture of simplified and traditional Chinese, which is acceptable to no
>>> one. This is what we call "cross script inheritance" in CLDR. While it
>>> might be acceptable to some in the case of Chinese, it is certainly a
>>> bigger problem in languages like Serbian, where you have both Latin and
>>> Cyrillic scripts in use, and you certainly don't ever want a mixture of
>>> Latin and Cyrillic scripts.
>>>
>>> These relationships are documented in CLDR's supplemental data, where
>>> you have specified:
>>>
>>>
>>> Regards,
>>>
>>> John C. Emmons
>>> Globalization Architect & Unicode CLDR TC Chairman
>>> IBM Software Group
>>> Internet: emmo at us.ibm.com


From rxaviers at gmail.com Fri Dec 12 15:31:50 2014
From: rxaviers at gmail.com (Rafael Xavier)
Date: Fri, 12 Dec 2014 19:31:50 -0200
Subject: Bundle Lookup
In-Reply-To:
References: <548A47BE.5080900@gmail.com>
Message-ID:

Looking forward to hearing how that shall work. Thank you very much so far.

--
+55 (16) 98138-1582, +1 (415) 568-5854, skype: rxaviers
http://rafael.xavier.blog.br


From doug at ewellic.org Fri Dec 12 16:32:28 2014
From: doug at ewellic.org (Doug Ewell)
Date: Fri, 12 Dec 2014 15:32:28 -0700
Subject: Bundle Lookup
Message-ID: <20141212153228.665a7a7059d7ee80bb4d670165c8327d.c7f4fbe0f6.wbe@email03.secureserver.net>

Rafael Xavier wrote:

> Given an arbitrary locale ID, the recommended and only process to
> deduce its respective bundle (reliably) is through Language Matching.
>
> Is that true?
>
> Considering all bundles are always present, isn't there any less
> expensive algorithm that could be recommended?

If your locale IDs really are arbitrary, then the simpler your matching
algorithm is, the less reliable your results will be.
Language tag matching is just not as complicated or expensive as it is made
out to be, especially when compared to things like cryptography and image
processing that people deploy all the time. Lookup is simple enough to be
described in full in two pages of RFC 4647.

--
Doug Ewell | Thornton, CO, USA | http://ewellic.org


From mark at macchiato.com Sat Dec 13 03:10:03 2014
From: mark at macchiato.com (Mark Davis ☕️)
Date: Sat, 13 Dec 2014 10:10:03 +0100
Subject: Bundle Lookup
In-Reply-To: <20141212153228.665a7a7059d7ee80bb4d670165c8327d.c7f4fbe0f6.wbe@email03.secureserver.net>
References: <20141212153228.665a7a7059d7ee80bb4d670165c8327d.c7f4fbe0f6.wbe@email03.secureserver.net>
Message-ID:

On Fri, Dec 12, 2014 at 11:32 PM, Doug Ewell wrote:

> Lookup is simple
> enough to be described in full in two pages of RFC 4647.
>

That lookup algorithm, as far as I'm concerned, has always been just one
example. It is not sophisticated enough to give particularly good results,
and should be avoided in any production system.

As far as the speed goes, there are definitely optimizations that can be put
into place to shortcut the full matching process in CLDR.

+icu-support at lists.sourceforge.net

Mark

*— Il meglio è l’inimico del bene —*


From patch at cpan.org Tue Dec 23 14:18:33 2014
From: patch at cpan.org (Nick Patch)
Date: Tue, 23 Dec 2014 15:18:33 -0500
Subject: CLDR TL;DR article
Message-ID:

I wrote a short article on programming with the CLDR, which was published
in the Perl Advent Calendar today.

http://perladvent.org/2014/2014-12-23.html

--
Nick Patch
@nickpatch


From verdy_p at wanadoo.fr Wed Dec 24 05:55:59 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 24 Dec 2014 12:55:59 +0100
Subject: CLDR TL;DR article
In-Reply-To:
References:
Message-ID:

That article about Locale::CLDR gives an example of bad usage with:

- fr: «foo», «bar» et «baz»

In this case the quotation marks are not enough in French: there MUST also
be some non-breaking whitespace (preferably the thin non-breaking space)
after the opening quotation mark and before the closing mark.

Unfortunately the CLDR data only accepts 1 character for these marks, when
we should expect to find the THINSP character as well.

The THINSP is used before the exclamation mark, the question mark, the colon
and the semicolon (i.e. all punctuation signs made with more than 1
connected glyph). That thin space should also be present beside all dashes
not connecting two words (unlike English, which prefers no whitespace at all
with the em dash), and it is also the standard whitespace used as the
separator for grouping digits (numeral quantities, phone numbers...).

And that library does not correct that...

(Note that on systems that cannot accept THINSP for French, the fallback can
be NBSP, or a standard SPACE, but NEVER the absence of whitespace like in
English).

Have a nice Christmas with your family. See you back in two days.


From jkorpela at cs.tut.fi Wed Dec 24 06:49:21 2014
From: jkorpela at cs.tut.fi (Jukka K. Korpela)
Date: Wed, 24 Dec 2014 14:49:21 +0200
Subject: CLDR TL;DR article
In-Reply-To:
References:
Message-ID: <549AB651.1090908@cs.tut.fi>

2014-12-24, 13:55, Philippe Verdy wrote, commenting on an announcement of
http://perladvent.org/2014/2014-12-23.html :

> That article about the Locale::CLDR gives an example of bad usage with:
>
>   * fr: «foo», «bar» et «baz»

I agree, but I think it’s a more serious mistake to have

ur: ?foo?? ?bar?? ??? ?baz?

As far as I know, Urdu is written right to left, so the order of words is
all wrong.

> In this case the quotations marks are not enough in French, there MUST
> also be some non-breaking whitespace (preferably the thin non-breaking
> space) after the opening quotation mark, and before the closing mark.

This is a longstanding issue with no clear solution so far. In plain text,
you can choose between SPACE, NO-BREAK SPACE, one of the “fixed-width”
spaces like THIN SPACE, and the NARROW NO-BREAK SPACE. The “fixed-width”
spaces (which largely aren’t fixed-width in reality) are by definition
compatibility equivalent to SPACE, with its line breaking behavior. The
NARROW NO-BREAK SPACE would seem ideal, but it has really been designed for
a different purpose and there is no reason to expect that its width
corresponds to that of espace fine insécable in French typography;
moreover, its availability in fonts is limited, and it may still cause a
symbol of undisplayable character to appear, which is surely worse than a
space of any width, or no space.

In rich text, there are many things you can do to control the width and the
line breaking behavior.

> Unfortunately the CLDR data only accepts 1 character for these marks
> when we should expect to find also the THINSP character

Is it so? In any case, a more fundamental problem is what string you would
put there. It should indicate spacing, but considerably less than a normal
space, and it should be non-breakable.
I would make non-breakability the main concern, and between no spacing and a
full space, I’d prefer no spacing. But… The pages of l’Académie française
use SPACE, so I guess it cannot be an entirely wrong approach, even though
it looks rather strange to me.

> (Note that on systems that cannot accept THINSP for French, the fallback
> can be NBSP, or a standard SPACE, but NEVER the absence of whitespace
> like in English).

Is that a rule that has officially been declared somewhere? When reading,
say,

« An deux mil » ou « an deux mille » ?

on the Academy pages, I find it confusing that it looks like “ou” were a
quoted string, and I would really prefer

«An deux mil» ou «an deux mille» ?

for reasons of clarity and typography. A punctuation mark isolated by full
spaces looks so lonely, though at the end of a sentence, it might be
acceptable.

On the other hand, on a page that summarizes CLDR principles, I think the
example should reflect what CLDR actually has, rather than what it should
have. Although a note could be made about spacing issues in French, I think
the only mistake in this area on the page is the wrong writing direction for
Urdu; it might even be construed as claiming that CLDR suggests or requires
such directionality!

Yucca


From verdy_p at wanadoo.fr Wed Dec 24 07:35:02 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 24 Dec 2014 14:35:02 +0100
Subject: CLDR TL;DR article
In-Reply-To: <549AB651.1090908@cs.tut.fi>
References: <549AB651.1090908@cs.tut.fi>
Message-ID:

2014-12-24 13:49 GMT+01:00 Jukka K. Korpela :

> This is a longstanding issue with no clear solution so far. In plain text,
> you can choose between SPACE, NO-BREAK SPACE, one of the “fixed-width”
> spaces like THIN SPACE, and the NARROW NO-BREAK SPACE. The “fixed-width”
> spaces (which largely aren’t fixed-width in reality) are by definition
> compatibility equivalent to SPACE, with its line breaking behavior. The
> NARROW NO-BREAK SPACE would seem ideal, but it has really been designed for
> a different purpose and there is no reason to expect that its width
> corresponds to that of espace fine insécable in French typography;
> moreover, its availability in fonts is limited, and it may still cause a
> symbol of undisplayable character to appear, which is surely worse than a
> space of any width, or no space.
>

Definitely THINSP is THE most correct space to use; other spaces (NBSP and
standard SPACE) are just best-fit fallbacks meant to be used where Unicode
is not usable, but the absence of any space is definitely wrong. (This is no
longer an issue in HTML, but may remain an issue when converting from
Unicode to legacy 8-bit codepages where only NBSP is present: ISO 8859-1,
Windows-1252 or CP850, the 3 most used legacy charsets for French.)

THINSP is rendered correctly now by all current versions of OpenType
renderers. The world now speaks Unicode in all new applications all over the
web, and even in databases; fallbacks may only be needed for some console
applications using those legacy charsets (but console drivers are handling
these fallbacks, or should do so).

So the only thing I personally don't like in CLDR is that it restricts those
punctuations to only one Unicode character (this restriction is quite..
ahemmm.... stupid). If you want to preserve compatibility, this is ONLY for
applications that do not "speak" Unicode at all (not even UTF-8, which
already generates multibyte sequences not fitting in one of their 8-bit
"characters"), and in that case those single-character punctuations for
legacy apps should be restricted to ASCII only, and you'll have problems
with most Asian languages or with Armenian, Arabic, Persian, Urdu...: they
will need "multibyte" strings in their legacy apps for their punctuations,
and will be forced to use either UTF-8 for non-ASCII characters, or various
non-portable legacy 8-bit charsets.

Personally I think it's up to the adapters processing the CLDR data to
provide these fallbacks (and for French the fallback from THINSP to an empty
string is always wrong, even if this may be a good choice for English, whose
typographic thin space was traditionally narrower at 1/8th em, instead of
1/6th to 1/4th em in French typography).


From emuller at adobe.com Wed Dec 24 09:52:28 2014
From: emuller at adobe.com (Eric Muller)
Date: Wed, 24 Dec 2014 07:52:28 -0800
Subject: CLDR TL;DR article
In-Reply-To: <549AB651.1090908@cs.tut.fi>
References: <549AB651.1090908@cs.tut.fi>
Message-ID: <549AE13C.8000903@adobe.com>

On 12/24/2014 4:49 AM, Jukka K. Korpela wrote:
>
> Is that a rule that has officially been declared somewhere?

Fortunately, there is no such thing as "official" typographic rules. Two
sources of recommendations:

The "Lexique des règles typographiques en usage à l'Imprimerie nationale"
adopts an "espace fine insécable" before ; ! and ? but an "espace mot
insécable" before : and » and after «. It puts an "espace justifiante"
around —.

"Le Correcteur Typographe" by L.-E. Brossard has a whole chapter on the
subject (starting at page 309 of volume 2), with the most relevant
discussion starting on page 317. It recommends 2/3 of the word space after «
and before ».
It also has a bit more detail, such as a fixed space between ? and ?, a "demi-cadratin" after a ? that is repeated at the start of lines, a "demi-cadratin" after the ? in dialogs, as well as in the form '? ?' of dialogs.

In terms of Unicode in plain text, I agree with Philippe that some kind of space character is necessary in all those places where a typographic space is to appear.

In my reprints of public domain texts (http://efele.net/ebooks), after having dealt with more than 100K pages of 19th century typography, I settled on using U+0020 SPACE in the source texts, with a processing step at publishing time that replaces those with U+202F NARROW NO-BREAK SPACE when appropriate. This processing is conservative and on occasion I use a U+202F NARROW NO-BREAK SPACE in the source text. This approach is viable only because I control both the source texts and the processing; in some sense I have a "private" meaning of U+0020 SPACE, which I justify (the meaning, not the space) by the immense simplification of the editing of the source texts it provides.

Eric.

From patrick.andries at xcential.com Wed Dec 24 10:58:56 2014
From: patrick.andries at xcential.com (Patrick Andries)
Date: Wed, 24 Dec 2014 11:58:56 -0500
Subject: CLDR TL;DR article
In-Reply-To: <549AE13C.8000903@adobe.com>
References: <549AB651.1090908@cs.tut.fi> <549AE13C.8000903@adobe.com>
Message-ID: <549AF0D0.6010902@xcential.com>

On 24 Dec 2014 at 10:52, Eric Muller wrote:
>
> In my reprints of public domain texts (http://efele.net/ebooks),
> after having dealt with more than 100K pages of 19th century typography.

Congratulations, Éric, on this fine initiative... 100,000 pages!

P. A.

From srl at icu-project.org Wed Dec 24 12:15:47 2014
From: srl at icu-project.org (Steven R.
Loomis)
Date: Wed, 24 Dec 2014 10:15:47 -0800
Subject: CLDR TL;DR article
In-Reply-To:
References:
Message-ID:

Philippe,
I don't read in http://www.unicode.org/reports/tr35/tr35-general.html#ListPatterns where only one character is allowed. Is it a Survey Tool or test limitation? Is there a bug filed if so?

S

Sent from our iPhone.

> On Dec 24, 2014, at 3:55 AM, Philippe Verdy wrote:
>
> That article about the Locale::CLDR gives an example of bad usage with:
>
> fr: «foo», «bar» et «baz»
>
> In this case the quotation marks are not enough in French; there MUST also be some non-breaking whitespace (preferably the thin non-breaking space) after the opening quotation mark, and before the closing mark.
>
> Unfortunately the CLDR data only accepts 1 character for these marks when we should expect to find also the THINSP character
>
> The THINSP is used before the exclamation mark, the question mark, the colon and the semi-colon (i.e. all punctuation signs made with more than 1 connected glyph). That thin space should also be present beside all dashes not connecting two words (unlike English, which prefers no whitespace at all with em dash), and it is also the standard whitespace used as the separator for grouping digits (numeral quantities, phone numbers...).
>
> And that library does not correct that...
>
> (Note that on systems that cannot accept THINSP for French, the fallback can be NBSP, or a standard SPACE, but NEVER the absence of whitespace like in English).
>
> Have a nice Christmas in family. See you back in two days.
>
> 2014-12-23 21:18 GMT+01:00 Nick Patch :
>> I wrote a short article on programming with the CLDR, which was published in the Perl Advent Calendar today.
>> >> http://perladvent.org/2014/2014-12-23.html >> >> -- >> Nick Patch >> @nickpatch >> >> _______________________________________________ >> CLDR-Users mailing list >> CLDR-Users at unicode.org >> http://unicode.org/mailman/listinfo/cldr-users > > _______________________________________________ > CLDR-Users mailing list > CLDR-Users at unicode.org > http://unicode.org/mailman/listinfo/cldr-users -------------- next part -------------- An HTML attachment was scrubbed... URL: From cameron at lumoslabs.com Wed Dec 24 13:02:41 2014 From: cameron at lumoslabs.com (Cameron Dutro) Date: Wed, 24 Dec 2014 11:02:41 -0800 Subject: CLDR TL;DR article In-Reply-To: References: Message-ID: Nice article Nick, thanks for sending it out! Happy to see CLDR Perl development continuing forward :) -Cameron On Wed, Dec 24, 2014 at 10:15 AM, Steven R. Loomis wrote: > Philippe, > I don't read in > http://www.unicode.org/reports/tr35/tr35-general.html#ListPatterns where > only one character is allowed, is it a survey tool or test limitation? Is > there a bug filed if so? > > S > > Enviado desde nuestro iPhone. > > El dic 24, 2014, a las 3:55 AM, Philippe Verdy > escribi?: > > That article about the Locale::CLDR gives an example of bad usage with: > > > - > > fr: ?foo?, ?bar? et ?baz? > > In this case the quotations marks are not enough in French, there MUST > also be some non-breaking whitespace (preferably the thin non-breaking > space) after the opening quotation mark, and before the closing mark. > > Unfortunately the CLDR data only accepts 1 character for these marks when > we should expect to find also the THINSP character > > The THINSP is used before the exclamation mark, the question mark, the > colon and the semi-colon (i.e. all punctuation signs made with more than 1 > connected glyph). 
That thin space should also be present beside all dashes > not connecting two words; unlike English that prefers no whitespace at all > with em dash), and it is also the standard whitespace used as the separator > for grouping digits (numeral quantities, phone numbers...). > > And that library does not correct that... > > (Note that on systems that cannot accept THINSP for French, the fallback > can be NBSP, or a standard SPACE, but NEVER the absence of whitespace > like in English). > > Have a nice Christmas in family. See you back in two days. > > 2014-12-23 21:18 GMT+01:00 Nick Patch : > >> I wrote a short article on programming with the CLDR, which was published >> in the Perl Advent Calendar today. >> >> http://perladvent.org/2014/2014-12-23.html >> >> -- >> Nick Patch >> @nickpatch >> >> _______________________________________________ >> CLDR-Users mailing list >> CLDR-Users at unicode.org >> http://unicode.org/mailman/listinfo/cldr-users >> >> > _______________________________________________ > CLDR-Users mailing list > CLDR-Users at unicode.org > http://unicode.org/mailman/listinfo/cldr-users > > > _______________________________________________ > CLDR-Users mailing list > CLDR-Users at unicode.org > http://unicode.org/mailman/listinfo/cldr-users > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Wed Dec 24 14:19:01 2014 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Wed, 24 Dec 2014 21:19:01 +0100 Subject: CLDR TL;DR article In-Reply-To: References: Message-ID: Nice article! {phone} On Dec 23, 2014 10:02 PM, "Nick Patch" wrote: > I wrote a short article on programming with the CLDR, which was published > in the Perl Advent Calendar today. 
>
> http://perladvent.org/2014/2014-12-23.html
>
> --
> Nick Patch
> @nickpatch

From cameron at lumoslabs.com Tue Dec 30 13:21:51 2014
From: cameron at lumoslabs.com (Cameron Dutro)
Date: Tue, 30 Dec 2014 11:21:51 -0800
Subject: Unicode Regex Question
Message-ID:

Hey cldr-users,

I'm looking at this entry in CLDR transforms. I'm curious why that "$" character is inside the character class. Here's the line reproduced:

$makeRight = [[:Z:][:Ps:][:Pi:]$] ;

I see an outer character class that contains three internal Unicode character sets and a literal dollar sign. Usually in regular expressions, the dollar sign is used to match the end of the string. When it's included in a character class, however, it should be interpreted as a literal character.

Was including the dollar sign in the character class intentional? Should it be treated as an end-of-string anchor or a literal character?

-Cameron

From mark at macchiato.com Tue Dec 30 13:40:36 2014
From: mark at macchiato.com (Mark Davis ☕)
Date: Tue, 30 Dec 2014 20:40:36 +0100
Subject: Unicode Regex Question
In-Reply-To:
References:
Message-ID:

$ has a special meaning in the transforms; it means the end of string (either end). Unlike normal regex, however, it can occur in character classes, e.g. [[a$b][:script=greek:]]

Mark

*« Il meglio è l'inimico del bene »*

On Tue, Dec 30, 2014 at 8:21 PM, Cameron Dutro wrote:

> Hey cldr-users,
>
> I'm looking at this entry
> in CLDR transforms. I'm curious why that "$" character is inside the
> character class.
Here's the line reproduced: > > $makeRight = [[:Z:][:Ps:][:Pi:]$] ; > > I see an outer character class that contains three internal unicode > character sets and a literal dollar sign. Usually in regular expressions, > the dollar sign is used to match the end of the string. When it's included > in a character class however, it should be interpreted as a literal > character. > > Was including the dollar sign in the character class intentional? Should > it be treated as an end-of-string anchor or a literal string? > > -Cameron > > _______________________________________________ > CLDR-Users mailing list > CLDR-Users at unicode.org > http://unicode.org/mailman/listinfo/cldr-users > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From cameron at lumoslabs.com Tue Dec 30 17:22:12 2014 From: cameron at lumoslabs.com (Cameron Dutro) Date: Tue, 30 Dec 2014 15:22:12 -0800 Subject: Unicode Regex Question In-Reply-To: References: Message-ID: Thanks Mark. Is that documented anywhere? -Cameron On Tue, Dec 30, 2014 at 11:40 AM, Mark Davis ?? wrote: > $ has a special meaning in the transforms; it means the end of string > (either end). Unlike normal regex, however, it can occur in character > classes, eg [[a$b][:script=greek:]] > > > Mark > > *? Il meglio ? l?inimico del bene ?* > > On Tue, Dec 30, 2014 at 8:21 PM, Cameron Dutro > wrote: > >> Hey cldr-users, >> >> I'm looking at this entry >> >> in CLDR transforms. I'm curious why that "$" character is inside the >> character class. Here's the line reproduced: >> >> $makeRight = [[:Z:][:Ps:][:Pi:]$] ; >> >> I see an outer character class that contains three internal unicode >> character sets and a literal dollar sign. Usually in regular expressions, >> the dollar sign is used to match the end of the string. When it's included >> in a character class however, it should be interpreted as a literal >> character. >> >> Was including the dollar sign in the character class intentional? 
Should >> it be treated as an end-of-string anchor or a literal string? >> >> -Cameron >> >> _______________________________________________ >> CLDR-Users mailing list >> CLDR-Users at unicode.org >> http://unicode.org/mailman/listinfo/cldr-users >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From cameron at lumoslabs.com Tue Dec 30 17:26:00 2014 From: cameron at lumoslabs.com (Cameron Dutro) Date: Tue, 30 Dec 2014 15:26:00 -0800 Subject: Unicode Regex Question In-Reply-To: References: Message-ID: Also, would it be fair to say simply removing the outer set of square brackets and treating the entire thing as a regex is correct? It doesn't make sense to me to have these transform rules be "almost" regexes except for this one "$" exception, especially given "$"'s special significance in regexes. -Cameron On Tue, Dec 30, 2014 at 3:22 PM, Cameron Dutro wrote: > Thanks Mark. Is that documented anywhere? > > -Cameron > > On Tue, Dec 30, 2014 at 11:40 AM, Mark Davis [image: ?]? < > mark at macchiato.com> wrote: > >> $ has a special meaning in the transforms; it means the end of string >> (either end). Unlike normal regex, however, it can occur in character >> classes, eg [[a$b][:script=greek:]] >> >> >> Mark >> >> *? Il meglio ? l?inimico del bene ?* >> >> On Tue, Dec 30, 2014 at 8:21 PM, Cameron Dutro >> wrote: >> >>> Hey cldr-users, >>> >>> I'm looking at this entry >>> >>> in CLDR transforms. I'm curious why that "$" character is inside the >>> character class. Here's the line reproduced: >>> >>> $makeRight = [[:Z:][:Ps:][:Pi:]$] ; >>> >>> I see an outer character class that contains three internal unicode >>> character sets and a literal dollar sign. Usually in regular expressions, >>> the dollar sign is used to match the end of the string. When it's included >>> in a character class however, it should be interpreted as a literal >>> character. >>> >>> Was including the dollar sign in the character class intentional? 
Should
>>> it be treated as an end-of-string anchor or a literal string?
>>>
>>> -Cameron

From verdy_p at wanadoo.fr Tue Dec 30 18:40:28 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 31 Dec 2014 01:40:28 +0100
Subject: Unicode Regex Question
In-Reply-To:
References:
Message-ID:

I do agree. The $ is just a common shortcut that represents a condition which could also be given more explicitly, with a named pseudo-class.

Your example with "[[a$b][:script=greek:]]" does not make any sense if that $ means an "end of string" and is embedded in a character class that is itself inside another character class.

I would rather expect something like "[[ab][:script=greek:]]|$" or "[[ab][:script=greek:][:eos:]]", where [:eos:] matches no character but only at end of string (also at start of string??? In that case it is mixing two different kinds of matching: a precontext instead of a postcontext).

Regexps should also have a better and more explicit notation for precontexts and postcontexts (including with non-empty matched contents).

And the beginning/end of strings/texts is not the only boundary needed in those pre/post-contexts; there are other interesting ones, notably start/end of words (depending on locale-sensitive definitions of word boundaries), or even start/end of sentences (here also locale dependent).
Possibly we could also have regexps defining their own custom boundary conditions, assigned locally with a symbolic name (internally the regexp engine would parse all those conditions in parallel to the main parsing, to determine when they raise their condition flag to true).

E.g. [:^hex=[0-9a-f]+] where internally the defined "hex" boundary is a normal regexp matched greedily in both directions, backward and forward. Then we could reuse that condition in several other places of the regexp with "[:^hex]" (if they are reused, they are still matched separately at distinct positions and their matched content need not be equal in each instance).

With a similar system we could also define named subregexps such as [:$hex=[0-9a-f]+] defining the "[:$hex]" subregexp.

The previous custom boundary could also be defined as [:^hex=[:$hex]]. Here we see that this "$" would be more useful and would not imply any "end of string" meaning (but it would deviate from legacy regexps where this $ is taken literally in character classes). In that last example the custom boundary and the defined subregexp are given the same "hex" name separately.

But we could also say that they automatically share the same namespace, so that any defined subregexp would also be the name of a defined boundary. To use them, the first character $ or ^ after "[:" indicates whether we mean an expansion of a subregexp whose matched content will be part of the outer matched content, or a condition matched internally only with a testable flag but not included in the outer matched content.

Then the start/end of string condition is nothing else than the evaluation of the custom boundary condition "[:^eos]", definable as "[:^eos=.*]" and matched greedily by default; also here I'm assuming that "."
matches any character, including newlines if they are part of the content of a "string"; otherwise you'll need to define the "eos" custom boundary as "[:^eos=([\r\n]|.)*]"

2014-12-31 0:26 GMT+01:00 Cameron Dutro :

> Also, would it be fair to say simply removing the outer set of square
> brackets and treating the entire thing as a regex is correct? It doesn't
> make sense to me to have these transform rules be "almost" regexes except
> for this one "$" exception, especially given "$"'s special significance in
> regexes.
>
> -Cameron
>
> On Tue, Dec 30, 2014 at 3:22 PM, Cameron Dutro
> wrote:
>
>> Thanks Mark. Is that documented anywhere?
>>
>> -Cameron
>>
>> On Tue, Dec 30, 2014 at 11:40 AM, Mark Davis ☕ <
>> mark at macchiato.com> wrote:
>>
>>> $ has a special meaning in the transforms; it means the end of string
>>> (either end). Unlike normal regex, however, it can occur in character
>>> classes, eg [[a$b][:script=greek:]]
>>>
>>>
>>> Mark
>>>
>>> *« Il meglio è l'inimico del bene »*
>>>
>>> On Tue, Dec 30, 2014 at 8:21 PM, Cameron Dutro
>>> wrote:
>>>
>>>> Hey cldr-users,
>>>>
>>>> I'm looking at this entry
>>>>
>>>> in CLDR transforms. I'm curious why that "$" character is inside the
>>>> character class. Here's the line reproduced:
>>>>
>>>> $makeRight = [[:Z:][:Ps:][:Pi:]$] ;
>>>>
>>>> I see an outer character class that contains three internal unicode
>>>> character sets and a literal dollar sign. Usually in regular expressions,
>>>> the dollar sign is used to match the end of the string. When it's included
>>>> in a character class however, it should be interpreted as a literal
>>>> character.
>>>>
>>>> Was including the dollar sign in the character class intentional?
>>>> Should it be treated as an end-of-string anchor or a literal string?
>>>> >>>> -Cameron >>>> >>>> _______________________________________________ >>>> CLDR-Users mailing list >>>> CLDR-Users at unicode.org >>>> http://unicode.org/mailman/listinfo/cldr-users >>>> >>>> >>> >> > > _______________________________________________ > CLDR-Users mailing list > CLDR-Users at unicode.org > http://unicode.org/mailman/listinfo/cldr-users > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: emoji_u2615.png Type: image/png Size: 1890 bytes Desc: not available URL: From cameron at lumoslabs.com Tue Dec 30 20:35:44 2014 From: cameron at lumoslabs.com (Cameron Dutro) Date: Tue, 30 Dec 2014 18:35:44 -0800 Subject: Unicode Regex Question In-Reply-To: References: Message-ID: Thanks Philippe, the [:eos:] pseudo class looks much less ambiguous than the "$" character, thanks for your thorough writeup. What would be process be for getting your change reviewed/accepted? -Cameron On Tue, Dec 30, 2014 at 4:40 PM, Philippe Verdy wrote: > I do agree. The $ is just a common shortcut that represents a condition > which could be also be given more explicitly, with a named pseudo-class. > > Your example with "[[a$b][:script=greek:]]" does not make any sense if > that $ means an "end of string" and where it is embedded in a character > class itself in another embedding character-class. > > I would better expect something like "[[ab][:script=greek:]]|$" or " > [[ab][:script=greek:][:eos:]]" where [:eos:] matches no character but > only at end of string (also at start of string ???, for that it is mixing > two different kinds of matching: a precontext instead of a post-context) > > Regexps should also have a better and more explicit notation for > precontexts and postcontexts (including with non-empty matched contents). 
> > And the being/end of strings/texts is not the only boundary needed in > those pre/post-contexts, there are other interesting ones, notably > start/end of words (depending on locale-sensitive definitions of word > boundaries), or even start/end of sentences (here also locale dependant). > > Possibly we could also have regexps defining their own custom boundary > conditions, assigned locally with a symbolic name (internally the regexp > engine would parse all those conditions in parallel to the main parsing, to > define when they raise up their condition flag to true. > > E.g. [:^hex=[0-9a-f]+] where internally the defined "hex" boundary is a > normal regexp matched greedily in both directions, backward and forward). > then we could reuse that condition in several other places of the regexp > with "[:^hex]" > (if they are reused, they are still matched separately on distinct > positions and do not have theur matched content equal in each instance). > > With a similar system we could also define named subregexps such as > [:$hex=[0-9a-f]+] defining the "[:$hex]" subregexp. > > The previous custom boundary could also be defined as [:^hex=[:$hex]]. > Here we see that this "$" would be more useful and would not imply any"end > of string" meaning (but it would deviate from legacy regexps where this $ > is taken litterally in character classes). In that last example the custom > boundary and the defined subregexp are given the same "hex" name separately. > > But we could also say that they automatically share the same namespace, so > that any defined subregexp would also be the name of a defined boundary, to > use them the first character $ or ^ after "[:" is used to see if we mean an > expansion of a subregexp whose matched content will be part of the outer > matched content, or if we mean a condition matched internally ony with a > testable flag but not included in the outer matched content. 
> > Then the start/end of string condition is nothing else than the evaluation > of the custom boundary condition "[:^eos]", definabled as "[:^eos=.*]" and > matched greedly by default; also here I'm assuming that "." matches any > character, including newlines if they are part of the content of a > "string", otherwise you'll need to define the "eos" custom boundary as > "[:^eos=([\r\n]|.)*]" > > > > > > 2014-12-31 0:26 GMT+01:00 Cameron Dutro : > >> Also, would it be fair to say simply removing the outer set of square >> brackets and treating the entire thing as a regex is correct? It doesn't >> make sense to me to have these transform rules be "almost" regexes except >> for this one "$" exception, especially given "$"'s special significance in >> regexes. >> >> -Cameron >> >> On Tue, Dec 30, 2014 at 3:22 PM, Cameron Dutro >> wrote: >> >>> Thanks Mark. Is that documented anywhere? >>> >>> -Cameron >>> >>> On Tue, Dec 30, 2014 at 11:40 AM, Mark Davis [image: ?]? < >>> mark at macchiato.com> wrote: >>> >>>> $ has a special meaning in the transforms; it means the end of string >>>> (either end). Unlike normal regex, however, it can occur in character >>>> classes, eg [[a$b][:script=greek:]] >>>> >>>> >>>> Mark >>>> >>>> *? Il meglio ? l?inimico del bene ?* >>>> >>>> On Tue, Dec 30, 2014 at 8:21 PM, Cameron Dutro >>>> wrote: >>>> >>>>> Hey cldr-users, >>>>> >>>>> I'm looking at this entry >>>>> >>>>> in CLDR transforms. I'm curious why that "$" character is inside the >>>>> character class. Here's the line reproduced: >>>>> >>>>> $makeRight = [[:Z:][:Ps:][:Pi:]$] ; >>>>> >>>>> I see an outer character class that contains three internal unicode >>>>> character sets and a literal dollar sign. Usually in regular expressions, >>>>> the dollar sign is used to match the end of the string. When it's included >>>>> in a character class however, it should be interpreted as a literal >>>>> character. >>>>> >>>>> Was including the dollar sign in the character class intentional? 
>>>>> Should it be treated as an end-of-string anchor or a literal string?
>>>>>
>>>>> -Cameron

From mark at macchiato.com Wed Dec 31 03:27:15 2014
From: mark at macchiato.com (Mark Davis ☕)
Date: Wed, 31 Dec 2014 10:27:15 +0100
Subject: Unicode Regex Question
In-Reply-To:
References:
Message-ID:

On Wed, Dec 31, 2014 at 1:40 AM, Philippe Verdy wrote:

> Your example with "[[a$b][:script=greek:]]" does not make any sense if
> that $ means an "end of string" and where it is embedded in a character
> class itself in another embedding character-class.

That is incorrect. The way the transform works, any reference to a character position outside the bounds of a string matches $. So what I wrote matches the start or end of a string, or a, or b, or any Greek-script character.

However, if you look at the transform data files, you'll see real cases where $ is used, rather than the artificial one I used.

As to the rest of your post, tl;dr.

Mark

*« Il meglio è l'inimico del bene »*

From verdy_p at wanadoo.fr Wed Dec 31 04:02:31 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 31 Dec 2014 11:02:31 +0100
Subject: Unicode Regex Question
In-Reply-To:
References:
Message-ID:

No, the way it is written it is really a literal $ or a or b or a Greek character.
And yes, you used a notation embedding two character classes within another character class to create a union. However $ (if it means an end of string) cannot be part of that union and cannot even be part of a character class, as it is then not a character itself but a boundary condition.

So yes, your extension is very confusing (in addition to being incoherent and not general enough to handle various boundary conditions).

TL;DR: it was another proposal making a BETTER use of the $ for something else more productive, about how regexps can be embedded into a special syntax allowing the definition of any custom boundary conditions, including end of strings or other boundaries (and also not limited to properties defined in the UCD). It is a generalisation of the concept, which will be used everywhere Unicode properties are not sufficient, and without necessarily needing the addition of new properties to handle specific locales (for example these boundaries could be used in CLDR data instead of the UCD, or in specific locales not supported by CLDR).

2014-12-31 10:27 GMT+01:00 Mark Davis ☕ :

>
> On Wed, Dec 31, 2014 at 1:40 AM, Philippe Verdy
> wrote:
>
>> Your example with "[[a$b][:script=greek:]]" does not make any sense if
>> that $ means an "end of string" and where it is embedded in a character
>> class itself in another embedding character-class.
>>
>
> That is incorrect. The way the transform works, any reference to a
> character position outside the bounds of a string matches $. So what I
> wrote matches the start or end of a string, or a, or b, or any greek-script
> character.
>
> However, if you look at the transform data files, you'll see real cases
> where $ is used, rather than the artificial one I used.
>
> As to the rest of your post, tl;dr.
>
> Mark
>
> *« Il meglio è l'inimico del bene »*
>
From mark at macchiato.com Wed Dec 31 04:51:54 2014
From: mark at macchiato.com (Mark Davis ☕)
Date: Wed, 31 Dec 2014 11:51:54 +0100
Subject: Unicode Regex Question
In-Reply-To:
References:
Message-ID:

> No the way it is written is really a literal $ or a or b or a Greek character.

Philippe, you are once again not listening.

The $ in CLDR transforms is NOT the same as $ in regex. I do know what I'm talking about here: Alan Liu and I designed this (though years ago).

Now, there is a defect in the LDML documentation, in that the $ is not described fully. For that, people can look at the ICU documentation (from which LDML gets the transform syntax):

http://userguide.icu-project.org/transforms/general/rules#TOC-ther

Cameron, would you mind filing a CLDR ticket to update and expand the documentation?

Mark

*« Il meglio è l'inimico del bene »*

On Wed, Dec 31, 2014 at 11:02 AM, Philippe Verdy wrote:

> No the way it is written is really a literal $ or a or b or a Greek
> character.
> And yes you used a notation embedding two character classes within another
> character class to create an union. However $ (if it means an end of
> string) cannot be part of that union and cannot even be part of a character
> class as it is is then not a character itself but a boundary condition.
>
> So yes your extension is very confusing (in addition of being incoherent
> and not enough general to handle various boundary conditions)
>
> TL;DR: it was another proposal making a BETTER use of the $ for something
> else more productive and about how regexp can be embedded into a special
> syntax allowing to define any custom boundary conditions including end of
> strings, or other boundaries (and also not limited to properties defined
> with properties in the UCD.
It is a generalisation of the concept; which
> will be used everywhere Unicode properties are not sufficient, and without
> necessarily needing addition of new properties to handle specific locales
> (for example these boundaries could be used in CLDR data instead of the
> UCD, or in specific locales not supported by CLDR).
>
>
> 2014-12-31 10:27 GMT+01:00 Mark Davis ☕ :
>
>>
>> On Wed, Dec 31, 2014 at 1:40 AM, Philippe Verdy
>> wrote:
>>
>>> Your example with "[[a$b][:script=greek:]]" does not make any sense if
>>> that $ means an "end of string" and where it is embedded in a character
>>> class itself in another embedding character-class.
>>>
>>
>> That is incorrect. The way the transform works, any reference to a
>> character position outside the bounds of a string matches $. So what I
>> wrote matches the start or end of a string, or a, or b, or any greek-script
>> character.
>>
>> However, if you look at the transform data files, you'll see real cases
>> where $ is used, rather than the artificial one I used.
>>
>> As to the rest of your post, tl;dr.
>>
>> Mark
>>
>> *« Il meglio è l'inimico del bene »*
>>
>

From patch at cpan.org Wed Dec 31 08:31:27 2014
From: patch at cpan.org (Nick Patch)
Date: Wed, 31 Dec 2014 09:31:27 -0500
Subject: Unicode Regex Question
In-Reply-To:
References:
Message-ID:

On 31 December 2014 at 05:51, Mark Davis ☕ wrote:

> The $ in CLDR transforms is NOT the same as $ in regex.

Considering that transform syntax shares some common elements with regex syntax, it might be good to document that regular expressions are not supported in transforms. They both share Unicode sets (character classes), but the similarities stop there.
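To make the difference from regex concrete, here is a minimal sketch of the behavior Mark describes earlier in the thread, where any position outside the bounds of the string matches "$" and "$" may therefore sit inside a set. It is not ICU's or CLDR's implementation; the names `BOUNDARY`, `char_at`, `in_make_right` and `context_matches` are hypothetical, and the set membership test is only a rough stand-in for [[:Z:][:Ps:][:Pi:]$] built on Python's `unicodedata` general categories.

```python
import unicodedata

# Sentinel standing for "a position outside the string" (hypothetical name).
BOUNDARY = object()


def char_at(text, index):
    """Return the character at index, or BOUNDARY when index is out of range."""
    if 0 <= index < len(text):
        return text[index]
    return BOUNDARY


def in_make_right(item):
    """Rough stand-in for the set [[:Z:][:Ps:][:Pi:]$]."""
    if item is BOUNDARY:
        return True  # the "$" member: either end of the string
    category = unicodedata.category(item)
    # [:Z:] covers Zs, Zl and Zp; [:Ps:] and [:Pi:] are single categories.
    return category.startswith("Z") or category in ("Ps", "Pi")


def context_matches(text, index):
    """True when the item at index (possibly off either end) is in the set."""
    return in_make_right(char_at(text, index))
```

Under these assumptions, `context_matches("ab", -1)` and `context_matches("ab", 2)` both succeed because the position is off the string, while `context_matches("ab", 0)` fails because "a" is an ordinary letter.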
> Cameron, would you mind filing a CLDR ticket to update and expand the documentation?

I created a ticket yesterday when I noticed that this was undocumented in UTS #35:

http://unicode.org/cldr/trac/ticket/8085

Nick

From mark at macchiato.com Wed Dec 31 12:28:44 2014
From: mark at macchiato.com (Mark Davis ☕)
Date: Wed, 31 Dec 2014 19:28:44 +0100
Subject: Unicode Regex Question
In-Reply-To:
References:
Message-ID:

{phone}
On Dec 31, 2014 3:31 PM, "Nick Patch" wrote:
>
> On 31 December 2014 at 05:51, Mark Davis ☕ wrote:
>
> > The $ in CLDR transforms is NOT the same as $ in regex.
>
> Considering that transform syntax shares some common elements with regex syntax, it might be good to document that regular expressions are not supported in transforms. They both share Unicode sets (character classes), but the similarities stop there.

Good idea. It is more than just the sets, but is a very limited subset of regex operations, plus some special features.

> > Cameron, would you mind filing a CLDR ticket to update and expand the documentation?
>
> I created a ticket yesterday when I noticed that this was undocumented in UTS #35:
>
> http://unicode.org/cldr/trac/ticket/8085
>
> Nick

From srl at icu-project.org Wed Dec 31 12:49:05 2014
From: srl at icu-project.org (Steven R. Loomis)
Date: Wed, 31 Dec 2014 10:49:05 -0800
Subject: Unicode Regex Question
In-Reply-To:
References:
Message-ID:

http://www.unicode.org/reports/tr35/tr35-general.html#Transform_Rules_Syntax says they are similar. I couldn't find anywhere that says they ARE regexes. Stronger warnings would be fine. ICU's docs should be removed and just point to CLDR.

S

Sent from our iPhone.

> On Dec 31, 2014, at 10:28 AM, Mark Davis ☕ wrote:
>
> {phone}
> On Dec 31, 2014 3:31 PM, "Nick Patch" wrote:
> >
> > On 31 December 2014 at 05:51, Mark Davis ☕ wrote:
> >
> > > The $ in CLDR transforms is NOT the same as $ in regex.
> >
> > Considering that transform syntax shares some common elements with regex syntax, it might be good to document that regular expressions are not supported in transforms. They both share Unicode sets (character classes), but the similarities stop there.
>
> Good idea. It is more than just the sets, but is a very limited subset of regex operations, plus some special features.

From srl at icu-project.org Wed Dec 31 13:18:00 2014
From: srl at icu-project.org (Steven R. Loomis)
Date: Wed, 31 Dec 2014 11:18:00 -0800
Subject: Unicode Regex Question
In-Reply-To:
References:
Message-ID:

Philippe, Mark: Transliterators seem to be in ICU 1.8, so 1999: 15, almost 16, years ago.

S

Sent from our iPhone.

> On Dec 31, 2014, at 2:51 AM, Mark Davis ☕ wrote:
>
> > No the way it is written is really a literal $ or a or b or a Greek character.
>
> Philippe, you are once again not listening. The $ in CLDR transforms is NOT the same as $ in regex. I do know what I'm talking about here: Alan Liu and I designed this (though years ago).
>
> Now, there is a defect in the LDML documentation, in that the $ is not described fully. For that, people can look at the ICU documentation (from which LDML gets the transform syntax):
>
> http://userguide.icu-project.org/transforms/general/rules#TOC-ther
>
> Cameron, would you mind filing a CLDR ticket to update and expand the documentation?
>
>
> Mark
>
> « Il meglio è l'inimico del bene »
>
>> On Wed, Dec 31, 2014 at 11:02 AM, Philippe Verdy wrote:
>> No the way it is written is really a literal $ or a or b or a Greek character.
>> And yes you used a notation embedding two character classes within another character class to create an union.
>> However $ (if it means an end of string) cannot be part of that union and cannot even be part of a character class, as it is then not a character itself but a boundary condition.
>>
>> So yes your extension is very confusing (in addition to being incoherent and not general enough to handle various boundary conditions).
>>
>> TL;DR: it was another proposal making a BETTER use of the $ for something else more productive and about how regexp can be embedded into a special syntax allowing to define any custom boundary conditions including end of strings, or other boundaries (and also not limited to properties defined in the UCD). It is a generalisation of the concept, which will be used everywhere Unicode properties are not sufficient, and without necessarily needing addition of new properties to handle specific locales (for example these boundaries could be used in CLDR data instead of the UCD, or in specific locales not supported by CLDR).
>>
>>
>> 2014-12-31 10:27 GMT+01:00 Mark Davis ☕️ :
>>>
>>>> On Wed, Dec 31, 2014 at 1:40 AM, Philippe Verdy wrote:
>>>> Your example with "[[a$b][:script=greek:]]" does not make any sense if that $ means an "end of string" and where it is embedded in a character class itself in another embedding character-class.
>>>
>>> That is incorrect. The way the transform works, any reference to a character position outside the bounds of a string matches $. So what I wrote matches the start or end of a string, or a, or b, or any greek-script character.
>>>
>>> However, if you look at the transform data files, you'll see real cases where $ is used, rather than the artificial one I used.
>>>
>>> As to the rest of your post, tl;dr.
>>>
>>> Mark
>>>
>>> Il meglio è l’inimico del bene
>
> _______________________________________________
> CLDR-Users mailing list
> CLDR-Users at unicode.org
> http://unicode.org/mailman/listinfo/cldr-users
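
The set semantics Mark describes in this thread (a $ inside a transform set matches any position outside the bounds of the string) can be sketched in Python. This is only an illustration under stated assumptions, not ICU or CLDR code: the names `is_greek` and `set_matches` are made up for this example, and the Greek check looks at character names as a rough stand-in for [:script=greek:], because Python's standard library exposes no script property.

```python
import unicodedata


def is_greek(ch: str) -> bool:
    # Rough approximation of [:script=greek:] (an assumption of this sketch):
    # test the Unicode character name rather than the real Script property.
    return unicodedata.name(ch, "").startswith("GREEK")


def set_matches(text: str, pos: int) -> bool:
    """Emulate the set [[a$b][:script=greek:]] at index pos of text."""
    if pos < 0 or pos >= len(text):
        # Any position outside the string bounds is matched by the $ member.
        return True
    ch = text[pos]
    return ch in "ab" or is_greek(ch)


text = "xa\u03b2"  # "x", "a", Greek small letter beta
# Probe one position before the start through one position past the end.
results = [set_matches(text, i) for i in range(-1, len(text) + 1)]
print(results)
```

Probing positions -1 through 3 of the sample string shows the out-of-bounds positions matching alongside "a" and the Greek letter, while "x" does not match. In a regex character class, by contrast, $ would simply be a literal dollar sign, which is the confusion discussed above.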