Bundle Lookup

Mark Davis ☕️ mark at macchiato.com
Fri Dec 12 14:27:09 CST 2014


Mark <https://google.com/+MarkDavis>

*— Il meglio è l’inimico del bene —*

On Fri, Dec 12, 2014 at 7:48 PM, Rafael Xavier <rxaviers at gmail.com> wrote:

> Mark,
>
> Given an arbitrary locale ID, the recommended (and only reliable) process
> for deducing its respective bundle is Language Matching.
>
> Is that true?
>

As I said: "That being said, often people don't understand language matching,
and so we are in the process of adding more information so that there is a
direct mapping between locale IDs that are always considered to be
"identical" on a deep level, like en-GB and en-Latn-GB."

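To illustrate what such a direct mapping could look like, here is a minimal sketch that treats two locale IDs as "identical" when they agree after likely subtags are filled in. It is not the Language Matching algorithm and not a reference implementation. The `LIKELY_SUBTAGS` table is a small hand-picked excerpt, and the names `parse`, `maximize`, and `find_bundle` are made up for this example.

```python
# Hypothetical excerpt of likelySubtags data; the real file has many more entries.
LIKELY_SUBTAGS = {
    "en": "en-Latn-US",
    "zh": "zh-Hans-CN",
    "zh-TW": "zh-Hant-TW",
}


def parse(tag):
    """Split a simple language[-script][-region] tag into its three parts."""
    parts = tag.split("-")
    language, script, region = parts[0], None, None
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha():
            script = part
        elif (len(part) == 2 and part.isalpha()) or (len(part) == 3 and part.isdigit()):
            region = part
    return language, script, region


def maximize(tag):
    """Fill in a missing script and region from the likely-subtags excerpt."""
    language, script, region = parse(tag)
    candidates = [
        "-".join(p for p in (language, script, region) if p),
        "-".join(p for p in (language, region) if p),
        "-".join(p for p in (language, script) if p),
        language,
    ]
    for key in candidates:
        if key in LIKELY_SUBTAGS:
            _, likely_script, likely_region = parse(LIKELY_SUBTAGS[key])
            return "-".join((language, script or likely_script, region or likely_region))
    return tag  # nothing known about this language; leave the tag unchanged


def find_bundle(requested, available):
    """Return the available bundle whose maximized form equals the request's, or None."""
    # Assumes no two available bundles maximize to the same ID.
    by_maximized = {maximize(bundle): bundle for bundle in available}
    return by_maximized.get(maximize(requested))


bundles = ["en", "en-GB", "zh-Hans", "zh-Hant-TW"]
print(find_bundle("en-Latn-GB", bundles))
print(find_bundle("zh-TW", bundles))
```

A miss returns `None`, at which point an application would still need full Language Matching or a default bundle.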

>
> Considering all bundles are always present, isn't there any less expensive
> algorithm that could be recommended?
>
> Thank you.
>
>
> PS: My use case is a little different. I have *n* distributions of my
> application, each embedded with a different locale. So I don't need the
> full power of Language Matching with regard to matching an arbitrary list
> of desired locales against an arbitrary list of available locales. I do,
> however, want my application to look up the right bundle given a locale
> (e.g., `zh-Hant-TW` when given `zh-TW`).
>
> On Fri, Dec 12, 2014 at 2:50 PM, Mark Davis ☕️ <mark at macchiato.com> wrote:
>>
>> I also want to be clear that there are two closely-related but very
>> different tasks.
>>
>> 1. *Inherited item lookup.* Given that you have a CLDR resource bundle,
>> with inheritance, where do I go to get inherited items?
>>
>> That is specified by CLDR by means of the parentLocale + truncation
>> algorithm, plus the alias element. (There are a few cases where we have
>> "Lateral Inheritance" where the specification is in the text of LDML,
>> such as when looking for an alt variant.)
>>
>> So back to Rafael's original question:
>>
>>    1. en-Latn-GB, and zh-TW are not CLDR bundles, so this doesn't apply
>>    to them.
>>    2. en-US-u-nu-usd: the u-nu-usd doesn't select within a bundle, but
>>    rather customizes a service that uses information in the bundle. The item
>>    lookup (used by the currency formatting service) would be en-US =>
>>    en => root.
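
The parentLocale + truncation rule described above can be sketched as follows. This is only an illustration under stated assumptions: `PARENT_LOCALES` is a tiny hand-written excerpt (the en-GB and zh-Hant entries are the ones mentioned in this thread), `inheritance_chain` is a made-up name, and alias handling and lateral inheritance are left out.

```python
# Hypothetical excerpt of CLDR parentLocale data; the real data has more entries.
PARENT_LOCALES = {
    "en-GB": "en-001",
    "zh-Hant": "root",
    "sr-Latn": "root",
}


def inheritance_chain(bundle_id):
    """Return the item lookup chain for an existing CLDR bundle ID."""
    chain = [bundle_id]
    current = bundle_id
    while current != "root":
        if current in PARENT_LOCALES:
            current = PARENT_LOCALES[current]    # explicit parentLocale entry
        elif "-" in current:
            current = current.rsplit("-", 1)[0]  # truncation: drop the last subtag
        else:
            current = "root"                     # a bare language inherits from root
        chain.append(current)
    return chain


print(inheritance_chain("en-US"))       # ['en-US', 'en', 'root']
print(inheritance_chain("en-GB"))       # ['en-GB', 'en-001', 'en', 'root']
print(inheritance_chain("zh-Hant-TW"))  # ['zh-Hant-TW', 'zh-Hant', 'root']
```

The explicit parent is checked before truncation, so zh-Hant goes straight to root and never passes through zh.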
>>
>>
>> 2. *Bundle lookup.* Given a locale ID, where do I get the best matching
>> CLDR bundle?
>>
>> My application has a set of supported locales, and the user comes in with
>> a set of desired locales. What is the best bundle for that user?
>>
>> Here we are not as clear as we should be. The recommended process is in
>> http://www.unicode.org/reports/tr35/#LanguageMatching
>>
>> So back to Rafael's original question:
>>
>>    1. en-Latn-GB, and zh-TW. When these are looked up with Language
>>    Matching, assuming that all the CLDR locales are available, they would
>>    return, respectively, en-GB and zh-Hant-TW.
>>
>> That being said, often people don't understand language matching, and so
>> we are in the process of adding more information so that there is a direct
>> mapping between locale IDs that are always considered to be
>> "identical" on a deep level, like en-GB and en-Latn-GB.
>>
>>
>>
>> Mark <https://google.com/+MarkDavis>
>>
>> *— Il meglio è l’inimico del bene —*
>>
>> On Fri, Dec 12, 2014 at 5:04 PM, John Emmons <emmo at us.ibm.com> wrote:
>>
>>> Yes, Edward, there is a very good reason we don't want zh-Hant to
>>> inherit from zh.  Simply put, in situations where you have locale resources
>>> that aren't 100% populated, allowing zh-Hant to inherit from zh produces a
>>> mixture of simplified and traditional Chinese, which is acceptable to no
>>> one.  This is what we call "cross script inheritance" in CLDR.  While it
>>> might be acceptable to some in the case of Chinese, it is certainly a
>>> bigger problem in languages like Serbian, where you have both Latin and
>>> Cyrillic scripts in use, and you certainly don't ever want a mixture of
>>> Latin and Cyrillic scripts.
>>>
>>> These relationships are documented in CLDR's supplemental data, where
>>> you have specified:
>>>
>>> <parentLocale parent="root" locales="az_Cyrl bm_Nkoo bs_Cyrl en_Dsrt
>>> ha_Arab mn_Mong ms_Arab pa_Arab shi_Latn sr_Latn uz_Arab uz_Cyrl vai_Latn
>>> zh_Hant"/>
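
As a rough sketch of how an application might load this element into a child-to-parent table, the snippet below parses the exact `parentLocale` element quoted above with Python's standard `xml.etree.ElementTree`. The wrapping `<parentLocales>` element, the function name `load_parent_locales`, and the conversion of underscores to hyphens are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# The parentLocale element quoted above, wrapped so that it parses on its own.
SNIPPET = """
<parentLocales>
  <parentLocale parent="root" locales="az_Cyrl bm_Nkoo bs_Cyrl en_Dsrt
    ha_Arab mn_Mong ms_Arab pa_Arab shi_Latn sr_Latn uz_Arab uz_Cyrl vai_Latn
    zh_Hant"/>
</parentLocales>
"""


def load_parent_locales(xml_text):
    """Map each child locale to its explicit parent, using hyphenated IDs."""
    parents = {}
    root = ET.fromstring(xml_text.strip())
    for element in root.iter("parentLocale"):
        parent = element.get("parent").replace("_", "-")
        for child in element.get("locales").split():
            parents[child.replace("_", "-")] = parent
    return parents


parents = load_parent_locales(SNIPPET)
print(parents["zh-Hant"])  # root
print(parents["sr-Latn"])  # root
```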
>>>
>>>
>>> Regards,
>>>
>>> John C. Emmons
>>> Globalization Architect & Unicode CLDR TC Chairman
>>> IBM Software Group
>>> Internet: emmo at us.ibm.com
>>>
>>>
>>>
>>> From: Edwin Hoogerbeets <ehoogerbeets at gmail.com>
>>> To: John Emmons/Austin/IBM at IBMUS, Rafael Xavier <rxaviers at gmail.com>
>>> Cc: Jörn Zaefferer <joern.zaefferer at gmail.com>, "cldr-users at unicode.org"
>>> <cldr-users at unicode.org>
>>> Date: 12/11/2014 07:41 PM
>>> Subject: Re: Bundle Lookup
>>> ------------------------------
>>>
>>>
>>>
>>> Rafael, also take a look at common/supplemental/likelySubtags.xml. If
>>> the caller has passed you an incompletely specified locale, you can use
>>> those mappings to see if you can get to a locale for which you do have a
>>> string bundle. I think that is the source for the "language aliases" to
>>> which John was referring.
>>>
>>> John, for the last part of your example zh-TW inheritance chain,
>>> wouldn't you just truncate "zh-Hant" again to "zh" like in the en-GB
>>> example before inheriting from the root? If not, what is the reasoning
>>> there? Is there already a document that specifies the inheritance rules in
>>> CLDR?
>>>
>>> For efficiency, I can imagine you would put the common translations in
>>> "zh" where there is no difference between traditional and simplified, and
>>> other translations in "zh-Hant" or "zh-Hans" where there is. That would
>>> save some disk space and you could leverage linguistic bug fixes at the
>>> "zh" level. For other locales like "sr-Latn" and "sr-Cyrl" there would be
>>> nothing in common so the string bundle at the "sr" level would be
>>> essentially empty, but it should still appear in the inheritance chain just
>>> in case.
>>>
>>> Edwin
>>>
>>>
>>> On 12/11/2014 02:53 PM, John Emmons wrote:
>>>
>>>
>>>    #3 is currently a problem, which we are working on.  Basically,
>>>    "Latn" needs to be stripped out because it isn't necessary.  Then follow
>>>    the normal inheritance:
>>>
>>>    en-GB: en-GB → (parentLocale) en-001 → (truncation) en → root
>>>
>>>    #4 - Any unicode locale extensions are meant to identify particular
>>>    behaviors that are desired in the context of a given locale.  Think of them
>>>    like "options".  They are not meant to be used in the context of bundle
>>>    lookups.
>>>
>>>    #5 - zh_TW - Now that proper language aliases are in place (see
>>>    http://unicode.org/cldr/trac/ticket/5949):
>>>
>>>    zh-TW: zh-TW → (languageAlias) zh-Hant-TW → (truncation) zh-Hant →
>>>    (parentLocale) root
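
For #4, one way to keep Unicode locale extensions out of the bundle lookup is to split them off first and pass them to the formatting service as options. The sketch below assumes a well-formed tag and relies on the BCP 47 convention that a single-character subtag (a singleton such as "u") begins the extension part. `split_extensions` is a made-up helper name.

```python
def split_extensions(tag):
    """Split a locale ID into (bundle lookup part, extension part)."""
    parts = tag.split("-")
    # Start at index 1: the first subtag is always the language.
    for i in range(1, len(parts)):
        if len(parts[i]) == 1:  # singleton, e.g. "u", starts the extensions
            return "-".join(parts[:i]), "-".join(parts[i:])
    return tag, ""


print(split_extensions("en-US-u-nu-usd"))  # ('en-US', 'u-nu-usd')
print(split_extensions("zh-TW"))           # ('zh-TW', '')
```

Only the first element of the result would take part in the bundle lookup.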
>>>
>>>    Regards,
>>>
>>>    John C. Emmons
>>>    Globalization Architect & Unicode CLDR TC Chairman
>>>    IBM Software Group
>>>    Internet: emmo at us.ibm.com
>>>
>>>
>>>
>>>    From: Rafael Xavier <rxaviers at gmail.com>
>>>    To: "cldr-users at unicode.org" <cldr-users at unicode.org>
>>>    Cc: Jörn Zaefferer <joern.zaefferer at gmail.com>
>>>    Date: 12/11/2014 01:02 PM
>>>    Subject: Bundle Lookup
>>>    Sent by: "CLDR-Users" <cldr-users-bounces at unicode.org>
>>>
>>>    ------------------------------
>>>
>>>
>>>
>>>    Friends,
>>>
>>>    This is a very basic question. See below. There is a lot of
>>>    documentation about locale inheritance and matching, but it fails me in
>>>    some cases.
>>>
>>>    *Given a locale, what's the procedure to find the **bundle** lookup
>>>    chain?*
>>>
>>>    1. en-US: en-US → (truncation) en → root
>>>
>>>    This one is dead simple. No problem.
>>>
>>>    2. en-GB: en-GB → (parentLocale) en-001 → (truncation) en → root
>>>
>>>    This one is also dead simple. However, the documentation says en-GB →
>>>    en. Is it outdated, or am I doing something wrong?
>>>
>>>    Anyway, the ones I'm interested in knowing are:
>>>
>>>    3. en-Latn-GB
>>>    4. en-US-u-nu-usd
>>>    5. zh-TW
>>>
>>>    Could someone please show me the chain for each of these locales (and
>>>    explain the steps)?
>>>
>>>    Thanks!
>>>
>>>    --
>>>    +55 (16) 98138-1582, +1 (415) 568-5854, skype: rxaviers
>>>    http://rafael.xavier.blog.br
>>>    _______________________________________________
>>>    CLDR-Users mailing list
>>>    CLDR-Users at unicode.org
>>>    http://unicode.org/mailman/listinfo/cldr-users
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>
> --
> +55 (16) 98138-1582, +1 (415) 568-5854, skype: rxaviers
> http://rafael.xavier.blog.br
>