Bundle Lookup

Rafael Xavier rxaviers at gmail.com
Mon Feb 16 10:44:44 CST 2015


For the record,

Improvements to the Language Matching documentation are coming up: "... To
make it clear that the recommended methodology for Bundle lookup is to use
Language Matching":
http://www.unicode.org/cldr/trac/ticket/8067

On Wed, Jan 14, 2015 at 2:04 PM, Rafael Xavier <rxaviers at gmail.com> wrote:

> Hello everyone,
>
> It's clear to me that docs improvements are coming up. But, given the
> fact that I'm still digging into it (testing LanguageMatching as suggested
> for the bundle lookup matcher), I would like to share my findings with you.
> It's still a draft. But, it has a suggestion that I think would suit bundle
> lookup better than LanguageMatching.
>
>
> https://docs.google.com/document/d/1qLbuz659VvCVhgyd08KRP0SMuqCvK9bSS3-0W-kMuuw/edit?usp=sharing
>
> On Fri, Dec 12, 2014 at 7:31 PM, Rafael Xavier <rxaviers at gmail.com> wrote:
>
>> Looking forward to hearing how that shall work.
>>
>> Thank you very much so far.
>>
>> On Fri, Dec 12, 2014 at 6:27 PM, Mark Davis <mark at macchiato.com> wrote:
>>>
>>>
>>>
>>>
>>> Mark <https://google.com/+MarkDavis>
>>>
>>> *— Il meglio è l’inimico del bene —*
>>>
>>> On Fri, Dec 12, 2014 at 7:48 PM, Rafael Xavier <rxaviers at gmail.com>
>>> wrote:
>>>
>>>> Mark,
>>>>
>>>> Given an arbitrary locale ID, the recommended and only process to
>>>> deduce its respective bundle (reliably) is through Language Matching.
>>>>
>>>> Is that true?
>>>>
>>>
>>> As I said: "That being said, often people don't understand language
>>> matching, and so we are in the process of adding more information so that
>>> there is a direct mapping between locale IDs that are always considered
>>> to be "identical" on a deep level, like en-GB and en-Latn-GB."
>>>
>>>>
>>>> Considering all bundles are always present, isn't there any less
>>>> expensive algorithm that could be recommended?
>>>>
>>>> Thank you.
>>>>
>>>>
>>>> PS: My use case is a little different. I have *n* distributions of my
>>>> application, each embedded with a different locale. So, I don't need the
>>>> full power of Language Matching with regard to matching an arbitrary list
>>>> of desired locales against an arbitrary list of available locales. Still,
>>>> I do want my application to look up the right bundle given a locale
>>>> (e.g., `zh-Hant-TW` when given `zh-TW`).
>>>>
>>>> On Fri, Dec 12, 2014 at 2:50 PM, Mark Davis <mark at macchiato.com> wrote:
>>>>>
>>>>> I also want to be clear that there are two closely-related but very
>>>>> different tasks.
>>>>>
>>>>> 1. *Inherited item lookup. *Given that you have a CLDR resource
>>>>> bundle, with inheritance, where do I go to get inherited items?
>>>>>
>>>>> That is specified by CLDR by means of the parentLocale + truncation
>>>>> algorithm, plus the alias element. (There are a few cases where we have
>>>>> "Lateral Inheritance" where the specification is in the text of LDML,
>>>>> such as when looking for an alt variant.)
>>>>>
>>>>> So back to Rafael's original question:
>>>>>
>>>>>    1. en-Latn-GB, and zh-TW are not CLDR bundles, so this doesn't
>>>>>    apply to them.
>>>>>    2. en-US-u-nu-usd: the u-nu-usd doesn't select within a bundle,
>>>>>    but rather customizes a service that uses information in the bundle. The
>>>>>    item lookup (used by the currency formatting service) would be en-US
>>>>>    => en => root.
>>>>>
>>>>>
>>>>> 2. *Bundle lookup. *Given a locale ID, where do I get the best
>>>>> matching CLDR bundle?
>>>>>
>>>>> My application has a set of supported locales, and the user comes in
>>>>> with a set of desired locales. What is the best bundle for that user?
>>>>>
>>>>> Here we are not as clear as we should be. The recommended process is in
>>>>>  http://www.unicode.org/reports/tr35/#LanguageMatching
>>>>>
>>>>> So back to Rafael's original question:
>>>>>
>>>>>    1. en-Latn-GB, and zh-TW. When these are looked up with Language
>>>>>    Matching, assuming that all the CLDR locales are available, they would
>>>>>    return, respectively, en-GB and zh-Hant-TW.
>>>>>
>>>>> That being said, often people don't understand language matching, and
>>>>> so we are in the process of adding more information so that there is a
>>>>> direct mapping between locale IDs that are always considered to
>>>>> be "identical" on a deep level, like en-GB and en-Latn-GB.
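
One way to picture such a direct mapping is to compare locale IDs after filling in their likely subtags. The sketch below is not the Language Matching algorithm from TR35; it only treats two IDs as equivalent when their maximized forms are equal. The `LIKELY` table is a tiny hand-copied stand-in for likelySubtags data, and `parse`, `maximize`, and `lookup_bundle` are hypothetical names.

```python
# Tiny stand-in for likelySubtags data (assumption: three entries only).
LIKELY = {
    "en": ("en", "Latn", "US"),
    "zh": ("zh", "Hans", "CN"),
    "zh-TW": ("zh", "Hant", "TW"),
}


def parse(tag):
    """Split a simple language[-Script][-REGION] tag; extensions are not handled."""
    parts = tag.replace("_", "-").split("-")
    lang, script, region = parts[0].lower(), None, None
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha():
            script = part.title()
        elif (len(part) == 2 and part.isalpha()) or (len(part) == 3 and part.isdigit()):
            region = part.upper()
    return lang, script, region


def maximize(tag):
    """Fill in missing script and region from the likely-subtags table."""
    lang, script, region = parse(tag)
    candidates = [(lang, script, region), (lang, None, region),
                  (lang, script, None), (lang, None, None)]
    for candidate in candidates:
        key = "-".join(x for x in candidate if x)
        if key in LIKELY:
            _, likely_script, likely_region = LIKELY[key]
            return "-".join((lang, script or likely_script, region or likely_region))
    return None


def lookup_bundle(requested, available):
    """Return the first available bundle whose maximized form equals the request's."""
    target = maximize(requested)
    if target is None:
        return None
    for bundle in available:
        if maximize(bundle) == target:
            return bundle
    return None


AVAILABLE = ["en", "en-GB", "zh", "zh-Hant-TW"]
print(lookup_bundle("en-Latn-GB", AVAILABLE))
print(lookup_bundle("zh-TW", AVAILABLE))
```

With this sample data the two lookups give en-GB and zh-Hant-TW, the same results quoted above. Ties between bundles that maximize identically, and any notion of "closest" match, are deliberately left out.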
>>>>>
>>>>>
>>>>>
>>>>> Mark <https://google.com/+MarkDavis>
>>>>>
>>>>> *— Il meglio è l’inimico del bene —*
>>>>>
>>>>> On Fri, Dec 12, 2014 at 5:04 PM, John Emmons <emmo at us.ibm.com> wrote:
>>>>>
>>>>>> Yes, Edwin, there is a very good reason we don't want zh-Hant to
>>>>>> inherit from zh.  Simply put, in situations where you have locale resources
>>>>>> that aren't 100% populated, allowing zh-Hant to inherit from zh produces a
>>>>>> mixture of simplified and traditional Chinese, which is acceptable to no
>>>>>> one.  This is what we call "cross script inheritance" in CLDR.  While it
>>>>>> might be acceptable to some in the case of Chinese, it is certainly a
>>>>>> bigger problem in languages like Serbian, where you have both Latin and
>>>>>> Cyrillic scripts in use, and you certainly don't ever want a mixture of
>>>>>> Latin and Cyrillic scripts.
>>>>>>
>>>>>> These relationships are documented in CLDR's supplemental data, where
>>>>>> you have specified:
>>>>>>
>>>>>> <parentLocale parent="root" locales="az_Cyrl bm_Nkoo bs_Cyrl en_Dsrt
>>>>>> ha_Arab mn_Mong ms_Arab pa_Arab shi_Latn sr_Latn uz_Arab uz_Cyrl vai_Latn
>>>>>> zh_Hant"/>
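
As a rough illustration, the element quoted above can be turned into a child-to-parent table with Python's standard XML parser. The snippet is copied from this message; `parent_map` is a hypothetical helper name, and a real loader would iterate over every parentLocale element in the supplemental data file rather than a single string.

```python
import xml.etree.ElementTree as ET

# The parentLocale element exactly as quoted in this thread.
SNIPPET = (
    '<parentLocale parent="root" locales="az_Cyrl bm_Nkoo bs_Cyrl en_Dsrt '
    'ha_Arab mn_Mong ms_Arab pa_Arab shi_Latn sr_Latn uz_Arab uz_Cyrl vai_Latn '
    'zh_Hant"/>'
)


def parent_map(xml_text):
    """Map each locale listed in a parentLocale element to its explicit parent."""
    element = ET.fromstring(xml_text)
    parent = element.get("parent")
    return {child: parent for child in element.get("locales").split()}


PARENTS = parent_map(SNIPPET)
print(PARENTS["zh_Hant"])
print(PARENTS["sr_Latn"])
```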
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> John C. Emmons
>>>>>> Globalization Architect & Unicode CLDR TC Chairman
>>>>>> IBM Software Group
>>>>>> Internet: emmo at us.ibm.com
>>>>>>
>>>>>>
>>>>>> From: Edwin Hoogerbeets <ehoogerbeets at gmail.com>
>>>>>> To: John Emmons/Austin/IBM at IBMUS, Rafael Xavier <rxaviers at gmail.com>
>>>>>> Cc: Jörn Zaefferer <joern.zaefferer at gmail.com>,
>>>>>> "cldr-users at unicode.org" <cldr-users at unicode.org>
>>>>>> Date: 12/11/2014 07:41 PM
>>>>>> Subject: Re: Bundle Lookup
>>>>>> ------------------------------
>>>>>>
>>>>>>
>>>>>>
>>>>>> Rafael, also take a look at common/supplemental/likelySubtags.xml. If
>>>>>> the caller has passed you an incompletely specified locale, you can use
>>>>>> those mappings to see if you can get to a locale for which you do have a
>>>>>> string bundle. I think that is the source for the "language aliases" to
>>>>>> which John was referring.
>>>>>>
>>>>>> John, for the last part of your example zh-TW inheritance chain,
>>>>>> wouldn't you just truncate "zh-Hant" again to "zh" like in the en-GB
>>>>>> example before inheriting from the root? If not, what is the reasoning
>>>>>> there? Is there already a document that specifies the inheritance rules in
>>>>>> CLDR?
>>>>>>
>>>>>> For efficiency, I can imagine you would put the common translations
>>>>>> in "zh" where there is no difference between traditional and simplified,
>>>>>> and other translations in "zh-Hant" or "zh-Hans" where there is. That would
>>>>>> save some disk space and you could leverage linguistic bug fixes at the
>>>>>> "zh" level. For other locales like "sr-Latn" and "sr-Cyrl" there would be
>>>>>> nothing in common so the string bundle at the "sr" level would be
>>>>>> essentially empty, but it should still appear in the inheritance chain just
>>>>>> in case.
>>>>>>
>>>>>> Edwin
>>>>>>
>>>>>>
>>>>>> On 12/11/2014 02:53 PM, John Emmons wrote:
>>>>>>
>>>>>>
>>>>>>    #3 is currently a problem, which we are working on.  Basically,
>>>>>>    "Latn" needs to be stripped out because it isn't necessary.  Then follow
>>>>>>    the normal inheritance:
>>>>>>
>>>>>>    en-GB: en-GB → (parentLocale) en-001 → (truncation) en → root
>>>>>>
>>>>>>    #4 - Any unicode locale extensions are meant to identify
>>>>>>    particular behaviors that are desired in the context of a given locale.
>>>>>>    Think of them like "options".  They are not meant to be used in the context
>>>>>>    of bundle lookups.
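
A small sketch of that idea: drop the Unicode locale extension before doing any bundle lookup, and hand the options to the service separately. `strip_unicode_extension` is a hypothetical helper, not an API from any library, and it assumes a well-formed BCP 47 tag.

```python
def strip_unicode_extension(tag):
    """Remove the -u- extension (and its keywords) from a BCP 47 tag."""
    parts = tag.split("-")
    kept = []
    skipping = False
    for index, part in enumerate(parts):
        if index > 0 and len(part) == 1:
            if part.lower() == "x":
                # Private use: everything from here on is kept untouched.
                kept.extend(parts[index:])
                break
            # Any other singleton starts a new extension; only "u" is dropped.
            skipping = part.lower() == "u"
        if not skipping:
            kept.append(part)
    return "-".join(kept)


print(strip_unicode_extension("en-US-u-nu-usd"))
print(strip_unicode_extension("de-DE-u-co-phonebk-x-foo"))
```

The first call leaves en-US, which is the ID the bundle lookup and the en-US → en → root item lookup would then work with.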
>>>>>>
>>>>>>    #5 - zh_TW - Now that proper language aliases are in place (see
>>>>>>    http://unicode.org/cldr/trac/ticket/5949):
>>>>>>
>>>>>>    zh-TW: zh-TW → (languageAlias) zh-Hant-TW → (truncation) zh-Hant →
>>>>>>    (parentLocale) root
>>>>>>
>>>>>>    Regards,
>>>>>>
>>>>>>    John C. Emmons
>>>>>>    Globalization Architect & Unicode CLDR TC Chairman
>>>>>>    IBM Software Group
>>>>>>    Internet: emmo at us.ibm.com
>>>>>>
>>>>>>
>>>>>>    From: Rafael Xavier <rxaviers at gmail.com>
>>>>>>    To: "cldr-users at unicode.org" <cldr-users at unicode.org>
>>>>>>    Cc: Jörn Zaefferer <joern.zaefferer at gmail.com>
>>>>>>    Date: 12/11/2014 01:02 PM
>>>>>>    Subject: Bundle Lookup
>>>>>>    Sent by: "CLDR-Users" <cldr-users-bounces at unicode.org>
>>>>>>
>>>>>>    ------------------------------
>>>>>>
>>>>>>
>>>>>>
>>>>>>    Friends,
>>>>>>
>>>>>>    This is a very basic question. See below. There is a lot of
>>>>>>    documentation about locale inheritance and matching. But, it fails in
>>>>>>    some cases for me.
>>>>>>
>>>>>>    *Given a locale, what's the procedure to find the **bundle** lookup
>>>>>>    chain?*
>>>>>>
>>>>>>    1. en-US: en-US → (truncation) en → root
>>>>>>
>>>>>>    This one is dead simple. No problem.
>>>>>>
>>>>>>    2. en-GB: en-GB → (parentLocale) en-001 → (truncation) en → root
>>>>>>
>>>>>>    This one is also dead simple. Although, documentation says en-GB
>>>>>>    → en. Is it outdated or am I doing something wrong?
>>>>>>
>>>>>>    Anyway, the ones I'm interested in knowing are:
>>>>>>
>>>>>>    3. en-Latn-GB
>>>>>>    4. en-US-u-nu-usd
>>>>>>    5. zh-TW
>>>>>>
>>>>>>    Please, could someone show me what's the chain of these locales
>>>>>>    (and obviously explain the steps)?
>>>>>>
>>>>>>    Thanks!
>>>>>>
>>>>>>    --
>>>>>>    +55 (16) 98138-1582, +1 (415) 568-5854, skype: rxaviers
>>>>>>    http://rafael.xavier.blog.br
>>>>>>    _______________________________________________
>>>>>>    CLDR-Users mailing list
>>>>>>    CLDR-Users at unicode.org
>>>>>>    http://unicode.org/mailman/listinfo/cldr-users
>>>>>>



-- 
+55 (16) 98138-1582, +1 (415) 568-5854, skype: rxaviers
http://rafael.xavier.blog.br