Fwd: alternate formatting data for algorithmic number systems when they fallback to a decimal system

Thu Apr 16 11:09:52 CDT 2015

---------- Forwarded message ----------
From: Cameron Dutro <cameron at lumoslabs.com>
Date: Thu, Apr 16, 2015 at 9:09 AM
Subject: Re: alternate formatting data for algorithmic number systems when
they fallback to a decimal system
To: Philippe Verdy <verdy_p at wanadoo.fr>

Thank you for the clarification, Philippe. In my previous email I was not
necessarily trying to respond with approval or disapproval of your
proposal, but rather to understand the issue better. I am in no position to
effect any kind of change in CLDR or ICU. Having read your second and third
emails, I think I agree with you. I'd like to hear what Mark and Markus
have to say about this too, however.

-Cameron

On Thu, Apr 16, 2015 at 3:09 AM, Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> My proposal concerns in fact all types of number formatters currently
> supported in CLDR data and that could all be algorithmic:
> - number systems (cardinals),
> - ordinal,
> - year numbering,
> - month numbering,
> - day numbering,
> - century numbering (in French it uses the roman-lower system with
> ordinals),
> - millennium numbering (in French it uses the roman-upper system with
> ordinals),
> - accounting amounts,
> - currency amounts (displayed prices),
> - measurement with unit,
> - spellout using translated words for all the usages above...
>
> It also concerns number parsers, that are built to parse and accept all
> these formatted numbers using the same rulesets, plus a lenient parsing
> ruleset for accepting numbers not formatted this way (e.g. a "roman-lower"
> parser will typically contain lenient parsing rules for accepting all
> numbers formatted with a decimal system, as well as numbers formatted in
> "roman-upper")...
> On Apr 16, 2015 11:27, "Philippe Verdy" <verdy_p at wanadoo.fr> wrote:
>
>> No, ICU does NOT handle this case.
>>
>> When using a locale whose number system is algorithmic, yes, it uses that
>> system, as specified in CLDR data, and yes, it uses the associated RBNF
>> rulesets.
>>
>> But the problem is within these rulesets, when one of the rules specifies
>> a substitution which is neither another ruleset name nor an empty
>> substitution (such as == or << or >>) but a decimal format starting with 0
>> or #.
>>
>> In that case the decimal format is used blindly and does not use the
>> native decimal digits, the native separators, or the native grouping and
>> decimal formats of that locale.
>>
>> The problem is in fact in the CLDR data, where a rule specifies a
>> substitution like this one in the "roman-lower" system:
>>
>> "5000: =##,##0="
>>
>> which should really be
>>
>> "5000: =="
>>
>> to ignore the specified decimal format and instead select an appropriate
>> decimal format for the locale in ANOTHER number system that is not
>> algorithmic but decimal. That system would be searched for first under the
>> "native" alias when it is mapped for that locale (in CLDR data all locales
>> have a mapping of the effective number system to use for the "native"
>> number system alias; this is not the case for the "finance" or "traditio"
>> aliases), and then under the default number system for that locale (in
>> CLDR data, all locales have a decimal system mapped there, which is not
>> necessarily the modern Latin system but is formattable with ten digits and
>> with standard separators and signs, which are still localized).
>>
>> In summary, you have still not understood why this is an issue not just
>> inside ICU but in fact in the CLDR data itself, independently of the ICU
>> implementation. The problem is NOT:
>> - in the mapping of locales to their number systems in several variants
>> (default, native, traditio, finance), possibly also aliased,
>> - in the mapping of a number system to a decimal or algorithmic type,
>> - in the definition of each algorithmic number system by a group of
>> rulesets including one which is public (not named with a %% prefix) and
>> designated as the main ruleset to use,
>> - in the definition of each ruleset with several rules, each rule being
>> keyed either by a special rule type (proper fraction, improper fraction, or
>> master) or by value (an integer or fraction).
>>
>> The problem is in the definition of an individual RBNF rule, where it
>> uses a substitution to a decimal format starting with 0 or # (such a
>> substitution may be surrounded by == or << or >> to specify how to compute
>> the value to format): this is something that I propose to deprecate and
>> even remove completely from CLDR data, as it is clearly wrong or
>> insufficient: it bypasses the per-locale settings for their preferred
>> decimal system when not using their preferred algorithmic system.
>>
>> However I maintain the role of == or << or >> to compute the value that
>> will be passed down to the decimal formatter.
>>
>> So your reply in fact gives absolutely no hint and even the link to the
>> ICU constructor is inappropriate for this issue (I know what it does, and I
>> had already inspected this code before sending my first email with the
>> proposal). You had clearly not understood the issue that I have just
>> reformulated here with more explicit details.
>> On Apr 15, 2015 19:39, "Cameron Dutro" <cameron at lumoslabs.com> wrote:
>>
>>> Hey Philippe,
>>>
>>> My understanding is that the implementer should just use the number
>>> system for the given locale. ICU actually lets you specify the number
>>> system, see the docs here:
>>> http://www.icu-project.org/apiref/icu4c/classRuleBasedNumberFormat.html
>>> (see specifically the icu::RuleBasedNumberFormat::RuleBasedNumberFormat
>>> constructor). I understand from your email that converting to a different
>>> number system isn't always as straightforward as a 1:1 text replace, but I
>>> believe the current CLDR number formatting rules handle these cases, yes?
>>> I've noticed that ICU at least formats numbers in RBNF rules using the
>>> correct numbering system for the locale.
>>>
>>> -Cameron
>>>
>>> On Wed, Apr 15, 2015 at 6:42 AM, Philippe Verdy <verdy_p at wanadoo.fr>
>>> wrote:
>>>
>>>> For now the CLDR data for algorithmic number systems use RBNF rules
>>>> when this is possible, but the last mapping, when this does not work, is
>>>> to use a specific decimal format (starting with 0 or #).
>>>>
>>>> One problem is that this decimal format is the same independently of
>>>> the actual locale (language or number style in that language) for which the
>>>> number system has been mapped.
>>>>
>>>> Different locales using the same number system have in fact different
>>>> rules for formatting numbers when they are forced to use a fallback to a
>>>> decimal system.
>>>>
>>>> These fallbacks are currently typically specified as the substitution
>>>> "=#,##0.00=", which is clearly wrong (e.g. for Traditional Tamil): these
>>>> formats in fact assume a specific language, and that language is not the
>>>> same for all locales using this number system.
>>>>
>>>> I propose deprecating these mappings and instead just setting them to the
>>>> substitution "==", meaning that it will use the format for the decimal
>>>> system which will be used instead.
>>>>
>>>> Note that when using locale resolution mechanisms to find the
>>>> appropriate number system to use for formatting numbers, it will (if you
>>>> are not careful) map it again to the same traditional algorithmic
>>>> system, so this would recurse infinitely:
>>>>
>>>> - the "==" substitution must look for a mapping for the locale in the
>>>> *default* decimal number variant,
>>>>
>>>> - but it could also map to the "native" decimal number variant mapped
>>>> for that locale (replacing the "traditional" variant, which is
>>>> algorithmic), using the substitution "=-native=", so that native digits
>>>> will still be used (instead of just the Latin digits, when these locales
>>>> use the Latin digits by default, and not the native ones).
>>>>
>>>> With this proposal, the CLDR data for number systems would no longer
>>>> contain any data using "=#...=" or "=0...=" substitutions; the traditional
>>>> systems would still be able to format all numbers even those they do not
>>>> support internally, using the native digits, and the appropriate separators
>>>> (decimal, grouping), and appropriate grouping.
>>>>
>>>> One way to implement it, however, does not require changing the CLDR
>>>> data: the implementation can autodetect the "=#...=" or "=0...="
>>>> substitution rules found in algorithmic number systems and consider them
>>>> all equivalent to just "==": it would first try to map the locale to a
>>>> "native" decimal variant, and use it (note that the "native" variant
>>>> already has fallbacks for all locales to the default decimal variant:
>>>> this is the case for most locales other than the Indian ones, which are
>>>> alone in having "native" mappings).
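
A minimal sketch of the autodetection idea described above, assuming rule
texts are available as plain strings. The function name and the regular
expression are illustrative only and are not part of ICU or CLDR; the sketch
handles only the "=...=" form, not "<<" or ">>".

```python
import re

# Matches a same-value substitution whose body is a decimal pattern:
# "=" followed by "0" or "#", then anything up to the closing "=".
# (Hypothetical helper; not an ICU or CLDR API.)
DECIMAL_SUBSTITUTION = re.compile(r"=[0#][^=]*=")


def normalize_rule(rule_text: str) -> str:
    """Treat every "=0...=" or "=#...=" substitution as a plain "=="."""
    return DECIMAL_SUBSTITUTION.sub("==", rule_text)


print(normalize_rule("5000: =#,##0=;"))
print(normalize_rule("0: =%spellout-numbering=;"))
```

Substitutions that name a ruleset are left untouched, because their body
starts with "%" rather than "0" or "#".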
>>>>
>>>> In summary the resolution for algorithmic systems would use the
>>>> following path:
>>>> - use "traditional" rules if it works (it uses the RBNF data)
>>>> - when it finds a "==" substitution (or any "=0...=" or "=#...="
>>>> substitution), find the decimal number system in the "native" variant, and
>>>> format numbers in that system, and use the appropriate separators and
>>>> groupings
>>>> - if there's no "native" variant mapped for that locale, it will
>>>> fall back to the default system (in the CLDR data charts, we see that
>>>> this is the case because there's an entry mapping "All other locales" to
>>>> the Latin number system), which will also use the same separators and
>>>> groupings.
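
The resolution path above can be sketched as a small lookup. The tables
below are illustrative stand-ins for the CLDR locale-to-number-system
mappings, not real CLDR data access, and every name in the code is
hypothetical. Skipping algorithmic systems is what avoids the infinite
recursion mentioned earlier.

```python
# Illustrative data only: which systems are algorithmic, and which system
# each variant maps to for a few locales (assumed values, not read from CLDR).
ALGORITHMIC = {"taml", "roman", "romanlow"}

LOCALE_SYSTEMS = {
    "ta": {"default": "latn", "native": "tamldec", "traditional": "taml"},
    "fr": {"default": "latn"},
}


def fallback_decimal_system(locale: str) -> str:
    """Pick the decimal system to use when an algorithmic rule hits "=="."""
    systems = LOCALE_SYSTEMS.get(locale, {})
    # Try "native" first, then "default"; never return an algorithmic
    # system, otherwise the fallback could loop back into the RBNF rules.
    for variant in ("native", "default"):
        name = systems.get(variant)
        if name is not None and name not in ALGORITHMIC:
            return name
    # "All other locales" entry: fall back to the Latin decimal system.
    return "latn"


print(fallback_decimal_system("ta"))
print(fallback_decimal_system("fr"))
```

The formatter would then use the digits, separators and grouping that the
locale defines for the returned system.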
>>>>
>>>> This will be a major improvement for number systems used in lots of
>>>> languages (including Latin-written languages) such as the "roman" number
>>>> system.
>>>>
>>>> One more note:
>>>>
>>>> East Asian locales written in traditional scripts prefer to use their own
>>>> algorithmic system, which cannot format all numbers. As they are rendered
>>>> using sinographic squares, the fallback "native" digits should use the
>>>> "fullwidth" variant: this can be specified using "=-native=" or, more
>>>> specifically, "=-fullwidth=".
>>>>
>>>> Note that for now no "==" substitution rule can start with a minus sign
>>>> ("-"); it must only be:
>>>> - a valid ruleset name (starting with % or %%), or
>>>> - a decimal format (starting with "0" or "#", which I want to deprecate),
>>>> or
>>>> - empty (but the current implementation in ICU creates an infinite
>>>> loop, or only uses Basic Latin decimal digits in a fixed number format,
>>>> independent of the locale).
>>>>
>>>> So there is absolutely no conflict when we use a "==" substitution rule
>>>> starting with a minus sign (-) to mean that it should use another
>>>> specified number system (such as "native" or "fullwidth" or any specific
>>>> non-algorithmic number system), which is named just after this minus sign.
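
To illustrate why the leading minus sign does not collide with the existing
forms, here is a small sketch that classifies the text found between the two
"=" signs by its first character. The category names are made up for this
example, and the "number-system" branch reflects the proposal, not current
ICU behaviour.

```python
def classify_substitution(body: str) -> str:
    """Classify the body of an "=...=" substitution by its first character."""
    if body == "":
        return "empty"            # plain "=="
    if body.startswith("%"):
        return "ruleset"          # "%name" or "%%name"
    if body[0] in "0#":
        return "decimal-format"   # the form proposed for deprecation
    if body.startswith("-"):
        return "number-system"    # proposed: "-native", "-fullwidth", ...
    return "unknown"


for body in ("", "%spellout-numbering", "#,##0.00", "-native"):
    print(repr(body), "->", classify_substitution(body))
```

Each existing form starts with a different character ("%", "0" or "#"), so a
body starting with "-" can be recognized without ambiguity.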
>>>>
>>>> ----
>>>>
>>>> Alternatively, the standard code of a locale (starting with a letter 'a'
>>>> to 'z') could be used in these "==" substitutions, for example:
>>>> - "=ja=" (it would be used only for spellout number formatters
>>>> specific to the Japanese locale),
>>>> - "=ar-TN=" (for the spellout number formatter in Arabic as spoken in
>>>> Tunisia, when words cannot be used and the Tunisian Arabic rules should be
>>>> used, which differ from standard Arabic [ar], as they use Latin digits
>>>> instead of Arabic digits: it would still use the separators and groupings
>>>> specified for the Tunisian Arabic locale, which also do not use the
>>>> Arabic comma).
>>>>
>>>> In that case, the standard way to designate another number system
>>>> (without reference to a specific language) should use the Unicode locale
>>>> tags for number systems, but without any leading language subtags (i.e.
>>>> "=-u-nu-native=", instead of just "=-native="), as number formatting
>>>> rules are not expected in most cases to replace the language itself, just
>>>> the number system. This is the reason for using the leading minus
>>>> for such usage (but we could also replace only the region code, such as
>>>> "=-CN=", or only the script code, such as "=-Bopo="). This is different
>>>> from using "=und-CN=" or "=und-Bopo=", because we don't want to replace
>>>> the language with an undetermined language, which would use only default
>>>> digits, default grouping separators and default grouping formats instead
>>>> of keeping those of the current locale.
>>>>
>>>>
>>>> -- Philippe.
>>>>
>>>>
>>>> _______________________________________________
>>>> CLDR-Users mailing list
>>>> CLDR-Users at unicode.org
>>>> http://unicode.org/mailman/listinfo/cldr-users
>>>>
>>>>
>>>