alternate formatting data for algorithmic number systems when they fall back to a decimal system

Cameron Dutro cameron at lumoslabs.com
Wed Apr 15 12:39:17 CDT 2015


Hey Philippe,

My understanding is that the implementer should just use the number system
for the given locale. ICU actually lets you specify the number system, see
the docs here:
http://www.icu-project.org/apiref/icu4c/classRuleBasedNumberFormat.html
(see specifically the icu::RuleBasedNumberFormat::RuleBasedNumberFormat
constructor). I understand from your email that converting to a different
number system isn't always as straightforward as a 1:1 text replace, but I
believe the current CLDR number formatting rules handle these cases, yes?
I've noticed that ICU at least formats numbers in RBNF rules using the
correct numbering system for the locale.

-Cameron

On Wed, Apr 15, 2015 at 6:42 AM, Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> For now, the CLDR data for algorithmic number systems uses RBNF rules
> where possible; when that does not work, the last-resort mapping is a
> specific decimal format (starting with 0 or #).
>
> One problem is that this decimal format is the same independently of the
> actual locale (language or number style in that language) for which the
> number system has been mapped.
>
> Different locales using the same number system have in fact different
> rules for formatting numbers when they are forced to use a fallback to a
> decimal system.
>
> These fallbacks are typically currently specified as the substitution
> "=#,##0.00=", which is clearly wrong (e.g. for Traditional Tamil): these
> formats are assuming in fact a specific language, and it is not the same
> for all locales using this number system.
>
> I propose deprecating these mappings and instead setting them to the
> substitution "==", meaning that the number is formatted with the decimal
> system that the locale falls back to.
>
> Note that if locale resolution is used naively to find the number system
> for formatting, it will map the locale back to the same traditional
> algorithmic system, so this would recurse infinitely. To avoid that:
>
> - the "==" substitution must look for a mapping for the locale in the
> *default* decimal number variant,
>
> - but it could also map to the "native" decimal number variant mapped for
> that locale (replacing the "traditional" variant, which is algorithmic)
> using the substitution "=-native=", so that native digits will still be
> used (instead of just the Latin digits, for locales that use the Latin
> digits by default rather than the native ones).
>
> With this proposal, the CLDR data for number systems would no longer
> contain any data using "=#...=" or "=0...=" substitutions; the traditional
> systems would still be able to format all numbers even those they do not
> support internally, using the native digits, and the appropriate separators
> (decimal, grouping), and appropriate grouping.
>
> One way to implement it however does not require changing the CLDR data:
> the implementation can autodetect the "=#...=" or "=0...=" substitution rules
> found in algorithmic number systems, consider them all equivalent to just
> "==": it would first try to map the locale to a "native" decimal variant,
> and use it (note that the "native" variant already has fallbacks for all
> locales to the default decimal variant: this applies to most non-Indian
> locales, the Indian ones being nearly alone in having "native" mappings).
>
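A rough sketch of that autodetection in Python, assuming the rule bodies are available as plain strings. The function name and the regular expression are illustrative only; nothing here is taken from the CLDR tooling or from ICU.

```python
import re

# Matches a "=...=" substitution whose body is a decimal format pattern,
# i.e. one that starts with "#" or "0" (for example "=#,##0.00=").
_DECIMAL_SUBSTITUTION = re.compile(r"=[#0][^=]*=")


def normalize_fallbacks(rule_body: str) -> str:
    """Rewrite every decimal-format substitution in rule_body as "=="."""
    return _DECIMAL_SUBSTITUTION.sub("==", rule_body)
```

Substitutions that name a ruleset (starting with "%") are left untouched, so only the hard-coded decimal fallbacks are affected.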
> In summary the resolution for algorithmic systems would use the following
> path:
> - use "traditional" rules if it works (it uses the RBNF data)
> - when it finds a "==" substitution (or any "=0...=" or "=#...="
> substitution), find the decimal number system in the "native" variant, and
> format numbers in that system, and use the appropriate separators and
> groupings
> - if there's no "native" variant mapped for that locale, it will fall back
> to the default system (in the CLDR data charts, we see that this is the
> case because there's an entry mapping "All other locales" to the Latin
> number system), which will also use the same separators and groupings.
>
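The resolution path above could be sketched as follows in Python. This is only an illustration under stated assumptions: the `traditional` callback stands in for the RBNF rules and returns None when it reaches a fallback substitution, the two tables hold sample data rather than real CLDR mappings, and separators and grouping are left out entirely.

```python
from typing import Callable, Dict, Optional, Tuple

# Sample data only: code point of the zero digit in a few decimal systems.
ZERO_DIGIT: Dict[str, int] = {"latn": 0x0030, "tamldec": 0x0BE6, "deva": 0x0966}

# Sample data only: the "native" decimal system mapped for a locale.
NATIVE_SYSTEM: Dict[str, str] = {"ta": "tamldec", "hi": "deva"}

# Stands in for the "All other locales" entry.
DEFAULT_SYSTEM = "latn"


def to_digits(number: int, system: str) -> str:
    """Render number with the digits of the given decimal system."""
    zero = ZERO_DIGIT[system]
    return "".join(chr(zero + int(d)) if d.isdigit() else d for d in str(number))


def format_number(
    locale: str, number: int, traditional: Callable[[int], Optional[str]]
) -> Tuple[str, str]:
    """Return (system used, formatted text) following the three steps."""
    # Step 1: use the "traditional" rules if they can handle the number.
    text = traditional(number)
    if text is not None:
        return "traditional", text
    # Steps 2 and 3: use the "native" decimal system if one is mapped for
    # the locale, otherwise fall back to the default system.
    system = NATIVE_SYSTEM.get(locale, DEFAULT_SYSTEM)
    return system, to_digits(number, system)
```

Because the fallback never goes back through the algorithmic system, the infinite recursion mentioned earlier cannot occur in this sketch.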
> This will be a major improvement for number systems used in lots of
> languages (including Latin-written languages) such as the "roman" number
> system.
>
> One more note:
>
> East Asian locales written in traditional scripts prefer to use their own
> algorithmic system, which cannot format all numbers. As they are rendered
> using sinographic squares, the fallback "native" digits should use the
> "fullwidth" variant: this can be specified using "=-native=" or, more
> specifically, "=-fullwidth=".
>
> Note that for now no "==" substitution rule can start with a minus sign
> ("-"); it must only be:
> - a valid ruleset name (starting with % or %%), or
> - a decimal format (starting with "0" or "#", which I want to deprecate), or
> - empty (but the current implementation in ICU creates an infinite loop,
> or only uses Basic Latin decimal digits in a fixed number format,
> independent of the locale)
>
> So there is absolutely no conflict when we use a "==" substitution rule
> starting with a minus (-) to mean that it should use another specified number
> system (such as "native" or "fullwidth" or any specific non-algorithmic
> number system) which is named just after this minus sign.
>
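To illustrate why the leading minus is unambiguous, here is a small Python sketch that classifies the text found between the two "=" signs. The function and the category names are hypothetical; the "number-system" branch is the proposed extension, not something current implementations accept.

```python
def classify_substitution(body: str) -> str:
    """Classify the text between the two "=" signs of a substitution."""
    if body == "":
        return "empty"
    if body.startswith("%"):
        # Covers both public ("%name") and private ("%%name") rulesets.
        return "ruleset"
    if body[0] in "#0":
        return "decimal-format"
    if body.startswith("-"):
        # Proposed form: the number system is named right after the minus.
        return "number-system"
    raise ValueError(f"unrecognized substitution body: {body!r}")
```

Each category is identified by its first character alone, so adding the "-" case cannot change how any existing rule is read.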
> ----
>
> Alternatively, the standard code of a locale (starting with a letter 'a' to
> 'z') could be used in these "==" substitutions, for example:
> - "=ja=" (it would be used only for spellout number formatters specific
> to the Japanese locale),
> - "=ar-TN=" (for a spellout number formatter in Arabic as spoken in Tunisia,
> when words cannot be used and the Tunisian Arabic rules should be used,
> which differ from standard Arabic [ar] in using Latin digits
> instead of Arabic digits: it would still use the separators and groupings
> specified for the Tunisian Arabic locale, which also do not use the
> Arabic comma)
>
> In that case, the standard way to designate another number system (without
> reference to a specific language) should use the Unicode locale tags for
> number systems, but without any leading language subtags (i.e.
> "=-u-nu-native=" instead of just "=-native="), as number formatting rules
> are not expected in most cases to replace the language itself, just the
> number system: this is the reason for using the leading minus for such
> usage (but we could also replace the region code only, such as "=-CN=", or
> the script code only, such as "=-Bopo="). This is different from using
> "=und-CN=" or "=und-Bopo=" because we don't want to replace the language
> with an undetermined language, which would use only default digits,
> default grouping separators and default grouping formats instead of
> keeping those of the current locale.
>
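A minimal Python sketch of how such overrides might be applied to the current locale. It is only an illustration of the proposal: the dictionary layout and the function name are made up, the numbering-system key is written `nu` as in the Unicode locale extension, and only the subtag shapes mentioned above are handled.

```python
from typing import Dict, Optional

Locale = Dict[str, Optional[str]]


def apply_override(current: Locale, body: str) -> Locale:
    """Apply a substitution body such as "-u-nu-native", "-CN" or "ar-TN"."""
    if body.startswith("-"):
        # Leading minus: keep the current locale, override only named parts.
        result = dict(current)
        subtags = body[1:].split("-")
    else:
        # Full locale code: the language itself is replaced.
        subtags = body.split("-")
        result = {"language": subtags[0], "script": None,
                  "region": None, "numbers": None}
        subtags = subtags[1:]
    i = 0
    while i < len(subtags):
        tag = subtags[i]
        if tag == "u" and subtags[i + 1:i + 2] == ["nu"] and i + 2 < len(subtags):
            result["numbers"] = subtags[i + 2]
            i += 3
        elif len(tag) == 4 and tag.isalpha():
            result["script"] = tag
            i += 1
        elif (len(tag) == 2 and tag.isalpha()) or (len(tag) == 3 and tag.isdigit()):
            result["region"] = tag
            i += 1
        else:
            raise ValueError(f"unsupported subtag: {tag!r}")
    return result
```

With the leading minus the language, and therefore its separators and groupings, stays untouched; a body starting with "und" would go through the second branch and discard it, which is the difference described above.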
>
> -- Philippe.