Alternate formatting data for algorithmic number systems when they fall back to a decimal system

Wed Apr 15 08:42:35 CDT 2015

For now, the CLDR data for algorithmic number systems use RBNF rules
wherever possible. When these rules cannot format a number, the last
mapping falls back to a specific decimal format (a pattern starting with
"0" or "#").

One problem is that this decimal format is the same regardless of the
actual locale (the language, or the number style in that language) for
which the number system has been mapped.

Different locales using the same number system in fact have different rules
for formatting numbers when they are forced to fall back to a decimal
system.

These fallbacks are currently typically specified as the substitution
"=#,##0.00=", which is clearly wrong (e.g. for Traditional Tamil): such a
format in fact assumes a specific language, and it is not the same for all
locales using this number system.

I propose deprecating these mappings and instead setting them to the
substitution "==", meaning that the number is formatted with the decimal
system used as the fallback, in that system's own format.

Note that when locale resolution mechanisms are used to find the
appropriate number system for formatting numbers, a careless implementation
will map the locale back to the same traditional algorithmic system, and so
recurse infinitely. To avoid this:

- the "==" substitution must look for a mapping for the locale in the
*default* decimal number variant,

- but it could also map to the "native" decimal number variant mapped for
that locale (replacing the "traditional" variant, which is algorithmic) by
using the substitution "=-native=", so that native digits are still used
(instead of just the Latin digits, when these locales use the Latin digits
by default and not the native ones).

With this proposal, the CLDR data for number systems would no longer
contain any data using "=#...=" or "=0...=" substitutions; the traditional
systems would still be able to format all numbers even those they do not
support internally, using the native digits, and the appropriate separators
(decimal, grouping), and appropriate grouping.

One way to implement this, however, does not require changing the CLDR
data: the implementation can autodetect the "=#...=" or "=0...="
substitution rules found in algorithmic number systems and consider them
all equivalent to just "==". It would first try to map the locale to a
"native" decimal variant and use it (note that the "native" variant already
has fallbacks for all locales to the default decimal variant: this is the
case for most non-Indian locales, since the Indian locales are the only
ones to have "native" mappings).
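
As a rough illustration of this autodetection, here is a minimal Python
sketch. The function name and the regular expression are my own
assumptions, not anything from CLDR or ICU; it only rewrites the text of a
rule and treats any "=...=" token whose body starts with "0" or "#" as a
legacy decimal format.

```python
import re

# Hypothetical helper (not a CLDR or ICU API): matches a whole "=...="
# substitution token whose body starts with "0" or "#", i.e. a decimal
# format pattern such as "=#,##0.00=".
_LEGACY_DECIMAL = re.compile(r"=[0#][^=]*=")


def normalize_legacy_substitutions(rule_text: str) -> str:
    """Rewrite every "=#...=" or "=0...=" substitution as a bare "=="."""
    return _LEGACY_DECIMAL.sub("==", rule_text)
```

Substitutions that name a ruleset, such as "=%spellout-numbering=", are
left untouched, so the data files themselves would not need to change.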

In summary, the resolution for algorithmic systems would follow this
path:
- use the "traditional" rules if they work (they use the RBNF data)
- on reaching a "==" substitution (or any "=0...=" or "=#...="
substitution), find the decimal number system in the "native" variant,
format numbers in that system, and use the appropriate separators and
groupings
- if there is no "native" variant mapped for that locale, fall back to the
default system (the CLDR data charts show that this is the case, because
there is an entry mapping "All other locales" to the Latin number system),
which will also use the same separators and groupings.
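
The path above could be sketched as follows in Python. Everything here is
a toy under stated assumptions: the tables hold a few illustrative entries
rather than real CLDR data, `to_roman` stands in for an RBNF-driven
"traditional" formatter, all names are hypothetical, and separators and
grouping are left out so that only the choice of digits is shown. It
handles non-negative integers only.

```python
from typing import Callable, Optional

# Illustrative tables only (assumed values, not loaded from CLDR):
# code point of the digit zero for a few decimal number systems.
ZERO_DIGIT = {"latn": 0x0030, "deva": 0x0966, "tamldec": 0x0BE6}
# Locales that have a "native" decimal system mapped.
NATIVE_SYSTEM = {"hi": "deva", "ta": "tamldec"}
# Entry playing the role of "All other locales".
DEFAULT_SYSTEM = "latn"

_ROMAN = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"),
          (90, "XC"), (50, "L"), (40, "XL"), (10, "X"), (9, "IX"),
          (5, "V"), (4, "IV"), (1, "I")]


def to_roman(n: int) -> Optional[str]:
    """Toy algorithmic system: returns None for numbers it cannot format."""
    if not 1 <= n <= 3999:
        return None
    parts = []
    for value, symbol in _ROMAN:
        while n >= value:
            parts.append(symbol)
            n -= value
    return "".join(parts)


def fallback_system(locale: str) -> str:
    """Pick the "native" decimal system if mapped, else the default one."""
    return NATIVE_SYSTEM.get(locale, DEFAULT_SYSTEM)


def format_number(n: int, locale: str,
                  traditional: Callable[[int], Optional[str]] = to_roman) -> str:
    """Try the traditional rules first, then the decimal fallback."""
    text = traditional(n)
    if text is not None:
        return text
    # The fallback is always a decimal system, never the algorithmic one,
    # so the resolution cannot loop back onto itself.
    zero = ZERO_DIGIT[fallback_system(locale)]
    return "".join(chr(zero + int(d)) for d in str(n))
```

With these tables, `format_number(2015, "fr")` uses the Roman rules, while
5000 is out of their range and is written with Latin digits for "fr" and
with Devanagari digits for "hi".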

This will be a major improvement for number systems used in lots of
languages (including Latin-written languages) such as the "roman" number
system.

One more note:

East Asian locales using traditional scripts prefer to use their own
algorithmic system, which cannot format all numbers. As numbers are rendered
in sinographic squares, the fallback "native" digits should use the
"fullwidth" variant: this can be specified using "=-native=" or, more
specifically, "=-fullwidth=".

Note that for now no "==" substitution rule can start with a minus sign
("-"); its content must be one of:
- a valid ruleset name (starting with % or %%), or
- a decimal format (starting with "0" or "#", which I want to deprecate), or
- empty (but the current implementation in ICU creates an infinite loop, or
only uses Basic Latin decimal digits in a fixed number format, independent
of the locale)

So there is absolutely no conflict when we use a "==" substitution rule
starting with a minus sign ("-") to mean that it should use another
specified number system (such as "native" or "fullwidth" or any specific
non-algorithmic number system), which is named just after this minus sign.
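
To show why the forms cannot collide, here is a small Python sketch that
classifies the text found between the two "=" signs by its first
character. The function name and the returned labels are hypothetical and
only serve this illustration.

```python
from typing import Optional, Tuple


def classify_substitution(body: str) -> Tuple[str, Optional[str]]:
    """Classify the body of a "=...=" substitution by its first character."""
    if body == "":
        return ("empty", None)
    if body.startswith("%"):
        # Ruleset name, starting with "%" or "%%".
        return ("ruleset", body)
    if body[0] in "0#":
        # Decimal format pattern (the form proposed for deprecation).
        return ("decimal-format", body)
    if body.startswith("-"):
        # Proposed new form: the number system is named after the minus.
        return ("number-system", body[1:])
    raise ValueError(f"unrecognized substitution body: {body!r}")
```

Each branch tests a different leading character, so adding the "-" form
does not change how any existing substitution is read.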

----

Alternatively, the standard code of a locale (starting with a letter 'a' to
'z') could be used in these "==" substitutions, for example:
- "=ja=" (it would be used only for spellout number formatters specific to
the Japanese locale),
- "=ar-TN=" (for a spellout number formatter in Arabic as spoken in
Tunisia, when words cannot be used and the Tunisian Arabic rules should
apply; these differ from standard Arabic [ar] in using Latin digits instead
of Arabic digits, and the format would still use the separators and
groupings specified for the Tunisian Arabic locale, which also do not use
the Arabic comma)

In that case, the standard way to designate another number system (without
reference to a specific language) should use the Unicode locale tags for
number systems, but without any leading language subtags (i.e.
"=-u-nu-native=" instead of just "=-native="), as number formatting rules
are not expected in most cases to replace the language itself, just the
number system. This is the reason for using the leading minus for such
usage (but we could also replace only the region code, such as "=-CN=", or
only the script code, such as "=-Bopo="). This is different from using
"=und-CN=" or "=und-Bopo=", because we don't want to replace the language
with an undetermined language, which would use only default digits, default
grouping separators and default grouping formats instead of keeping those
of the current locale.
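
A possible way to tell these override forms apart is sketched below in
Python. It is an assumption-laden illustration, not an existing API: the
function name and labels are hypothetical, the number system is read from
the BCP 47 Unicode extension key "nu", and the caller is assumed to have
already handled empty bodies, ruleset names and decimal formats.

```python
import re
from typing import Tuple


def classify_override(body: str) -> Tuple[str, str]:
    """Classify a locale-style override found inside a "=...=" substitution."""
    if not body.startswith("-"):
        # Full locale code such as "ja" or "ar-TN": the language is replaced.
        return ("locale", body)
    rest = body[1:]
    match = re.fullmatch(r"u-nu-([a-z0-9]{3,8})", rest)
    if match:
        # Unicode extension form: only the number system is replaced.
        return ("number-system", match.group(1))
    if re.fullmatch(r"[A-Z]{2}|[0-9]{3}", rest):
        # Region subtag alone, e.g. "-CN".
        return ("region", rest)
    if re.fullmatch(r"[A-Z][a-z]{3}", rest):
        # Script subtag alone, e.g. "-Bopo".
        return ("script", rest)
    # Short form proposed earlier, e.g. "-native" or "-fullwidth".
    return ("number-system", rest)
```

In every "-" case the current language is kept, which is the point of not
writing "und" in front of the subtag.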

-- Philippe.