From emmo at us.ibm.com  Wed Apr  1 08:00:00 2015
From: emmo at us.ibm.com (John Emmons)
Date: Wed, 1 Apr 2015 08:00:00 -0500
Subject: "no inheritance marker"
In-Reply-To: <551B35C7.5000500@oracle.com>
References: <551B35C7.5000500@oracle.com>
Message-ID: <OF8FBEF254.7325FB09-ON86257E1A.0010A1F1-86257E1A.004769CE@us.ibm.com>

In the case of your example (en_GB for America/Los_Angeles), you hit the
no-inheritance marker, which means there is no recognized short
abbreviation for the metazone in this locale.

So per the LDML specification, the value should default to the localized
GMT format (i.e. "GMT-08:00" during standard time, or "GMT-07:00" during
daylight saving time).
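
The lookup described above can be sketched roughly as follows. This is
only an illustration under stated assumptions: the data tables, the parent
chain and the function names are hypothetical stand-ins for real CLDR data,
and the GMT pattern is hard-coded in its English form.

```python
# Three U+2205 EMPTY SET characters: the LDML "no inheritance marker".
NO_INHERITANCE_MARKER = "\u2205\u2205\u2205"

# Hypothetical, hand-written sample data; real values live in the CLDR files.
SHORT_METAZONE_NAMES = {
    "en": {"America_Pacific": {"generic": "PT", "standard": "PST", "daylight": "PDT"}},
    "en_001": {"America_Pacific": {"generic": NO_INHERITANCE_MARKER,
                                   "standard": NO_INHERITANCE_MARKER,
                                   "daylight": NO_INHERITANCE_MARKER}},
    "en_GB": {},
}
PARENTS = {"en_GB": "en_001", "en_001": "en", "en": "root"}


def localized_gmt(offset_minutes):
    """Format an offset such as -480 as "GMT-08:00" (English pattern assumed)."""
    if offset_minutes == 0:
        return "GMT"
    sign = "+" if offset_minutes > 0 else "-"
    hours, minutes = divmod(abs(offset_minutes), 60)
    return f"GMT{sign}{hours:02d}:{minutes:02d}"


def short_zone_name(locale, metazone, kind, offset_minutes):
    """Walk the parent chain; stop inheriting as soon as the marker is found."""
    current = locale
    while current is not None:
        value = SHORT_METAZONE_NAMES.get(current, {}).get(metazone, {}).get(kind)
        if value == NO_INHERITANCE_MARKER:
            break  # inheritance explicitly disabled: use the GMT format
        if value is not None:
            return value
        current = PARENTS.get(current)
    return localized_gmt(offset_minutes)


print(short_zone_name("en_GB", "America_Pacific", "standard", -480))
print(short_zone_name("en", "America_Pacific", "standard", -480))
```

With this sample data the en_GB lookup stops at the marker in en_001 and
falls back to the GMT format, while the same lookup in en still finds the
abbreviation.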


Regards,

John C. Emmons
Globalization Architect & Unicode CLDR TC Chairman
IBM Software Group
Internet: emmo at us.ibm.com


From:   Naoto Sato <naoto.sato at oracle.com>
To:     cldr-users at unicode.org
Date:   03/31/2015 07:06 PM
Subject:        "no inheritance marker"
Sent by:        "CLDR-Users" <cldr-users-bounces at unicode.org>


Hello,

I have a question on this "no inheritance marker", used in the short 
form of time zone "metazone" names. In LDML spec, it reads:

---
If a given short metazone form is known NOT to be understood in a given 
locale and the parent locale has this value such that it would normally 
be inherited, the inheritance of this value can be explicitly disabled 
by use of the 'no inheritance marker' as the value, which is 3 
simultaneous empty set characters ( U+2205 ). [1]
---

So if an app tries to display the short names with this marker, what 
should they actually be?

For example, in case of "en_GB" locale, lookup for "America_Pacific" 
short names ends up with this "U+2205U+2205U+2205" marker in "en_001" 
locale, which disables inheriting "PT"/"PST"/"PDT" in "en".

Naoto

[1] 
http://www.unicode.org/reports/tr35/tr35-39/tr35-dates.html#Metazone_Names
_______________________________________________
CLDR-Users mailing list
CLDR-Users at unicode.org
http://unicode.org/mailman/listinfo/cldr-users



From naoto.sato at oracle.com  Wed Apr  1 10:27:18 2015
From: naoto.sato at oracle.com (Naoto Sato)
Date: Wed, 01 Apr 2015 08:27:18 -0700
Subject: "no inheritance marker"
In-Reply-To: <OF8FBEF254.7325FB09-ON86257E1A.0010A1F1-86257E1A.004769CE@us.ibm.com>
References: <551B35C7.5000500@oracle.com>
 <OF8FBEF254.7325FB09-ON86257E1A.0010A1F1-86257E1A.004769CE@us.ibm.com>
Message-ID: <551C0E56.5050509@oracle.com>

Thanks, John. That makes sense.

Naoto


From rick at unicode.org  Thu Apr  2 14:36:06 2015
From: rick at unicode.org (Rick McGowan)
Date: Thu, 02 Apr 2015 12:36:06 -0700
Subject: CLDR 27.0.1 Maintenance Release
Message-ID: <551D9A26.2070107@unicode.org>

Hello everyone,

Unicode CLDR 27.0.1 is a very small maintenance release that is intended 
to fix some specific problems that were found shortly after CLDR 27 was 
published. If you have already downloaded version 27 and are not 
impacted by any of the specific issues mentioned in the release note, 
then there is no specific need to upgrade from 27 to 27.0.1.  All data 
in common/main is identical between version 27 and version 27.0.1.

Further information can be found on the release page:

http://cldr.unicode.org/index/downloads/cldr-27#27-0-1

Note: this was finalized late on March 31, but rather than announce on 
April Fool's day we waited overnight... :-)


From markus.icu at gmail.com  Fri Apr  3 15:59:50 2015
From: markus.icu at gmail.com (Markus Scherer)
Date: Fri, 3 Apr 2015 13:59:50 -0700
Subject: CLDR proposal: Move collator CLDR settings into ICU format
Message-ID: <CAN49p6oUDbykNvZChsebqUcNvUHU2+MYNQAJ1D-x0MEvRx6b4A@mail.gmail.com>

Dear CLDR team & users,

I would like to propose the following spec & data changes for CLDR 28.
Please provide *feedback by next Thursday, 2015-apr-09*.
CLDR ticket: http://unicode.org/cldr/trac/ticket/8289

Proposal:
- Deprecate XML elements under <collation>:
    import, settings, suppress_contractions, optimize
  together with their specific attributes
- Change the CLDR collation tailorings data to
  replace the use of these XML elements with equivalent ICU syntax

For example:

<settings caseFirst="upper"/>
<import source="da" type="standard"/>
<suppress_contractions>[?-? ?-? ? ? ? ? ?]</suppress_contractions>
<settings normalization="on" alternate="shifted" reorder="Thai"/>

->

[caseFirst upper]
[import da-u-co-standard]
[suppressContractions [?-? ?-? ? ? ? ? ?]]
[normalization on][alternate shifted][reorder Thai]
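
As an illustration of the mapping shown in the example above, here is a
minimal sketch of such a conversion. It is not the CLDR tooling: the
function name is hypothetical, only the four elements named in the proposal
are handled, and the `[a-c]` set in the usage line is a placeholder.

```python
import xml.etree.ElementTree as ET


def collation_xml_to_icu(fragment):
    """Convert the child elements of <collation> into ICU-syntax lines."""
    root = ET.fromstring(f"<collation>{fragment}</collation>")
    lines = []
    for elem in root:
        if elem.tag == "settings":
            # Each attribute becomes its own bracketed setting, in document order.
            lines.append("".join(f"[{name} {value}]"
                                 for name, value in elem.attrib.items()))
        elif elem.tag == "import":
            # Assumption: the source locale only needs "_" replaced by "-".
            tag = elem.get("source").replace("_", "-")
            if elem.get("type"):
                tag += "-u-co-" + elem.get("type")
            lines.append(f"[import {tag}]")
        elif elem.tag == "suppress_contractions":
            lines.append(f"[suppressContractions {elem.text}]")
        elif elem.tag == "optimize":
            lines.append(f"[optimize {elem.text}]")
        else:
            raise ValueError(f"unsupported element: {elem.tag}")
    return lines


sample = ('<settings caseFirst="upper"/>'
          '<import source="da" type="standard"/>'
          '<suppress_contractions>[a-c]</suppress_contractions>'
          '<settings normalization="on" alternate="shifted" reorder="Thai"/>')
for line in collation_xml_to_icu(sample):
    print(line)
```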

Rationale:

The LDML collation spec
<http://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Element>
provides for two ways for parametric settings and special rules in
collation tailoring data: via special XML elements, or as part of the ICU
syntax rules in <cr><![CDATA[...]]></cr>. See the elements from import
through optimize in the following line copied from the spec:

<!ELEMENT collation (alias | ( import*, settings?, suppress_contractions?,
optimize?, cr*, special*)) >

Two ways of doing the same thing lead to inconsistencies.

CLDR tools and tests would not have to convert these elements to ICU syntax
any more.

The spec would be simpler.

This change makes it clearer that the settings get *import*ed too, not just
the rules.

Note that CLDR 24
<http://www.unicode.org/reports/tr35/tr35-33/tr35.html#Modifications>
deprecated the XML syntax for rules and replaced the XML syntax rules data
with equivalent ICU syntax rules.

Sincerely,
markus

From mark at macchiato.com  Sat Apr  4 01:05:07 2015
From: mark at macchiato.com (Mark Davis ☕)
Date: Sat, 4 Apr 2015 08:05:07 +0200
Subject: CLDR proposal: Move collator CLDR settings into ICU format
In-Reply-To: <CAN49p6oUDbykNvZChsebqUcNvUHU2+MYNQAJ1D-x0MEvRx6b4A@mail.gmail.com>
References: <CAN49p6oUDbykNvZChsebqUcNvUHU2+MYNQAJ1D-x0MEvRx6b4A@mail.gmail.com>
Message-ID: <CAJ2xs_GkoqPTSiGzVCPrz_DtNOvpMa_Ev5BnqOeQV4xfde01eg@mail.gmail.com>

I'm strongly in favor of these changes.


Mark <https://google.com/+MarkDavis>

*Il meglio è l'inimico del bene*


From verdy_p at wanadoo.fr  Sat Apr  4 02:05:56 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sat, 4 Apr 2015 09:05:56 +0200
Subject: CLDR proposal: Move collator CLDR settings into ICU format
In-Reply-To: <CAJ2xs_GkoqPTSiGzVCPrz_DtNOvpMa_Ev5BnqOeQV4xfde01eg@mail.gmail.com>
References: <CAN49p6oUDbykNvZChsebqUcNvUHU2+MYNQAJ1D-x0MEvRx6b4A@mail.gmail.com>
 <CAJ2xs_GkoqPTSiGzVCPrz_DtNOvpMa_Ev5BnqOeQV4xfde01eg@mail.gmail.com>
Message-ID: <CAGa7JC3z0epeXnt=KMjhfgGCd1RPwHbop4uz96VKjRVO=DjvZw@mail.gmail.com>

Maybe there's a way to use (or create) a converter tool that will
automatically generate an equivalent XML version for at least some versions
(to allow a transition period).
These generated files would be explicitly marked as "derived" (so that they
are no longer directly supported as references, only provided for
information).

Or put the sources of such a conversion tool in an open-source repository
(it should compile at least on Linux, possibly on Windows too, or be
written in a portable and widely used language available across platforms,
such as JavaScript or Java).

This open-source tool does not need to be optimized (this is a one-shot
conversion). It should be demonstrative, so its sources should remain as
simple as possible, without lots of dependencies on external libraries or
APIs and without complex data structures. In fact this source can be a
useful informative companion to the specifications (often it is just
simpler and faster to look at the sources than to decipher natural English
text and the ambiguities that occur too easily in it).

But this source can also give programming hints to implementers about how
to parse the reference data correctly for their applications, even if in
fact they will use another appropriate internal format for better
performance at runtime: collation is a critical functionality in
applications where performance is highly desired, in order to manage large
volumes of text efficiently, for example in plain-text searches or when
sorting query result sets, so they do not in fact run parsers repeatedly
over the ICU public syntax or the XML syntax internally.



From verdy_p at wanadoo.fr  Wed Apr 15 08:42:35 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 15 Apr 2015 15:42:35 +0200
Subject: alternate formatting data for algorithmic number systems when they
 fallback to a decimal system
Message-ID: <CAGa7JC3WrYYU4hEftbC+3Fd0u9xLpOE_59Nk5oGyLHhS3mCwmg@mail.gmail.com>

Currently the CLDR data for algorithmic number systems uses RBNF rules
where possible, but when these do not work the last mapping is to use a
specific decimal format (starting with 0 or #).

One problem is that this decimal format is the same regardless of the
actual locale (language or number style in that language) for which the
number system has been mapped.

Different locales using the same number system in fact have different rules
for formatting numbers when they are forced to fall back to a decimal
system.

These fallbacks are currently typically specified as the substitution
"=#,##0.00=", which is clearly wrong (e.g. for Traditional Tamil): these
formats in fact assume a specific language, and the format is not the same
for all locales using this number system.

I propose deprecating these mappings and instead just setting them to the
substitution "==", meaning that the number is formatted with the format of
the decimal system used as the fallback.

Note that when using locale resolution mechanisms to find the appropriate
number system for formatting numbers, the lookup will (if you are not
careful) map again to the same traditional algorithmic system, so this
would recurse infinitely:

- the "==" substitution must look for a mapping for the locale in the
*default* decimal number variant,

- but it could also map to the "native" decimal number variant mapped for
that locale (replacing the "traditional" variant, which is algorithmic),
using the substitution "=-native=", so that native digits will still be
used (instead of just the Latin digits, when these locales use the Latin
digits by default and not the native ones).

With this proposal, the CLDR data for number systems would no longer
contain any data using "=#...=" or "=0...=" substitutions; the traditional
systems would still be able to format all numbers even those they do not
support internally, using the native digits, and the appropriate separators
(decimal, grouping), and appropriate grouping.

One way to implement it, however, does not require changing the CLDR data:
the implementation can autodetect the "=#...=" or "=0...=" substitution
rules found in algorithmic number systems and consider them all equivalent
to just "==". It would first try to map the locale to a "native" decimal
variant and use it (note that the "native" variant already has fallbacks
for all locales to the default decimal variant: this is what applies to
most locales other than the Indian ones, which are the only ones to have
their own "native" mappings).
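
The autodetection described above could look roughly like the following
sketch. It assumes the rule text is available as a plain string and that a
"=" never appears inside a decimal-format descriptor; the function name is
hypothetical and this is not how ICU parses rules.

```python
import re

# Matches "=...=" substitutions whose descriptor starts with "0" or "#",
# i.e. the hard-coded decimal formats discussed above.
DECIMAL_SUBSTITUTION = re.compile(r"=[0#][^=]*=")


def normalize_rule(rule_body):
    """Treat every decimal-format substitution as the bare "==" substitution."""
    return DECIMAL_SUBSTITUTION.sub("==", rule_body)


print(normalize_rule("=#,##0.00="))
print(normalize_rule("=%spellout-numbering="))
```

Substitutions that name a ruleset (starting with %) are left untouched, so
only the decimal-format fallbacks are rewritten.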

In summary, the resolution for algorithmic systems would use the following
path:
- use the "traditional" rules if they work (this uses the RBNF data)
- when it finds a "==" substitution (or any "=0...=" or "=#...="
substitution), find the decimal number system in the "native" variant,
format numbers in that system, and use the appropriate separators and
groupings
- if there's no "native" variant mapped for that locale, fall back to the
default system (in the CLDR data charts, we see that this is the case
because there's an entry mapping "All other locales" to the Latin number
system), which will also use the same separators and groupings.
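
The resolution path summarized above can be sketched as follows. The
mapping table, the set of algorithmic systems and the two formatter
callables are hypothetical placeholders for illustration, not CLDR or ICU
APIs.

```python
# Hypothetical sample of locale-to-number-system mappings; the real ones
# come from the CLDR locale data.
NUMBER_SYSTEMS = {
    "ta": {"default": "latn", "native": "tamldec", "traditional": "taml"},
    "en": {"default": "latn", "native": "latn"},
}

# Systems assumed (for this sketch) to be algorithmic, i.e. RBNF-driven.
ALGORITHMIC_SYSTEMS = {"taml", "roman", "romanlow"}


def fallback_decimal_system(locale):
    """Prefer the "native" decimal system, then the default one, then Latin."""
    systems = NUMBER_SYSTEMS.get(locale, {})
    for variant in ("native", "default"):
        candidate = systems.get(variant)
        if candidate is not None and candidate not in ALGORITHMIC_SYSTEMS:
            return candidate
    return "latn"


def format_number(value, locale, traditional_format, decimal_format):
    """Try the traditional rules first; otherwise use the decimal fallback.

    traditional_format(value) is assumed to return None when its rules end
    in a "==" (or decimal-format) substitution for this value.
    """
    text = traditional_format(value)
    if text is not None:
        return text
    return decimal_format(value, fallback_decimal_system(locale))


print(fallback_decimal_system("ta"))
```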

This will be a major improvement for number systems used in lots of
languages (including Latin-written languages) such as the "roman" number
system.

One more note:

East Asian locales using traditional scripts prefer to use their own
algorithmic system, which cannot format all numbers. As they are rendered
using sinographic squares, the fallback "native" digits should use the
"fullwidth" variant: this can be specified using "=-native=" or, more
specifically, "=-fullwidth=".

Note that for now no "==" substitution rule can start with a minus sign
("-"); it must only be:
- a valid ruleset name (starting with % or %%), or
- a decimal format (starting with "0" or "#", which I want to deprecate), or
- empty (but the current implementation in ICU creates an infinite loop, or
only uses Basic Latin decimal digits in a fixed number format, independent
of the locale)

So there is absolutely no conflict when we use a "==" substitution rule
starting with a minus sign (-) to mean that it should use another specified
number system (such as "native" or "fullwidth" or any specific
non-algorithmic number system), which is named just after this minus sign.
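
To show why the leading minus sign is unambiguous, here is a small sketch
that classifies the text between the two "=" signs by its first character.
The function name and the category labels are hypothetical, and the
"number-system" case is the proposed extension, not existing RBNF syntax.

```python
def classify_substitution(descriptor):
    """Classify the text found between the two "=" signs of a substitution."""
    if descriptor == "":
        return ("empty", "")
    first = descriptor[0]
    if first == "%":
        return ("ruleset", descriptor)          # "%name" or "%%name"
    if first in "0#":
        return ("decimal-format", descriptor)   # e.g. "#,##0.00"
    if first == "-":
        # Proposed: the number system is named right after the minus sign.
        return ("number-system", descriptor[1:])
    raise ValueError(f"unrecognized substitution descriptor: {descriptor!r}")


for text in ("%spellout-numbering", "#,##0.00", "", "-native"):
    print(classify_substitution(text))
```

Since the three existing forms start with "%", "0", "#" or are empty, a
descriptor starting with "-" cannot collide with any of them.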

----

Alternatively, the standard code of a locale (starting with a letter 'a' to
'z') could be used in these "==" substitutions, for example:
- "=ja=" (it would be used only for spellout number formatters specific to
the Japanese locale),
- "=ar-TN=" (for a spellout number formatter in Arabic as spoken in Tunisia,
when words cannot be used and the Tunisian Arabic rules should be used,
which differ from standard Arabic [ar] in using Latin digits instead of
Arabic digits: it would still use the separators and groupings specified
for the Tunisian Arabic locale, which also do not use the Arabic comma)

In that case, the standard way to designate another number system (without
reference to a specific language) should use the Unicode locale tags for
number systems, but without any leading language subtags (i.e.
"=-u-nu-native=" instead of just "=-native="), as number formatting rules
are not expected in most cases to replace the language itself, just the
number system. This is the reason for using the leading minus for such
usage (but we could also replace only the region code, such as "=-CN=", or
only the script code, such as "=-Bopo="). This is different from using
"=und-CN=" or "=und-Bopo=" because we don't want to replace the language
with an undetermined language, which would use only default digits, default
grouping separators and default grouping formats instead of keeping those
of the current locale.


-- Philippe.

From cameron at lumoslabs.com  Wed Apr 15 12:39:17 2015
From: cameron at lumoslabs.com (Cameron Dutro)
Date: Wed, 15 Apr 2015 10:39:17 -0700
Subject: alternate formatting data for algorithmic number systems when
 they fallback to a decimal system
In-Reply-To: <CAGa7JC3WrYYU4hEftbC+3Fd0u9xLpOE_59Nk5oGyLHhS3mCwmg@mail.gmail.com>
References: <CAGa7JC3WrYYU4hEftbC+3Fd0u9xLpOE_59Nk5oGyLHhS3mCwmg@mail.gmail.com>
Message-ID: <CAECedD8pq0wT8jD4JMUynUvOMoYXSkmKNj-qt8znqDZ4+HR4gQ@mail.gmail.com>

Hey Philippe,

My understanding is that the implementer should just use the number system
for the given locale. ICU actually lets you specify the number system, see
the docs here:
http://www.icu-project.org/apiref/icu4c/classRuleBasedNumberFormat.html
(see specifically the icu::RuleBasedNumberFormat::RuleBasedNumberFormat
constructor). I understand from your email that converting to a different
number system isn't always as straightforward as a 1:1 text replace, but I
believe the current CLDR number formatting rules handle these cases, yes?
I've noticed that ICU at least formats numbers in RBNF rules using the
correct numbering system for the locale.

-Cameron


From cameron at lumoslabs.com  Thu Apr 16 11:09:52 2015
From: cameron at lumoslabs.com (Cameron Dutro)
Date: Thu, 16 Apr 2015 09:09:52 -0700
Subject: Fwd: alternate formatting data for algorithmic number systems when
 they fallback to a decimal system
In-Reply-To: <CAECedD8F17eeNKmwqTN8BSzNozFUXWiPHyP0+T-gWR6zb6LKkw@mail.gmail.com>
References: <CAGa7JC3WrYYU4hEftbC+3Fd0u9xLpOE_59Nk5oGyLHhS3mCwmg@mail.gmail.com>
 <CAECedD8pq0wT8jD4JMUynUvOMoYXSkmKNj-qt8znqDZ4+HR4gQ@mail.gmail.com>
 <CAGa7JC1RjZ=aUEh9kM=9MUGbw=UXGmcG_mb45aHpFgMY3nUSqw@mail.gmail.com>
 <CAGa7JC1n17Og7vSK2zCMfeTo55xpuwCAq9MGiwv7_1R2HUFbGQ@mail.gmail.com>
 <CAECedD8F17eeNKmwqTN8BSzNozFUXWiPHyP0+T-gWR6zb6LKkw@mail.gmail.com>
Message-ID: <CAECedD_AN10boBfVPqx7qjWpQTb4EqhDcQHqZE+tm_PMYORtnQ@mail.gmail.com>

---------- Forwarded message ----------
From: Cameron Dutro <cameron at lumoslabs.com>
Date: Thu, Apr 16, 2015 at 9:09 AM
Subject: Re: alternate formatting data for algorithmic number systems when
they fallback to a decimal system
To: Philippe Verdy <verdy_p at wanadoo.fr>


Thank you for the clarification, Philippe. In my previous email I was not
necessarily trying to respond with approval or disapproval of your
proposal, but instead to understand the issue better. I am in no position
to effect any kind of change in CLDR or ICU. Having read your second and
third emails, I think I agree with you. I'd like to hear what Mark and
Markus have to say about this too, however.

-Cameron

On Thu, Apr 16, 2015 at 3:09 AM, Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> My proposal concerns in fact all types of number formatters currently
> supported in CLDR data and that could all be algorithmic:
> - number systems (cardinals),
> - ordinal,
> - year numbering,
> - month numbering,
> - day numbering,
> - century numbering (in French it uses the roman-lower system with
> ordinals),
> - millenium numbering (in French it uses the roman-upper system with
> ordinal),
> - accounting amounts,
> - currency amounts (displayed prices),
> - measurement with unit,
> - spellout using translated words for all the usages above...
>
> It also concerns number parsers, that are built to parse and accept all
> these formatted numbers using the same rulesets, plus a lenient parsing
> ruleset for accepting numbers not formatted this way (e.g. a "roman-lower"
> parser will typically contain lenient parsing rules for accepting all
> numbers formatted with a decimal system, as well as numbers formatted in
> "roman-upper")...
> On Apr 16, 2015 11:27, "Philippe Verdy" <verdy_p at wanadoo.fr> wrote:
>
>> No, ICU does NOT handle this case.
>>
>> When using a locale whose number system is algorithmic, yes, it uses that
>> system, as specified in CLDR data, and yes, it uses the associated RBNF
>> rulesets.
>>
>> But the problem is within these rulesets, when one of the rules specifies
>> a substitution which is neither another ruleset name nor an empty
>> substitution (such as == or << or >>) but a decimal format starting with 0
>> or #.
>>
>> In that case the decimal format is used blindly and does not use the
>> native decimal digits, the native separators, or the native grouping and
>> decimal formats of that locale.
>>
>> The problem is in fact in CLDR data, where the rule specifies a
>> substitution like this one in the "roman-lower" system:
>>
>> "5000: =##,##0="
>>
>> which should really be
>>
>> "5000: =="
>>
>> to ignore the specified decimal format and instead select an appropriate
>> decimal format for the locale in ANOTHER number system that is not
>> algorithmic but decimal. That system is searched by default first in the
>> "native" system when it is mapped for that locale (in CLDR data all locales
>> have a mapping of the effective number system to use for the "native"
>> number system alias; this is not the case for the "finance" or "traditio"
>> number system aliases), before the default number system for that locale
>> (in CLDR data, all locales have a decimal system mapped there, which is not
>> necessarily the modern Latin system but is formattable with ten digits and
>> standard separators and signs, which are still localized).
>>
>> In summary, you have still not understood why this is an issue not just
>> inside ICU but in fact in CLDR data itself, independently of the ICU
>> implementation. The problem is NOT:
>> - in the mapping of locales to their number systems in several variants
>> (default, native, traditio, finance) and possibly also aliased,
>> - in the mapping of number systems to a decimal or algorithmic type,
>> - in the definition of each algorithmic number system by a group of
>> rulesets including one which is public (not named with a %% prefix) and
>> designated as the main ruleset to use,
>> - in the definition of each ruleset with several rules, each rule being
>> keyed either by special rule type (proper fraction, improper fraction, or
>> master) or by value (an integer or fraction).
>>
>> The problem is in the definition of an individual RBNF rule, where it
>> uses a substitution to a decimal format starting with 0 or # (such a
>> substitution may be surrounded by == or << or >> to specify how to compute
>> the value to format): this is something that I propose to deprecate and
>> even remove completely from CLDR data, as it is clearly wrong or
>> insufficient: it bypasses the per-locale settings of the preferred decimal
>> system when the preferred algorithmic system is not used.
>>
>> However I maintain the role of == or << or >> to compute the value that
>> will be passed down to the decimal formatter.
>>
>> So your reply in fact gives absolutely no hint and even the link to the
>> ICU constructor is inappropriate for this issue (I know what it does, and I
>> had already inspected this code before sending my first email with the
>> proposal). You had clearly not understood the issue that I have just
>> reformulated here with more explicit details.
>> On Apr 15, 2015 19:39, "Cameron Dutro" <cameron at lumoslabs.com> wrote:
>>
>>> Hey Philippe,
>>>
>>> My understanding is that the implementer should just use the number
>>> system for the given locale. ICU actually lets you specify the number
>>> system, see the docs here:
>>> http://www.icu-project.org/apiref/icu4c/classRuleBasedNumberFormat.html
>>> (see specifically the icu::RuleBasedNumberFormat::RuleBasedNumberFormat
>>> constructor). I understand from your email that converting to a different
>>> number system isn't always as straightforward as a 1:1 text replace, but I
>>> believe the current CLDR number formatting rules handle these cases, yes?
>>> I've noticed that ICU at least formats numbers in RBNF rules using the
>>> correct numbering system for the locale.
>>>
>>> -Cameron
>>>
>>> On Wed, Apr 15, 2015 at 6:42 AM, Philippe Verdy <verdy_p at wanadoo.fr>
>>> wrote:
>>>
>>>> For now the CLDR data for algorithmic number systems use RBNF
>>>> rules when this is possible, but the last mapping, when this does not
>>>> work, is to use a specific decimal format (starting with 0 or #).
>>>>
>>>> One problem is that this decimal format is the same independently of
>>>> the actual locale (language or number style in that language) for which the
>>>> number system has been mapped.
>>>>
>>>> Different locales using the same number system have in fact different
>>>> rules for formatting numbers when they are forced to use a fallback to a
>>>> decimal system.
>>>>
>>>> These fallbacks are typically currently specified as the substitution
>>>> "=#,##0.00=", which is clearly wrong (e.g. for Traditional Tamil): these
>>>> formats in fact assume a specific language, and it is not the same
>>>> for all locales using this number system.
>>>>
>>>> I propose deprecating these mappings and instead just setting them to the
>>>> substitution "==", meaning that it will use the format for the decimal
>>>> system which will be used instead.
>>>>
>>>> Note that when using locale resolution mechanisms to find the
>>>> appropriate number system to use for formatting numbers, it will (if you
>>>> don't take care) map it again to the same traditional algorithmic
>>>> system, so this would recurse infinitely:
>>>>
>>>> - the "==" substitution must look for a mapping for the locale in the
>>>> *default* decimal number variant,
>>>>
>>>> - but it could also map to the "native" decimal number variant mapped
>>>> for that locale (replacing the "traditional" variant, which is
>>>> algorithmic), using the substitution "=-native=", so that native digits
>>>> will still be used (instead of just the Latin digits, when these locales
>>>> are using by default the Latin digits, and not the native ones).
>>>>
>>>> With this proposal, the CLDR data for number systems would no longer
>>>> contain any data using "=#...=" or "=0...=" substitutions; the traditional
>>>> systems would still be able to format all numbers even those they do not
>>>> support internally, using the native digits, and the appropriate separators
>>>> (decimal, grouping), and appropriate grouping.
>>>>
>>>> One way to implement it however does not require changing the CLDR
>>>> data: the implementation can autodetect the "=#...=" or "=0...="
>>>> substitution rules found in algorithmic number systems and consider them
>>>> all equivalent to just "==": it would first try to map the locale to a
>>>> "native" decimal variant, and use it (note that the "native" variant
>>>> already has fallbacks for all locales to use the default decimal variant:
>>>> this is the case for most non-Indian locales, the Indian locales being
>>>> almost alone in having "native" mappings).
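
The autodetection idea above can be sketched in a few lines of Python. This is only an illustration of the proposal, not ICU or CLDR code: the function name `normalize_rule` and the regular expression are assumptions, and the pattern deliberately handles only the "=...=" form mentioned in the paragraph (not "<<" or ">>" substitutions).

```python
import re

# Matches an "=...=" substitution whose descriptor is a decimal format,
# i.e. the text after the first "=" starts with "0" or "#", as in
# "=#,##0=" or "=0.00=". Simplification: assumes the descriptor itself
# contains no "=" character.
DECIMAL_SUBSTITUTION = re.compile(r"=[0#][^=]*=")


def normalize_rule(rule_body: str) -> str:
    """Treat every decimal-format substitution as the empty form "=="."""
    return DECIMAL_SUBSTITUTION.sub("==", rule_body)


print(normalize_rule("=##,##0="))              # ==
print(normalize_rule("=%spellout-numbering="))  # unchanged
```

Ruleset-name substitutions (starting with "%") and already-empty substitutions pass through untouched, so the CLDR data itself would not need to change for an implementation to experiment with this behaviour.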
>>>>
>>>> In summary the resolution for algorithmic systems would use the
>>>> following path:
>>>> - use "traditional" rules if it works (it uses the RBNF data)
>>>> - when it finds a "==" substitution (or any "=0...=" or "=#...="
>>>> substitution), find the decimal number system in the "native" variant, and
>>>> format numbers in that system, and use the appropriate separators and
>>>> groupings
>>>> - if there's no "native" variant mapped for that locale, it will
>>>> fall back to using the default system (in CLDR data charts, we see that
>>>> this is the case because there's an entry mapping "All other locales" to
>>>> the Latin number system), which will also use the same separators and
>>>> groupings.
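
A minimal sketch of the resolution path listed above, in Python. The mapping table is hand-written sample data for illustration, not extracted from CLDR, and the names `SAMPLE_MAPPINGS`, `ALGORITHMIC` and `fallback_decimal_system` are hypothetical. It also skips any "native" mapping that is itself algorithmic, which is an added assumption meant to avoid the recursion mentioned earlier in the proposal.

```python
# Hypothetical sample data: locale -> {variant: numbering system name}.
SAMPLE_MAPPINGS = {
    "ta": {"default": "latn", "native": "tamldec", "traditional": "taml"},
    "fr": {"default": "latn"},
}

# Numbering systems treated as algorithmic (RBNF-based) in this sketch.
ALGORITHMIC = {"taml", "roman", "romanlow"}


def fallback_decimal_system(locale: str, mappings=SAMPLE_MAPPINGS) -> str:
    """Pick the decimal system used when an algorithmic rule hits "=="."""
    variants = mappings.get(locale, {})
    native = variants.get("native")
    # Prefer the "native" variant, but only if it is a decimal system.
    if native is not None and native not in ALGORITHMIC:
        return native
    # Otherwise use the locale default, or Latin digits for
    # "all other locales".
    return variants.get("default", "latn")


print(fallback_decimal_system("ta"))  # tamldec
print(fallback_decimal_system("fr"))  # latn
```

The separators and grouping would then be taken from the number symbols of the selected system in the same locale; that part is not modelled here.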
>>>>
>>>> This will be a major improvement for number systems used in lots of
>>>> languages (including Latin-written languages) such as the "roman" number
>>>> system.
>>>>
>>>> One more note:
>>>>
>>>> East Asian locales written in traditional scripts prefer to use their own
>>>> algorithmic system, which cannot format all numbers. As they are rendered
>>>> using sinographic squares, the fallback "native" digits should use the
>>>> "fullwidth" variant: this can be specified using "=-native=" or, more
>>>> specifically, "=-fullwidth=".
>>>>
>>>> Note that for now no "==" substitution rule can start with a minus sign
>>>> ("-"); it must only be:
>>>> - a valid ruleset name (starting with % or %%), or
>>>> - a decimal format (starting with "0" or "#", that I want to deprecate),
>>>> or
>>>> - empty (but the current implementation in ICU creates an infinite
>>>> loop, or only uses Basic Latin decimal digits in a fixed number format,
>>>> independent of the locale)
>>>>
>>>> So there is absolutely no conflict when we use a "==" substitution rule
>>>> starting with a minus sign (-) to mean that it should use another
>>>> specified number system (such as "native" or "fullwidth" or any specific
>>>> non-algorithmic number system) which is named just after this minus sign.
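
To make the "no conflict" argument concrete, here is a small Python sketch that classifies the text found between the two "=" signs of a substitution. The first three categories follow the list above; the fourth (a leading minus sign naming a number system) exists only in this proposal and is not part of current RBNF syntax. The function name `classify_descriptor` and the category labels are hypothetical.

```python
def classify_descriptor(descriptor: str) -> str:
    """Classify the text between the "=" signs of an RBNF substitution."""
    if descriptor == "":
        return "empty"
    if descriptor.startswith("%"):
        # Covers both public ("%name") and private ("%%name") rulesets.
        return "ruleset"
    if descriptor[0] in "0#":
        return "decimal-format"
    if descriptor.startswith("-"):
        # Proposed in this thread only, e.g. "-native" or "-fullwidth".
        return "numbering-system"
    return "unknown"


for d in ["", "%%lenient-parse", "#,##0.00", "-native"]:
    print(repr(d), "->", classify_descriptor(d))
```

Because the four prefixes ("", "%", "0"/"#", "-") are mutually exclusive, the first character alone is enough to decide which kind of substitution is meant.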
>>>>
>>>> ----
>>>>
>>>> Alternatively, the standard code of a locale (starting with a letter 'a'
>>>> to 'z') could be used in these "==" substitutions, for example:
>>>> - "=ja=" (it would be used only for spellout number formatters
>>>> specific to the Japanese locale),
>>>> - "=ar-TN=" (for a spellout number formatter in Arabic as spoken in
>>>> Tunisia, when words cannot be used and the Tunisian Arabic rules should be
>>>> used, which differ from standard Arabic [ar], as they use Latin digits
>>>> instead of Arabic digits: it would still use the separators and groupings
>>>> specified for the Tunisian Arabic locale, which also do not use the
>>>> Arabic comma)
>>>>
>>>> In that case, the standard way to designate another number system
>>>> (without reference to a specific language) should use the Unicode locale
>>>> tags for number systems, but without any leading language subtags (i.e.
>>>> "=-u-nu-native=", instead of just "=-native="), as number formatting rules
>>>> are not expected in most cases to replace the language itself, just to
>>>> replace the number system: this is the reason for using the leading minus
>>>> for such usage (but we could also replace the region code only, such as
>>>> "=-CN=", or the script code only, such as "=-Bopo="). This is different
>>>> from using "=und-CN=" or "=und-Bopo=" because we don't want to replace the
>>>> language with an undetermined language, which would use only default
>>>> digits, default grouping separators and default grouping formats instead
>>>> of keeping those of the current locale.
>>>>
>>>>
>>>> -- Philippe.