From markus.icu at gmail.com  Thu Apr  3 12:01:02 2014
From: markus.icu at gmail.com (Markus Scherer)
Date: Thu, 3 Apr 2014 10:01:02 -0700
Subject: CLDR proposal: Unicode algorithms should fall back to root, not to
 unrelated default locale
Message-ID: <CAN49p6rc5wK1F0ftSEwkdJqsFmCRdga8=9v+hA7Q2zuc9L1ASw@mail.gmail.com>

Dear CLDR team & users,

We have consensus in the ICU team for a modified fallback policy for when
data is requested for a service based on a Unicode algorithm.

Assuming that such a policy is appropriate for the LDML spec (I have not
looked whether the spec currently mentions fallbacks in the absence of
data), I propose that we add the following:

When requesting a specific locale for collation, break iteration, or case
mapping, when we do not have any data for even the locale's base language,
then we should fall back to the root locale rather than the default locale.

Note: This will not change behavior for languages for which we do have
specific data for the service, even if it is an empty data file.

Each of these services tailors a Unicode algorithm which is explicitly
designed to provide reasonable default behavior when no language-specific
behavior is known or available.

For example, in 2012/ICU 52m1, we had an "environment test" failure (
ticket:10277 <http://bugs.icu-project.org/trac/ticket/10277>) that was
caused by requesting Basque (eu) collation and AlphabeticIndex when the
default locale was Azerbaijani (az), Lithuanian, or Estonian (et) (and
maybe more languages); in Azerbaijani, x sorts between h and i; this is
undesirable when the request was for Basque. In the absence of specific
Basque data, we should assume that the all-Unicode root sort order is
appropriate.
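To make the effect concrete, here is a minimal Python sketch that sorts three made-up tokens with two toy alphabets: a plain a-z order standing in for the root order, and a variant with x moved between h and i, as described above for Azerbaijani. The alphabets and tokens are illustrative assumptions, not CLDR data.

```python
# Toy alphabets: assumptions for illustration, not real CLDR collation data.
ROOT_LIKE = "abcdefghijklmnopqrstuvwxyz"
# Same letters, but x moved to sit between h and i.
AZ_LIKE = "abcdefghxijklmnopqrstuvwyz"


def sort_with(alphabet, words):
    """Sort words by the position of each letter in the given alphabet."""
    rank = {ch: i for i, ch in enumerate(alphabet)}
    return sorted(words, key=lambda w: [rank[ch] for ch in w])


words = ["ia", "xa", "hi"]
root_sorted = sort_with(ROOT_LIKE, words)  # x stays near the end
az_sorted = sort_with(AZ_LIKE, words)      # x now sorts before i
print(root_sorted)
print(az_sorted)
```

The same input yields two different orders, which is the surprise a Basque request would get if it silently picked up the default locale's tailoring.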

Similarly, it is undesirable to fall back from French to Turkish case
mappings, or from Italian to Finnish line breaking.

By contrast, for UI languages, display names, and formatting, the root
locale is not useful: No UI messages, ISO codes instead of display names,
minimal patterns. By falling back to a default locale, the user gets
strings in what is hopefully a language they understand, even if not the
language they requested.
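
The proposed policy could be sketched roughly as follows in Python. This is only an illustration under simplifying assumptions: the service names, the `resolve` function, and the truncation-based parent chain are hypothetical, and real CLDR/ICU locale inheritance is more involved.

```python
# Services that tailor a Unicode algorithm with a usable root behavior.
# The names are hypothetical labels for this sketch.
ALGORITHMIC_SERVICES = {"collation", "break", "casemap"}


def parent_chain(locale):
    """Yield the locale and its truncation parents, e.g. eu_ES, then eu."""
    parts = locale.split("_")
    while parts:
        yield "_".join(parts)
        parts.pop()


def resolve(service, requested, default_locale, available):
    """Return the locale whose data should serve the request.

    `available` is the set of locales that have a data file for this
    service; an empty data file still counts as available.
    """
    for candidate in parent_chain(requested):
        if candidate in available:
            return candidate
    if service in ALGORITHMIC_SERVICES:
        # Proposed policy: never go through the unrelated default locale.
        return "root"
    # UI-style services: a default-locale string beats no string at all.
    for candidate in parent_chain(default_locale):
        if candidate in available:
            return candidate
    return "root"
```

For example, `resolve("collation", "eu_ES", "az", {"az"})` gives `"root"`, while the same call for a display-name style service gives `"az"`.
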
Sincerely,
markus
-- 
Google Internationalization Engineering

From richard.wordingham at ntlworld.com  Thu Apr  3 15:21:09 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 3 Apr 2014 21:21:09 +0100
Subject: CLDR proposal: Unicode algorithms should fall back to root, not
 to unrelated default locale
In-Reply-To: <CAN49p6rc5wK1F0ftSEwkdJqsFmCRdga8=9v+hA7Q2zuc9L1ASw@mail.gmail.com>
References: <CAN49p6rc5wK1F0ftSEwkdJqsFmCRdga8=9v+hA7Q2zuc9L1ASw@mail.gmail.com>
Message-ID: <20140403212109.33833276@JRWUBU2>

On Thu, 3 Apr 2014 10:01:02 -0700
Markus Scherer <markus.icu at gmail.com> wrote:

> When requesting a specific locale for collation, break iteration, or
> case mapping, when we do not have any data for even the locale's base
> language, then we should fall back to the root locale rather than the
> default locale.

Would language matching data take preference over either?  I can see
deserving use cases where the default language is the national language
and the selected locale is for a minority language.

How are break iteration rules meant to interact with dictionary-based
word and line-breakers?

> Note: This will not change behavior for languages for which we do have
> specific data for the service, even if it is an empty data file.

Richard.

From richard.wordingham at ntlworld.com  Thu Apr  3 16:30:36 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 3 Apr 2014 22:30:36 +0100
Subject: Non-primary Weights of U+FFFE
In-Reply-To: <CAN49p6ogEHP+G=vrER=1XNoCE_YPhoLKMSKP-MDFPJ6+mgyg+Q@mail.gmail.com>
References: <20140330132445.43398a4e@JRWUBU2>
 <CAN49p6ogEHP+G=vrER=1XNoCE_YPhoLKMSKP-MDFPJ6+mgyg+Q@mail.gmail.com>
Message-ID: <20140403223036.3ab46070@JRWUBU2>

On Sun, 30 Mar 2014 09:17:44 -0700
Markus Scherer <markus.icu at gmail.com> wrote:

> On Sun, Mar 30, 2014 at 5:24 AM, Richard Wordingham <
> richard.wordingham at ntlworld.com> wrote:
> 
> > Is there any reason that a CLDR-compliant collation algorithm should
> > particularly care about the non-primary weights of U+FFFE?  So long
> > as they satisfy the well-formedness conditions, all I can see is
> > that having unique values *may* simplify sort key formation for
> > reversed levels.
> >
> 
> The non-primary weights need to be greater than the level
> separator(s)

Guaranteed by WF1 and S3.2

> and less than the weights of CEs that are ignorable on
> previous levels.

Guaranteed by WF2 plus case-related rules, even if U+FFFE is not
treated as a special case.

> It is also important to generate the special weights
> on primary to tertiary levels for shifted CEs, so that
> alternate=shifted works properly.

Can you expand on this, because I don't see any such need at the
primary to tertiary levels.

From your comment on ICU below, I can now see that you are specifying
a behaviour for the quaternary level.  Now, in full strength
comparisons, we have, whatever the alternate setting,

"op" < "?p"
"o p" < "op"

Now, "o\uFFFE p" < "o\uFFFEp" < "o \uFFFEp" for alternate=non-ignorable.
However, if the quaternary level weight of \uFFFE was calculated by the
Unicode Collation Algorithm using allkeys_CLDR.txt as its collation
element table, we would have

"o \uFFFEp" < "o\uFFFE p" < "o\uFFFEp" for alternate=non-ignorable

To get the same ordering for these strings as for
alternate=non-ignorable, one needs U+FFFE to have a minimal quaternary
weight.  I don't see a test for this in CollationTest_CLDR_SHIFTED.txt.

It seems that the UCA should be adjusted (in Section 3.6, variable
weighting) so that L4 weights for L1 non-variable but less than a
variable weight is 'as L1', rather than FFFF.  If I formally report
this, should it be via a CLDR ticket or through the general Unicode
mechanism?

> In ICU, we have test code that expects the same sort keys generated
> from concatenating two strings with U+FFFE vs. calling
> ucol_mergeSortkeys() on the two separate sort keys. The latter merges
> sort keys by copying each level (separated by byte 01) from each sort
> key and inserting a byte 02 between the bytes from different sort
> keys. (see
> ucol.h<http://www.icu-project.org/apiref/icu4c/ucol_8h.html> )

So is the reason for unique weights at the secondary to tertiary levels
simply that you don't want to have to unpick ICU's run-length
compression for your test?

Richard.


From markus.icu at gmail.com  Thu Apr  3 22:01:40 2014
From: markus.icu at gmail.com (Markus Scherer)
Date: Thu, 3 Apr 2014 20:01:40 -0700
Subject: CLDR proposal: Unicode algorithms should fall back to root, not
 to unrelated default locale
In-Reply-To: <20140403212109.33833276@JRWUBU2>
References: <CAN49p6rc5wK1F0ftSEwkdJqsFmCRdga8=9v+hA7Q2zuc9L1ASw@mail.gmail.com>
 <20140403212109.33833276@JRWUBU2>
Message-ID: <CAN49p6pyGDvuTEDzc6BQTr7dtNcAXR=ckVz6xmKX0C8jgv_AWw@mail.gmail.com>

On Thu, Apr 3, 2014 at 1:21 PM, Richard Wordingham <
richard.wordingham at ntlworld.com> wrote:

> Would language matching data take preference over either?
>

Language matching should happen earlier. You would match a desired language
against the list of known available languages. Then when you open a service
object there with the resulting language, you don't get into this situation.
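
As a rough Python sketch of "match first, then open the service": the matcher below is a deliberately naive assumption (exact match, then base language subtag) and the function name is hypothetical; real language matching uses much richer data.

```python
def match_language(desired, available, fallback="root"):
    """Return the first desired language that has data, else the fallback.

    Toy matcher: tries the full tag, then only its base language subtag.
    """
    for want in desired:
        if want in available:
            return want
        base = want.split("_")[0]
        if base in available:
            return base
    return fallback


# The service is then opened with the matched language, so the service
# itself never needs to consult a default locale.
chosen = match_language(["kxm_Thai_TH", "th"], {"th", "en"})
print(chosen)
```
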

> How are break iteration rules meant to interact with dictionary-based
> word and line-breakers?
>

In CLDR and ICU, the rules specify the set of characters that need
dictionary support. (It's triggered by script, not by language.)
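
A small Python sketch of what "triggered by script, not by language" means in practice. Python's standard library does not expose the Line_Break property, so the block ranges below are a coarse, assumed stand-in for the real character set, and the function names are hypothetical.

```python
# Coarse stand-in for the set of characters needing dictionary support.
# Assumption: whole Unicode blocks; the real property value is narrower.
DICTIONARY_BLOCKS = [
    (0x0E00, 0x0E7F),  # Thai
    (0x0E80, 0x0EFF),  # Lao
    (0x1000, 0x109F),  # Myanmar
    (0x1780, 0x17FF),  # Khmer
]


def needs_dictionary(ch):
    """True if the character falls in one of the assumed blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in DICTIONARY_BLOCKS)


def dictionary_runs(text):
    """Return (start, end) index pairs of maximal runs needing a dictionary."""
    runs, start = [], None
    for i, ch in enumerate(text):
        if needs_dictionary(ch):
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(text)))
    return runs
```

Note that no locale is passed in anywhere: the decision depends only on the characters in the text.
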

I expect that there will generally be data for language-specific
exceptions, overrides and such for more languages than character-level
segmentation rules. Those low-level rules should always fall back to root
when there is no language-specific data. I think the higher-level
exceptions should probably also avoid going through some default language.

markus

From markus.icu at gmail.com  Thu Apr  3 23:17:10 2014
From: markus.icu at gmail.com (Markus Scherer)
Date: Thu, 3 Apr 2014 21:17:10 -0700
Subject: Non-primary Weights of U+FFFE
In-Reply-To: <20140403223036.3ab46070@JRWUBU2>
References: <20140330132445.43398a4e@JRWUBU2>
 <CAN49p6ogEHP+G=vrER=1XNoCE_YPhoLKMSKP-MDFPJ6+mgyg+Q@mail.gmail.com>
 <20140403223036.3ab46070@JRWUBU2>
Message-ID: <CAN49p6r_uqPZ5DVZUGMi63xVartFbSL7Zf+8xwVf1e1RYpuuww@mail.gmail.com>

On Thu, Apr 3, 2014 at 2:30 PM, Richard Wordingham <
richard.wordingham at ntlworld.com> wrote:

> > It is also important to generate the special weights
> > on primary to tertiary levels for shifted CEs, so that
> > alternate=shifted works properly.
>
> Can you expand on this, because I don't see any such need at the
> primary to tertiary levels.
>

I think I confused myself. Please ignore this sentence and instead read
what I put into the spec:

1.1.1 U+FFFE <http://www.unicode.org/reports/tr35/tr35-collation.html#Algorithm_FFFE>

U+FFFE maps to a CE with special minimal weights on all levels, including
case, quaternary and identical levels - which may require special code for
those levels. Its primary weight is not "variable": U+FFFE must not become
ignorable in alternate handling.

> From your comment on ICU below, I can now see that you are specifying
> a behaviour for the quaternary level.


"all levels" includes quaternary and identical.

> Now, in full strength
> comparisons, we have, whatever the alternate setting,
>
> "op" < "?p"
> "o p" < "op"
>
> Now, "o\uFFFE p" < "o\uFFFEp" < "o \uFFFEp" for alternate=non-ignorable.
> However, if the quaternary level weight of \uFFFE was calculated by the
> Unicode Collation Algorithm using allkeys_CLDR.txt as its collation
> element table, we would have
>
> "o \uFFFEp" < "o\uFFFE p" < "o\uFFFEp" for alternate=non-ignorable
>
> To get the same ordering for these strings as for
> alternate=non-ignorable, one needs U+FFFE to have a minimal quaternary
> weight.  I don't see a test for this in CollationTest_CLDR_SHIFTED.txt.
>
> It seems that the UCA should be adjusted (in Section 3.6, variable
> weighting) so that L4 weights for L1 non-variable but less than a
> variable weight is 'as L1', rather than FFFF.  If I formally report
> this, should it be via a CLDR ticket or through the general Unicode
> mechanism?
>

I am not sure what you mean. The special mapping and behavior exist in CLDR
but not in the UCA, so none of this applies to UTS #10.

With ICU 53, which implements this, I get
<1 o\uFFFE p
    45 02 47 , 05 02 05 , 05 02 05 , 1C 02 04 1C .
<4 o\uFFFEp
    45 02 47 , 05 02 05 , 05 02 05 , 1C 02 1C .
<4 o \uFFFEp
    45 02 47 , 05 02 05 , 05 02 05 , 1C 04 02 1C .

(http://demo.icu-project.org/icu-bin/collation.html with
strength=quaternary, alternate=shifted, sort keys=on, and your input
strings)
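
The order of the three keys above can be checked with a plain bytewise comparison. The Python sketch below assumes that the demo's "," stands for the level separator byte 01 and "." for the terminator byte 00; that mapping is an assumption about the display format, and `parse_demo_key` is a hypothetical helper.

```python
def parse_demo_key(text):
    """Convert the demo's display form into sort key bytes.

    Assumption: "," is the level separator (01) and "." the terminator (00).
    """
    out = bytearray()
    for token in text.split():
        if token == ",":
            out.append(0x01)
        elif token == ".":
            out.append(0x00)
        else:
            out.append(int(token, 16))
    return bytes(out)


# Sort keys copied from the ICU 53 output quoted above.
DEMO_KEYS = {
    r"o\uFFFE p": "45 02 47 , 05 02 05 , 05 02 05 , 1C 02 04 1C .",
    r"o\uFFFEp": "45 02 47 , 05 02 05 , 05 02 05 , 1C 02 1C .",
    r"o \uFFFEp": "45 02 47 , 05 02 05 , 05 02 05 , 1C 04 02 1C .",
}

# Sorting by raw bytes reproduces the order in which the demo listed them.
ordered = sorted(DEMO_KEYS, key=lambda label: parse_demo_key(DEMO_KEYS[label]))
print(ordered)
```

The three keys agree on the first three levels and differ only in the quaternary level, where the low byte 02 for U+FFFE decides the order.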

> In ICU, we have test code that expects the same sort keys generated
> > from concatenating two strings with U+FFFE vs. calling
> > ucol_mergeSortkeys() on the two separate sort keys. The latter merges
> > sort keys by copying each level (separated by byte 01) from each sort
> > key and inserting a byte 02 between the bytes from different sort
> > keys. (see
> > ucol.h<http://www.icu-project.org/apiref/icu4c/ucol_8h.html> )
>
> So is the reason for unique weights at the secondary to tertiary levels
> simply that you don't want to have to unpick ICU's run-length
> compression for your test?
>

For ICU, we use weights and code to make U+FFFE behave exactly like the
function that works on finished sort keys. It makes it easy to test that it
works right.
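
The merge operation described in this thread (copy each level, separated by byte 01, and insert byte 02 between the parts) can be sketched in Python as below. This is not the ICU implementation: `merge_sort_keys` is a hypothetical name, the handling of unequal level counts is an assumption, and the individual keys for "o" and "p" are inferred from the merged key shown earlier rather than taken from ICU output.

```python
def split_levels(key):
    """Drop the 00 terminator, if present, and split the key at byte 01."""
    if key.endswith(b"\x00"):
        key = key[:-1]
    return key.split(b"\x01")


def merge_sort_keys(left, right):
    """Merge two sort keys level by level with byte 02 between them.

    Assumption: a key with fewer levels contributes empty bytes
    for the levels it lacks.
    """
    a_levels, b_levels = split_levels(left), split_levels(right)
    merged = []
    for i in range(max(len(a_levels), len(b_levels))):
        a = a_levels[i] if i < len(a_levels) else b""
        b = b_levels[i] if i < len(b_levels) else b""
        merged.append(a + b"\x02" + b)
    return b"\x01".join(merged) + b"\x00"


# Assumed single-letter keys, inferred from the merged key for "o\uFFFEp".
key_o = bytes.fromhex("45 01 05 01 05 01 1C 00")
key_p = bytes.fromhex("47 01 05 01 05 01 1C 00")
merged = merge_sort_keys(key_o, key_p)
print(merged.hex(" "))
```

If the assumed inputs are right, the result matches the key listed above for the concatenation with U+FFFE, which is the equivalence the ICU test relies on.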

This behavior might not otherwise be necessary. It might even work if you
give U+FFFE "common" non-primary weights and apply the run-length
compression across it. At least I can't find a reason why it would not
work. If this is true, then we could weaken the spec and turn some of the
current requirement into a recommendation.

markus

From markus.icu at gmail.com  Fri Apr  4 10:49:27 2014
From: markus.icu at gmail.com (Markus Scherer)
Date: Fri, 4 Apr 2014 08:49:27 -0700
Subject: Non-primary Weights of U+FFFE
In-Reply-To: <CAN49p6r_uqPZ5DVZUGMi63xVartFbSL7Zf+8xwVf1e1RYpuuww@mail.gmail.com>
References: <20140330132445.43398a4e@JRWUBU2>
 <CAN49p6ogEHP+G=vrER=1XNoCE_YPhoLKMSKP-MDFPJ6+mgyg+Q@mail.gmail.com>
 <20140403223036.3ab46070@JRWUBU2>
 <CAN49p6r_uqPZ5DVZUGMi63xVartFbSL7Zf+8xwVf1e1RYpuuww@mail.gmail.com>
Message-ID: <CAN49p6peoYv3XT5Z0X2u7Ojm3ppDMXPNkjOw1ZpOWrde0inYAg@mail.gmail.com>

Now I know: U+FFFE needs special low weights on all levels because we have
always done it that way!

Just kidding. I submitted http://unicode.org/cldr/trac/ticket/7202

markus

From richard.wordingham at ntlworld.com  Fri Apr  4 14:36:42 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 4 Apr 2014 20:36:42 +0100
Subject: Non-primary Weights of U+FFFE
In-Reply-To: <CAN49p6r_uqPZ5DVZUGMi63xVartFbSL7Zf+8xwVf1e1RYpuuww@mail.gmail.com>
References: <20140330132445.43398a4e@JRWUBU2>
 <CAN49p6ogEHP+G=vrER=1XNoCE_YPhoLKMSKP-MDFPJ6+mgyg+Q@mail.gmail.com>
 <20140403223036.3ab46070@JRWUBU2>
 <CAN49p6r_uqPZ5DVZUGMi63xVartFbSL7Zf+8xwVf1e1RYpuuww@mail.gmail.com>
Message-ID: <20140404203642.611149db@JRWUBU2>

On Thu, 3 Apr 2014 21:17:10 -0700
Markus Scherer <markus.icu at gmail.com> wrote:

> On Thu, Apr 3, 2014 at 2:30 PM, Richard Wordingham <
> richard.wordingham at ntlworld.com> wrote:

> > Now, in full strength
> > comparisons, we have, whatever the alternate setting,
> >
> > "op" < "?p"
> > "o p" < "op"
> >
> > Now, "o\uFFFE p" < "o\uFFFEp" < "o \uFFFEp" for
> > alternate=non-ignorable. However, if the quaternary level weight of
> > \uFFFE was calculated by the Unicode Collation Algorithm using
> > allkeys_CLDR.txt as its collation element table, we would have
> >
> > "o \uFFFEp" < "o\uFFFE p" < "o\uFFFEp" for alternate=non-ignorable

Sorry, I meant to write
"o \uFFFEp" < "o\uFFFE p" < "o\uFFFEp" for alternate=shifted

> > To get the same ordering for these strings as for
> > alternate=non-ignorable, one needs U+FFFE to have a minimal
> > quaternary weight.  I don't see a test for this in
> > CollationTest_CLDR_SHIFTED.txt.

The problem here is that the collation test is passed whether one uses
the UCA or the CLDR collation algorithm, whereas these currently define
different orders for these three strings with alternate=shifted.  

> > It seems that the UCA should be adjusted (in Section 3.6, variable
> > weighting) so that L4 weights for L1 non-variable but less than a
> > variable weight is 'as L1', rather than FFFF.  If I formally report
> > this, should it be via a CLDR ticket or through the general Unicode
> > mechanism?
 
> I am not sure what you mean. The special mapping and behavior exist
> in CLDR but not in the UCA, so none of this applies to UTS #10.

Non-variable primary weights less than variable primary weights exist
in the UCA, and are established by allkeys_CLDR.txt.  It so happens
that there aren't any such weights in *DUCET* - just as there aren't any
tertiary collation elements.

Returning to the LDML specification, Markus pointed out that in the
account of U+FFFE,
> "all levels" includes quaternary and identical.

The concept of a collation element does not really apply at the
identical level - its formation does not respect the division of a
string into collating elements. For example,  <U+0443 CYRILLIC SMALL
LETTER U, U+0308 COMBINING DIAERESIS, U+0334 COMBINING TILDE> has
collating elements <U+0443, U+0334> and <U+0308>, but the identical
level contribution to the sort key is 0443, 0308, 0334.  Now the
concept of U+FFFE requires that at the 'identical' level,
"a\u0000\uFFFE" sort after "a\uFFFE".  At its simplest, this requires
that U+FFFE be transformed to a negative scalar value!

Now, as I understand it, the identical level is not intended to address
any cultural concepts of ordering, but simply as a convenience in
handling inequivalent strings, so that (a) distinct strings need not
compare as equal, and (b) canonically equivalent strings are ordered
together.  However, there are cases where changing the ordering of
indecomposable codepoints might have benefits - non-spacing Hebrew
accents (all ignorable) and kashida (U+0640 ARABIC TATWEEL) come to
mind.  The simplest mechanism I can see is for the UCA to allow a
tailoring to permute scalar values for the purposes of the identical
level.  Thus, for CLDR root, we would have the permutation (U+0000 ..
U+FFFE), and for CLDR we would require that U+FFFE be permuted to
U+0000.  (For collation, a permutation of all scalar values is
equivalent to a permutation of all indecomposable scalar values, and
allowing a formal permutation of all scalar values is simpler.)  It is
not necessary for CLDR to support any other permutations - it has no
mechanisms for tailoring casing for collation and only limited
mechanisms for creating extra levels.

Richard.


From markus.icu at gmail.com  Fri Apr  4 18:55:42 2014
From: markus.icu at gmail.com (Markus Scherer)
Date: Fri, 4 Apr 2014 16:55:42 -0700
Subject: Non-primary Weights of U+FFFE
In-Reply-To: <20140404203642.611149db@JRWUBU2>
References: <20140330132445.43398a4e@JRWUBU2>
 <CAN49p6ogEHP+G=vrER=1XNoCE_YPhoLKMSKP-MDFPJ6+mgyg+Q@mail.gmail.com>
 <20140403223036.3ab46070@JRWUBU2>
 <CAN49p6r_uqPZ5DVZUGMi63xVartFbSL7Zf+8xwVf1e1RYpuuww@mail.gmail.com>
 <20140404203642.611149db@JRWUBU2>
Message-ID: <CAN49p6qczE3XoJ=2qWbKrgW-nFUZjMZU4tiVZ4U-zdCcBHt9Lg@mail.gmail.com>

On Fri, Apr 4, 2014 at 12:36 PM, Richard Wordingham <
richard.wordingham at ntlworld.com> wrote:

> Non-variable primary weights less than variable primary weights exist
> in the UCA, and are established by allkeys_CLDR.txt.


Only for U+FFFE.

> Returning to the LDML specification, Markus pointed out that in the
> account of U+FFFE,
> > "all levels" includes quaternary and identical.
>
> The concept of a collation element does not really apply at the
> identical level - its formation does not respect the division of a
> string into collating elements. For example,  <U+0443 CYRILLIC SMALL
> LETTER U, U+0308 COMBINING DIAERESIS, U+0334 COMBINING TILDE> has
> collating elements <U+0443, U+0334> and <U+0308>, but the identical
> level contribution to the sort key is 0443, 0308, 0334.  Now the
> concept of U+FFFE requires that at the 'identical' level,
> "a\u0000\uFFFE" sort after "a\uFFFE".


Right. With ICU 53:

<1 a\uFFFE
    29 02 , 05 02 , 05 02 , 02 , 92 02 .
<i a\u0000\uFFFE
    29 02 , 05 02 , 05 02 , 02 , 92 31 02 .

> At its simplest, this requires
> that U+FFFE be transformed to a negative scalar value!
>

That depends on how you encode the identical level. In the UCA as written,
you could do a transformation like this:
FFFE->0000
0000->0001 0001
0001->0001 0002
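
A minimal Python sketch of this three-entry transformation, to show that it keeps the remaining code points in order while making U+FFFE the lowest unit. The function name is hypothetical, and the sketch assumes the input is already normalized; it is not how ICU encodes the identical level.

```python
def identical_level(text):
    """Map code points so that U+FFFE becomes the lowest unit.

    FFFE -> 0000, 0000 -> 0001 0001, 0001 -> 0001 0002, others unchanged.
    Assumption: `text` is already in the normalization form used for
    the identical level.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp == 0xFFFE:
            out.append(0x0000)
        elif cp == 0x0000:
            out.extend((0x0001, 0x0001))
        elif cp == 0x0001:
            out.extend((0x0001, 0x0002))
        else:
            out.append(cp)
    return out


short = identical_level("a\uFFFE")
longer = identical_level("a\u0000\uFFFE")
print(short < longer)
```

Lists of integers compare element by element, so `short < longer` expresses the requirement that "a\uFFFE" sort before "a\u0000\uFFFE".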

In ICU, we use a simple "compression" scheme (a delta encoding) that
preserves binary order, and we reserved byte values 00 (terminator), 01
(level separator), 02 (for U+FFFE).

> Now, as I understand it, the identical level is not intended to address
> any cultural concepts of ordering, but simply as a convenience in
> handling inequivalent strings, so that (a) distinct strings need not
> compare as equal, and (b) canonically equivalent strings are ordered
> together.


Yes. It's mostly a semi-arbitrary tie-breaker, except that in the CLDR
Japanese tailoring it provides the distinctions of JIS X 4061 level 5
(compatibility forms of Japanese characters sort after their regular forms).

markus

From richard.wordingham at ntlworld.com  Sat Apr  5 11:30:31 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sat, 5 Apr 2014 17:30:31 +0100
Subject: CLDR proposal: Unicode algorithms should fall back to root, not
 to unrelated default locale
In-Reply-To: <CAN49p6pyGDvuTEDzc6BQTr7dtNcAXR=ckVz6xmKX0C8jgv_AWw@mail.gmail.com>
References: <CAN49p6rc5wK1F0ftSEwkdJqsFmCRdga8=9v+hA7Q2zuc9L1ASw@mail.gmail.com>
 <20140403212109.33833276@JRWUBU2>
 <CAN49p6pyGDvuTEDzc6BQTr7dtNcAXR=ckVz6xmKX0C8jgv_AWw@mail.gmail.com>
Message-ID: <20140405173031.4d4eb558@JRWUBU2>

On Thu, 3 Apr 2014 20:01:40 -0700
Markus Scherer <markus.icu at gmail.com> wrote:

> On Thu, Apr 3, 2014 at 1:21 PM, Richard Wordingham <
> richard.wordingham at ntlworld.com> wrote:

>> How are break iteration rules meant to interact with
>> dictionary-based word and line-breakers?

> In CLDR and ICU, the rules specify the set of characters that need
> dictionary support. (It's triggered by script, not by language.)

In CLDR, which rules are these?  I can't find them.  All I can find is
statements outside CLDR such as "For Thai, Lao, Khmer, Myanmar, and
other scripts that do not typically use spaces between words, a good
implementation should not depend on the default word boundary
specification" in UAX#29 'Unicode Text Segmentation'.

Now, some minority languages in these scripts use spaces between words,
as can be seen in the Northern Khmer bible (e.g. at
http://www.amazon.com/Bible-Northern-Khmer-Black-Cover/dp/9749141083).
While Thai might be a good fallback language for kxm-Thai-TH (there is
some usage of kxm-Khmr-TH), a Thai dictionary-based break iterator would
be a disaster.  On the other hand, I would hope for tolerable breaking
performance from a Thai dictionary-based break iterator for
North-Eastern Thai (tts-Thai-TH), which does not separate words.  By
contrast, I would describe the performance for phonetically written
Northern Thai, as revealed by the Thai spell-checker in LibreOffice, as
unsurprisingly poor.

> I expect that there will generally be data for language-specific
> exceptions, overrides and such for more languages than character-level
> segmentation rules. Those low-level rules should always fall back to
> root when there is no language-specific data. I think the higher-level
> exceptions should probably also avoid going through some default
> language.

If breakers just ignore the segmentation rules, then it should always
help to define rough and ready segmentation rules for every language
that uses a mainland SE Asian script as identified by Line_Break=SA.
Syllable breaking is generally a good approximation to word and
line-breaking, and in the visually ordered scripts, the preposed vowels
start syllables.  One needs a good reason to default the segmentation
rules to root for such languages.

Turning to collation, is the way to provide defaulting for collation
tag in collation/root.xml to list all languages as valid sublocales?  I
am a bit confused as to the point of having the file collation/en.xml.
What does it achieve?  Does it exist purely for the sake of its comment?

Richard.

From markus.icu at gmail.com  Sat Apr  5 12:12:10 2014
From: markus.icu at gmail.com (Markus Scherer)
Date: Sat, 5 Apr 2014 10:12:10 -0700
Subject: CLDR proposal: Unicode algorithms should fall back to root, not
 to unrelated default locale
In-Reply-To: <20140405173031.4d4eb558@JRWUBU2>
References: <CAN49p6rc5wK1F0ftSEwkdJqsFmCRdga8=9v+hA7Q2zuc9L1ASw@mail.gmail.com>
 <20140403212109.33833276@JRWUBU2>
 <CAN49p6pyGDvuTEDzc6BQTr7dtNcAXR=ckVz6xmKX0C8jgv_AWw@mail.gmail.com>
 <20140405173031.4d4eb558@JRWUBU2>
Message-ID: <CAN49p6pTHtn55L-sp2XQj5hYmZRnWVo15bCgMhGkS3Y1QixPDA@mail.gmail.com>

On Sat, Apr 5, 2014 at 9:30 AM, Richard Wordingham <
richard.wordingham at ntlworld.com> wrote:

> > In CLDR and ICU, the rules specify the set of characters that need
> > dictionary support. (It's triggered by script, not by language.)
>
> In CLDR, which rules are these?


I think it's
    <variable id="$SA">\p{Line_Break=Complex_Context}</variable>
which you can find in the line-break rules in
    http://unicode.org/cldr/trac/browser/trunk/common/segments/root.xml

Also, as far as I know, the ICU rule syntax is different enough from the
CLDR syntax that the conversion is manual. The ICU dictionary support might
need a manual addition.

(Others know a lot more about segmentation than I do.)

> Turning to collation, is the way to provide defaulting for collation
> tag in collation/root.xml to list all languages as valid sublocales?


The validSubLocales data was removed from CLDR. Instead, we have some empty
base-language collation files to document that the root order is known to
be appropriate; as opposed to the absence of a base-language collation file
which basically means "don't know".
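
The three situations can be told apart as in the Python sketch below. The enum, the `classify` function, and the dictionary standing in for the collation directory are all hypothetical; the rule text is a placeholder, not real CLDR data.

```python
from enum import Enum


class CollationData(Enum):
    TAILORED = "file with rules"
    ROOT_IS_RIGHT = "empty file: root order known to be appropriate"
    UNKNOWN = "no file: don't know"


def classify(language, files):
    """`files` maps a base language to its rule text ('' for an empty file)."""
    if language not in files:
        return CollationData.UNKNOWN
    if files[language].strip():
        return CollationData.TAILORED
    return CollationData.ROOT_IS_RIGHT


# Stand-in for the collation directory; contents are placeholders.
files = {"en": "", "hr": "<placeholder rules>"}
print(classify("en", files), classify("hr", files), classify("eu", files))
```
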

> I am a bit confused as to the point of having the file collation/en.xml.
> What does it achieve?  Does it exist purely for the sake of its comment?
>

Yes.

In addition, in the current ICU implementation (I am not sure about the
LDML spec), an empty base-language file means we find something and don't
go through the default locale. When we agree that collation should go
directly to root, rather than to the default locale, then we could remove
the empty resource bundles from ICU (although they are very small). We
would keep the empty CLDR files for documentation.

markus

From richard.wordingham at ntlworld.com  Mon Apr  7 18:39:32 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 8 Apr 2014 00:39:32 +0100
Subject: CLDR proposal: Unicode algorithms should fall back to root, not
 to unrelated default locale
In-Reply-To: <CAN49p6pTHtn55L-sp2XQj5hYmZRnWVo15bCgMhGkS3Y1QixPDA@mail.gmail.com>
References: <CAN49p6rc5wK1F0ftSEwkdJqsFmCRdga8=9v+hA7Q2zuc9L1ASw@mail.gmail.com>
 <20140403212109.33833276@JRWUBU2>
 <CAN49p6pyGDvuTEDzc6BQTr7dtNcAXR=ckVz6xmKX0C8jgv_AWw@mail.gmail.com>
 <20140405173031.4d4eb558@JRWUBU2>
 <CAN49p6pTHtn55L-sp2XQj5hYmZRnWVo15bCgMhGkS3Y1QixPDA@mail.gmail.com>
Message-ID: <20140408003932.66fb779c@JRWUBU2>

On Sat, 5 Apr 2014 10:12:10 -0700
Markus Scherer <markus.icu at gmail.com> wrote:

> On Sat, Apr 5, 2014 at 9:30 AM, Richard Wordingham <
> richard.wordingham at ntlworld.com> wrote:
> 
> > > In CLDR and ICU, the rules specify the set of characters that need
> > > dictionary support. (It's triggered by script, not by language.)
> >
> > In CLDR, which rules are these?

> I think it's
>     <variable id="$SA">\p{Line_Break=Complex_Context}</variable>
> which you can find in the line-break rules in
>     http://unicode.org/cldr/trac/browser/trunk/common/segments/root.xml

If the dictionary is chosen only by script and not by language, then
the design of ICU is currently broken as far as minority languages are
concerned.  I can't see how a Thai dictionary and a Northern or NE Thai
dictionary can co-exist.  (The usual script for writing these languages
is the Thai script, despite attempts to reinvigorate old regional
scripts.)

Going back to the CLDR level, there's another complexity.  Good Thai
typography inserts a space before U+0E46 THAI CHARACTER MAIYAMOK, and
does not break lines before the U+0E46.  It may be possible to fix the
line breaking by a rule something like "× \u0e46".  The sequence
<U+0020, U+0E46> should usually be considered the end of a word - the
truth of Line_Break=Complex_Context can vary within a word.  (There are
a few dictionary entries where <U+0020, U+0E46> occurs within the
non-compound lexical item - U+0E46 is then also followed by a space.)
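
As a rough illustration of a "no break before U+0E46" tweak, the Python sketch below post-filters a list of break opportunities. It is not CLDR or ICU rule syntax; the function name and the convention for break positions are assumptions made for this example.

```python
MAIYAMOK = "\u0e46"  # THAI CHARACTER MAIYAMOK


def drop_breaks_before_maiyamok(text, break_positions):
    """Remove break opportunities that fall directly before U+0E46.

    A position p means a break is allowed between text[p-1] and text[p].
    """
    return [
        p for p in break_positions
        if not (p < len(text) and text[p] == MAIYAMOK)
    ]


# Thai letter, space, mai yamok, space, Thai letter.
sample = "\u0e01 \u0e46 \u0e02"
kept = drop_breaks_before_maiyamok(sample, [2, 4])
print(kept)
```

The break after the first space (position 2) is suppressed because it would strand the mai yamok at the start of a line, while the break after the second space is kept.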

I haven't yet experimented with these rules in ICU.  Might these tweaks
work?  Would tailoring Thai characters not to be
Line_Break=Complex_Context succeed in disabling the use of the Thai
dictionary for a locale?  The following rule in root.xml diminishes
hope: 

	<variable id="$AL">[$AI $AL $XX $SA $SG]</variable>

In all the examples of Pali I've seen in the Thai script, words are
separated by spaces. 

I think U+0E46 should be Line_Break=Exclamation.

Now some people get round the problem by omitting the space but
starting the glyph of mai yamok with a space.  ICU does this with words
that end in mai yamok - there is no preceding space character.  When
looking at serials in Thai magazines, I've noticed that spaces are
omitted before question and exclamation marks when there is a risk of
justification moving them onto the next line.  I suspect the rule
"× EX" is often not implemented.  It is possible that changing the line
break property of mai yamok could inconvenience these people -
removing <space, mai yamok> from the end of a word in the (Thai) Royal
Institute Dictionary does not always yield a word.

The immediate consequence of all this is that changing the inheritance
rules for segmentation would only be depriving certain people of a
benefit they probably don't yet have.

> In addition, in the current ICU implementation (I am not sure about
> the LDML spec), an empty base-language file means we find something
> and don't go through the default locale.

Formally, that looks like a non-compliance!

Richard.


From rxaviers at gmail.com  Thu Apr 17 06:41:39 2014
From: rxaviers at gmail.com (Rafael Xavier)
Date: Thu, 17 Apr 2014 08:41:39 -0300
Subject: CLDR JSON CDN?
Message-ID: <CADdLYsqGgUGwnH-wadgy35WKCZqhHpx6+KB11EPpzSaUpSW0Eg@mail.gmail.com>

Hello fellows,

Unicode hosts a copy of the latest CLDR JSONs at its repository trunk
http://www.unicode.org/repos/cldr-aux/json/25/main/. I guess this URL is
meant for download, not for direct usage (ie as a CDN), right?

Is there any official CDN for CLDR JSONs?

Thanks

-- 
<http://www.rafael.xavier.blog.br/>+55 (16) 8138-1583, skype: rxaviers
http://rafael.xavier.blog.br

From emmo at us.ibm.com  Thu Apr 17 08:32:16 2014
From: emmo at us.ibm.com (John Emmons)
Date: Thu, 17 Apr 2014 08:32:16 -0500
Subject: CLDR JSON CDN?
In-Reply-To: <CADdLYsqGgUGwnH-wadgy35WKCZqhHpx6+KB11EPpzSaUpSW0Eg@mail.gmail.com>
References: <CADdLYsqGgUGwnH-wadgy35WKCZqhHpx6+KB11EPpzSaUpSW0Eg@mail.gmail.com>
Message-ID: <OF3EC04C41.EA07B10F-ON86257CBD.004A460E-86257CBD.004A5DB6@us.ibm.com>


No, there isn't, and we don't really plan to provide one.


Regards,

John C. Emmons
Globalization Architect & Unicode CLDR TC Chairman
IBM Software Group
Internet: emmo at us.ibm.com



From markus.icu at gmail.com  Fri Apr 18 16:41:28 2014
From: markus.icu at gmail.com (Markus Scherer)
Date: Fri, 18 Apr 2014 14:41:28 -0700
Subject: CLDR/ICU proposal: collation rules for import only
Message-ID: <CAN49p6rUE+BRE_=25vbbryj0iPQK3asHujKm6aqPcpQ97eKaEw@mail.gmail.com>

Dear CLDR & ICU teams & users,

Summary: I propose that we distinguish for-import-only rules from
create-a-sort-order rules via a naming convention rather than flags in the
data.

Details:

In collation rules, we can "import" the rules of another tailoring. For
example, common/collation/bs.xml
<http://unicode.org/cldr/trac/browser/trunk/common/collation/bs.xml> has
<import source="hr"/>.

We want to extend this by writing partial rules that are not intended as
their own sort orders but only for import into other rules. See
http://cldr.unicode.org/development/development-process/design-proposals/collation-additions#TOC-Collation-Import
and http://unicode.org/cldr/trac/ticket/3949

The idea was to use <settings private="true"> in CLDR, and I see that this
attribute exists in common/dtd/ldml.dtd
<http://unicode.org/cldr/trac/browser/trunk/common/dtd/ldml.dtd>, but
it is marked as deprecated, and it is not documented in the LDML
collation spec. In ICU we would turn it into something like NoBinary{""} (
http://bugs.icu-project.org/trac/ticket/8082).

However, we also want to suppress such for-import-only rules from the lists
of "available" keyword values and collators (
http://bugs.icu-project.org/trac/ticket/8983). If we did this via a data
flag, then we would have to load the data before we can find out that we
want to exclude it from the list.

In addition, collation types are normally added to the
common/bcp47/collation.xml file. This is undesirable for what are really
internal identifiers. We don't want to advertise them as available, *we
don't want to collect display names for them*, and we don't want to have to
keep them stable.

I have a simpler proposal:

- I propose that we use a naming convention to distinguish for-import-only
rules.
- I propose that the first character of the collation type be digit '0' if
and only if the rules are only to be used for import, not for establishing
complete sort orders nor creating collators.
- We would not need an XML attribute, nor an ICU resource bundle entry, nor
would we add such types into bcp47/collation.xml.

For example, we might create a type="0kana" tailoring that would be
imported into the Japanese standard and unihan tailorings; and we might
create a type="0pinyin" tailoring that would be imported into the Chinese
pinyin and unihan tailorings.
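
As a rough illustration of why a naming convention avoids loading data: a
listing function can decide from the type name alone. The Python sketch
below assumes the proposed '0' prefix and a hypothetical list of type
names; the function names are made up and are not ICU or CLDR API.

```python
# Hypothetical sketch: filter import-only collation types by name alone.
# The '0' prefix is the convention proposed above; the type names and the
# function names are illustrative assumptions, not ICU or CLDR API.

IMPORT_ONLY_PREFIX = "0"


def is_import_only(collation_type: str) -> bool:
    """True if the type name marks rules meant only for import."""
    return collation_type.startswith(IMPORT_ONLY_PREFIX)


def available_types(all_types):
    """Return the types to advertise, without opening any rule data."""
    return [t for t in all_types if not is_import_only(t)]


print(available_types(["standard", "unihan", "0kana", "pinyin", "0pinyin"]))
```

With a data flag instead, the same filter would have to open each
tailoring before it could decide.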

Please let me know if you disagree.

Sincerely,
markus
-- 
Google Internationalization Engineering

From richard.wordingham at ntlworld.com  Mon Apr 21 04:23:13 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 21 Apr 2014 10:23:13 +0100
Subject: More Plural Categories?
Message-ID: <20140421102313.45e047dc@JRWUBU2>

I fear I've found a need for more plural categories.  I was
running my own English language data exploration program and came across
the following grammatical error in my output:

'... is a 11-element table.'

This fragment should, of course, have been

'... is an 11-element table.'

I'd not noticed this issue before; perhaps I'd been sensitised by
pondering the production of the Latin locale.

Does the 'other' category need to have a category extracted for
numbers that start with vowels?  These numbers would be something like

<pluralRule count="few">i in 11, 18, 80..89, 800..899,
1100..1199, 1800..1899, 8000..8999, 11000..11999, 18000..18999,
80000..89999, 800000..899999</pluralRule>

I don't see a nice way of carrying it on beyond a million.  There may
well be national variation in the validity of the 1100..1199 and
1800..1899 ranges.
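
Outside the rule syntax, the same ranges can be checked directly. A minimal
Python sketch, assuming integers below one million and adding 8 itself
(read 'eight'), which the list above does not include; the function name is
made up for illustration:

```python
# Ranges copied from the proposed rule above; valid below one million only.
# (8, 8) is an addition for this sketch: "an 8-element table".
VOWEL_INITIAL_RANGES = [
    (8, 8), (11, 11), (18, 18), (80, 89), (800, 899),
    (1100, 1199), (1800, 1899), (8000, 8999),
    (11000, 11999), (18000, 18999),
    (80000, 89999), (800000, 899999),
]


def indefinite_article(n: int) -> str:
    """Return 'an' if n falls in one of the listed ranges, else 'a'."""
    if any(lo <= n <= hi for lo, hi in VOWEL_INITIAL_RANGES):
        return "an"
    return "a"


print("... is", indefinite_article(11), "11-element table.")
```

The 1100..1199 and 1800..1899 entries encode the 'eleven hundred' reading,
so they would need to be dropped where that reading is not used.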

This complication will extend to quite a few languages.

Are negative numbers supposed to be supported?  Negative numbers belong
to the 'other' category in English, but CLDR seems to put -1 in the
'one' category for English.  There seems to be a subtle dependency on
whether the word 'minus' denotes a relative value or an absolute value.
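
The behaviour for -1 follows if the rule operands are taken from the
absolute value of the number, which is how I understand the LDML operand
definitions. A small Python sketch of the English rule "one: i = 1 and
v = 0" under that assumption (the function name is hypothetical):

```python
# Sketch of the English cardinal rule "one: i = 1 and v = 0", assuming the
# operands are derived from the absolute value of the number.
# plural_category_en is a made-up name, not a CLDR or ICU API.
from decimal import Decimal


def plural_category_en(number: str) -> str:
    """Classify a decimal string such as '-1' or '1.0' for English."""
    value = abs(Decimal(number))
    i = int(value)                        # integer digits of |number|
    exponent = value.as_tuple().exponent  # negative count of fraction digits
    v = -exponent if exponent < 0 else 0  # number of visible fraction digits
    return "one" if i == 1 and v == 0 else "other"


for sample in ["1", "-1", "1.0", "2", "-2"]:
    print(sample, plural_category_en(sample))
```

Under this reading the sign never reaches the rule, so a caller that wants
'-1' treated as 'other' would have to special-case negative values itself.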

The Welsh numbers are complicated enough for natural numbers.  They
deviate from taking the unmutated singular noun as follows:

zero: plural form for nouns
one: Soft mutation for feminine nouns
two: Soft mutation for all nouns
few (i.e. 3): Spirant mutation for masculine nouns
many (i.e. 6): Spirant mutation for all nouns
other: No mutation
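
Read as a lookup, the list above amounts to the following Python sketch.
It covers non-negative integers only, and the function name is invented
for illustration:

```python
# Sketch of the number-to-category mapping listed above for Welsh.
# The function name is hypothetical; only non-negative integers are handled.
WELSH_CATEGORIES = {0: "zero", 1: "one", 2: "two", 3: "few", 6: "many"}


def plural_category_cy(n: int) -> str:
    """Return the category for n; anything not listed is 'other'."""
    return WELSH_CATEGORIES.get(n, "other")


for n in range(8):
    print(n, plural_category_cy(n))
```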

However, it is not quite as simple as that, even ignoring the argument
that Welsh ought to be localised.  The complication arises with the
numerative forms of _blwyddyn_ 'year', namely _blynedd_ 'years' and
_blwydd_ 'years old'. While in general they unusually take the nasal
mutation for 'other' (yielding _mlynedd_ and _mlwydd_), the standard
form for '4 years' is 'pedair blynedd', with no mutation!  'Pedair
blwydd' is the standard form for '4 years old', though 'pedair mlwydd'
is quite common.  This makes a seventh category, for '4', but only
significant with _blynedd_ and, less so, _blwydd_, and archaic diction
with _diwrnod_ 'day'.

Welsh may precede numbers by the definite article as English does, so
there is variation between _y_ and _yr_ depending on whether the
following number starts with a vowel or not.  This splits 'other' much
as in English, with the complication that Welsh has both vigesimal and
decimal systems - see http://en.wikipedia.org/wiki/Welsh_numerals for a
quick summary.  The RBNF rules have gone for the decimal system.
Apparently the choice between the two systems is affected by what is
being counted.

Possibly the words for 'year' should be special-cased - it seems to
have exceptional usage with numbers in several languages.  For example,
in Thai, the ages of children should be expressed using ขวบ (tr. 'khuap')
instead of ปี (tr. 'pi') as the word for 'year'.

Talking of Thai, although usage seems quite variable, there is a rule
that the number for 'one' should follow the classifier rather than
precede it like other numbers.  Does this justify Thai having a
separate category 'one'?  (At present, it just has the sole
category 'other'.)  Possibly this is covered by the advice to consider
special-casing 0 and 1 anyway. There are several cases in Thai where
the numeral '1' normally disappears in speech, e.g. times of the day.
I am also wondering if the existence of what are translated as plural
forms of the demonstrative adjectives calls for a separate category
'one' in Thai.  Possibly one can just avoid using these plural forms
when the number of items (one v. more than one) is not known beforehand.

Richard.


From verdy_p at wanadoo.fr  Mon Apr 21 05:21:56 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 21 Apr 2014 12:21:56 +0200
Subject: More Plural Categories?
In-Reply-To: <20140421102313.45e047dc@JRWUBU2>
References: <20140421102313.45e047dc@JRWUBU2>
Message-ID: <CAGa7JC0TTRM3=VmBXBc-MV8qo4fFw2iJjbSCiowCUaxYWVxQ2Q@mail.gmail.com>

This is not a question of determining the plural form; it's completely
orthogonal and is a phonological mutation that can apply to lots of word
pairs; sometimes (not always) extended to the orthography. The rules are
extremely complex but do not depend on plurals, for example:

* In English you have "an egg" vs. "a chicken" ("an" before a noun starting
with a vowel), "a year" or "a yellow car" ("y" starting a noun or adjective
is considered a consonant here)

* In French the mutation of the nasal vowel to a denasalized vowel + /n/
consonant in "un enfant" occurs before a vowel (or a mute "h") starting
the next noun or adjective, but does not influence the orthography. There
are also cases of mutation by elision of a final mute "e", replaced by an
apostrophe (also in Italian), before a noun, adjective, or verb starting
with a vowel or mute "h", but there are exceptions ("un enfant de onze ans"
and usually not "d'onze ans", but "un enfant d'un an" and usually not "un
enfant de un an").

* many examples in many languages much more complex than English or French

Such phonological and sometimes orthographic/grammatical mutations are not
suitable for inclusion in plural rules; they do not depend (only) on the
value of numbers when they are present.



From richard.wordingham at ntlworld.com  Mon Apr 21 08:27:49 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 21 Apr 2014 14:27:49 +0100
Subject: More Plural Categories?
In-Reply-To: <CAGa7JC0TTRM3=VmBXBc-MV8qo4fFw2iJjbSCiowCUaxYWVxQ2Q@mail.gmail.com>
References: <20140421102313.45e047dc@JRWUBU2>
 <CAGa7JC0TTRM3=VmBXBc-MV8qo4fFw2iJjbSCiowCUaxYWVxQ2Q@mail.gmail.com>
Message-ID: <20140421142749.0fc6db13@JRWUBU2>

On Mon, 21 Apr 2014 12:21:56 +0200
Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> This is not a question of determining the plural form; it's
> completely orthogonal and is a phonological mutation that can apply to
> lots of word pairs; sometimes (not always) extended to the
> orthography.

What do you think the origin of the Welsh categories is? The
distinction between the Welsh categories two/few/many/other is in origin
a phonological distinction, as is most of the distinction in numeric
forms between 'one' and the others.  For 'one' v. the other four,
there are also the effects of the singular v. plural distinction, for
example on accompanying demonstratives and referring pronouns. 

> The rules are extremely complex but do not depend on
> plurals, for example:

> * In English you have "an egg" vs. "a chicken" ("an" before a noun
> starting with a vowel), "a year" or "a yellow car" ("y" starting a noun
> or adjective is considered a consonant here)

The idea is that a program slotting these words into a frame would
select a set of associated forms to be placed in the various
positions.  For English, the set would be at least the noun and
the indefinite article.  With numbers, there is the potential problem
that the number of such sets is unbounded.  The concept of the plural
categories is that the number then selects one of no more than say six
sets.

For example, the general form of a question may be, 'You have
selected 6 files; delete them?'.  Based on the number, one has to
select in English not only between 'files' and 'file' but also
'them' and 'it'.  In some languages, there might be a 3-way choice of
pronouns, and in some languages the value of the number may affect the
various verbs.
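
A minimal Python sketch of that selection, with the category logic reduced
to English integers. The message table and function names are illustrative
and not taken from any library:

```python
# Sketch: the plural category picks one complete set of forms (noun and
# pronoun together). Category logic is simplified to English integers; the
# names below are illustrative and not taken from any library.
MESSAGES_EN = {
    "one": "You have selected {n} file; delete it?",
    "other": "You have selected {n} files; delete them?",
}


def category_en(n: int) -> str:
    """Simplified English rule for non-negative integers."""
    return "one" if n == 1 else "other"


def delete_prompt(n: int) -> str:
    """Pick the whole sentence by category, then fill in the number."""
    return MESSAGES_EN[category_en(n)].format(n=n)


print(delete_prompt(1))
print(delete_prompt(6))
```

Keying whole sentences by category keeps 'file'/'it' and 'files'/'them'
paired, and the table stays bounded however large the number gets.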

> * In French ... "un enfant de onze ans" and usually not "d'onze ans",
> but "un enfant d'un an" and usually not "un enfant de un an"...

Should not this be captured by CLDR?

> Such phonological and sometimes orthographic/grammatical mutations
> are not suitable for inclusion in plural rules; they do not depend
> (only) on the value of numbers when they are present.

One can select the form from the number.  The only question is whether
it would be better to apply a phonological rule to the composed form.
If that were the decision, then CLDR ought to contain the
transformation.  However, in your example, it does not just involve a
simple phonological rule; there is the difficult decision of whether to
apply it.

Now, spelt out numbers in Sanskrit might be a good case for the
mechanical application of sandhi.

Richard.

From emmo at us.ibm.com  Mon Apr 21 10:51:34 2014
From: emmo at us.ibm.com (John Emmons)
Date: Mon, 21 Apr 2014 10:51:34 -0500
Subject: CLDR/ICU proposal: collation rules for import only
In-Reply-To: <CAN49p6rUE+BRE_=25vbbryj0iPQK3asHujKm6aqPcpQ97eKaEw@mail.gmail.com>
References: <CAN49p6rUE+BRE_=25vbbryj0iPQK3asHujKm6aqPcpQ97eKaEw@mail.gmail.com>
Message-ID: <OFA45A463A.23228510-ON86257CC1.0056BC84-86257CC1.00571E5D@us.ibm.com>


I would prefer that we have an attribute for it, so that it is crystal
clear to everyone exactly what is going on.  I really don't like the idea
of "0" + ruleset naming convention.

We have a similar situation in the RBNF rules.  There we use:

<ruleset type="and-feminine" access="private">

I would think that the most logical thing would be to extend the use of the
access attribute, such that we have:

<rules access="private">


Regards,

John C. Emmons
Globalization Architect & Unicode CLDR TC Chairman
IBM Software Group
Internet: emmo at us.ibm.com



From markus.icu at gmail.com  Mon Apr 21 12:14:41 2014
From: markus.icu at gmail.com (Markus Scherer)
Date: Mon, 21 Apr 2014 10:14:41 -0700
Subject: CLDR/ICU proposal: collation rules for import only
In-Reply-To: <OFA45A463A.23228510-ON86257CC1.0056BC84-86257CC1.00571E5D@us.ibm.com>
References: <CAN49p6rUE+BRE_=25vbbryj0iPQK3asHujKm6aqPcpQ97eKaEw@mail.gmail.com>
 <OFA45A463A.23228510-ON86257CC1.0056BC84-86257CC1.00571E5D@us.ibm.com>
Message-ID: <CAN49p6ocrOkDaDa0-8kobFF35mdLg5MEvid-XxNdrzTSOMH98w@mail.gmail.com>

On Mon, Apr 21, 2014 at 8:51 AM, John Emmons <emmo at us.ibm.com> wrote:

> I would prefer that we have an attribute for it, so that it is crystal
> clear to everyone exactly what is going on.  I really don't like the idea
> of "0" + ruleset naming convention.
>
Well, the attribute approach has problems, as I said:
- I don't want to have to load the data just to find out if it's
"available".
- I want it to be clear which collation types we add to bcp47/collation.xml
and which we don't.
- I want it to be clear for which collation types to collect display names.

If the CLDR committee feels strongly, then maybe we can use both an
attribute and a naming convention, and make sure that they are used
together (both or neither).
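
A "both or neither" check could look roughly like the Python sketch below.
The XML fragment, the access="private" attribute on <collation>, and the
'0' prefix are all assumptions for illustration; none of them was settled
at this point in the thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: neither the attribute name nor the "0" prefix was
# settled in this thread; both are assumptions for illustration.
SAMPLE = """
<collations>
  <collation type="standard"/>
  <collation type="0kana" access="private"/>
  <collation type="0pinyin"/>
</collations>
"""


def mismatched_types(xml_text: str):
    """Return types where the name prefix and the attribute disagree."""
    root = ET.fromstring(xml_text)
    bad = []
    for coll in root.iter("collation"):
        name = coll.get("type", "")
        by_name = name.startswith("0")
        by_attr = coll.get("access") == "private"
        if by_name != by_attr:
            bad.append(name)
    return bad


print(mismatched_types(SAMPLE))
```

Here '0pinyin' is reported because it carries the prefix without the
attribute.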

> We have a similar situation in the RBNF rules.  There we use:
>
> <ruleset type="and-feminine" access="private">
>
> I would think that the most logical thing would be to extend the use of
> the access attribute, such that we have:
>
> <rules access="private">
>
Well, <rules> is deprecated and not used any more.

The design doc says <settings private="true">

However, if it's an attribute, then it should really be on the <collation>
element -- and I don't care if it's <collation type="0kana" private="true">
or <collation type="0kana" access="private">.

Or maybe it should be an element, to avoid adding a non-distinguishing
attribute <http://cldr.unicode.org/development/updating-dtds#TOC-Attributes->.

(All <settings> change collation behavior, but "private" is something
totally different.)

markus

From mark at macchiato.com  Mon Apr 21 14:19:08 2014
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Mon, 21 Apr 2014 22:19:08 +0300
Subject: [icu-design] CLDR/ICU proposal: collation rules for import only
In-Reply-To: <CAN49p6ocrOkDaDa0-8kobFF35mdLg5MEvid-XxNdrzTSOMH98w@mail.gmail.com>
References: <CAN49p6rUE+BRE_=25vbbryj0iPQK3asHujKm6aqPcpQ97eKaEw@mail.gmail.com>
 <OFA45A463A.23228510-ON86257CC1.0056BC84-86257CC1.00571E5D@us.ibm.com>
 <CAN49p6ocrOkDaDa0-8kobFF35mdLg5MEvid-XxNdrzTSOMH98w@mail.gmail.com>
Message-ID: <CAJ2xs_Fn6B+9qOgSGjqaHG+6JkNfC+TDYg5V-KOzkMoK-0Q9BQ@mail.gmail.com>

I agree with John that 0kana is obscure. I prefer as well the private
attribute.

On the other hand, I can also see a convention that makes it easier to
know that something is private. That would make some of the RBNF rules
clearer, for example.

What I suggest is both the attribute and naming convention (and a test to
ensure they match). But 0 is way too ugly. My suggestions would be along
the following lines:

<foo type="_foobar" access="private">

That is, _x signals private. This follows the convention that some people
follow for _x being a local variable.

<foo type="private_foobar" access="private">

This convention would make it *very* clear what was expected to be private!

foo would be rbnf, collation, transliteration, etc.

{phone}

From markus.icu at gmail.com  Mon Apr 21 17:08:04 2014
From: markus.icu at gmail.com (Markus Scherer)
Date: Mon, 21 Apr 2014 15:08:04 -0700
Subject: [icu-design] CLDR/ICU proposal: collation rules for import only
In-Reply-To: <CAJ2xs_Fn6B+9qOgSGjqaHG+6JkNfC+TDYg5V-KOzkMoK-0Q9BQ@mail.gmail.com>
References: <CAN49p6rUE+BRE_=25vbbryj0iPQK3asHujKm6aqPcpQ97eKaEw@mail.gmail.com>
 <OFA45A463A.23228510-ON86257CC1.0056BC84-86257CC1.00571E5D@us.ibm.com>
 <CAN49p6ocrOkDaDa0-8kobFF35mdLg5MEvid-XxNdrzTSOMH98w@mail.gmail.com>
 <CAJ2xs_Fn6B+9qOgSGjqaHG+6JkNfC+TDYg5V-KOzkMoK-0Q9BQ@mail.gmail.com>
Message-ID: <CAN49p6oLwz7K=rvvgJcUwQFhVJnDb8a3qccXQa=Qvie3A-SCnw@mail.gmail.com>

On Mon, Apr 21, 2014 at 12:19 PM, Mark Davis ☕ <mark at macchiato.com> wrote:

> What I suggest is both the attribute and naming convention (and a test to
> ensure they match). But 0 is way too ugly. My suggestions would be along
> the following lines:
>
I picked a prefix '0' because I assume that even these internal types need
to be valid in language tags.
At least in ICU we assemble something like sr_Latn's <import source="hr"
type="search"/> into syntax with a language tag like [import
hr-u-co-search].
Therefore, the type needs to be a valid subtag, with [a-z0-9] and at most 8
characters.
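
As a quick check of that constraint, here is a Python sketch that tests
candidate type names. The upper bound of 8 is as stated above; the lower
bound of 3 is taken from RFC 6067 and is an addition here, and the function
name is made up:

```python
import re

# One subtag of a Unicode locale extension type: ASCII letters or digits.
# The upper bound of 8 is stated above; the lower bound of 3 comes from
# RFC 6067 and is an addition here. Lowercase only, for simplicity.
SUBTAG = re.compile(r"[a-z0-9]{3,8}")


def is_valid_type(collation_type: str) -> bool:
    """True if every hyphen-separated part is a valid subtag."""
    return all(SUBTAG.fullmatch(p) for p in collation_type.split("-"))


for candidate in ["0kana", "search", "_foobar", "private_foobar"]:
    print(candidate, is_valid_type(candidate))
```

Both underscore-based suggestions fail this check, because '_' is not
allowed in a subtag and 'private_foobar' is also longer than 8 characters.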

I agree that '0' is not pretty, but it seemed like the best possible prefix
given the constraints, and given that none of the existing types begins
with a digit.

Also, as I said in my previous email, if y'all do want more than a naming
convention, then it should probably be an element, not a non-distinguishing
attribute.

markus

From markus.icu at gmail.com  Wed Apr 23 11:01:40 2014
From: markus.icu at gmail.com (Markus Scherer)
Date: Wed, 23 Apr 2014 09:01:40 -0700
Subject: [icu-design] CLDR/ICU proposal: collation rules for import only
In-Reply-To: <CAN49p6oLwz7K=rvvgJcUwQFhVJnDb8a3qccXQa=Qvie3A-SCnw@mail.gmail.com>
References: <CAN49p6rUE+BRE_=25vbbryj0iPQK3asHujKm6aqPcpQ97eKaEw@mail.gmail.com>
 <OFA45A463A.23228510-ON86257CC1.0056BC84-86257CC1.00571E5D@us.ibm.com>
 <CAN49p6ocrOkDaDa0-8kobFF35mdLg5MEvid-XxNdrzTSOMH98w@mail.gmail.com>
 <CAJ2xs_Fn6B+9qOgSGjqaHG+6JkNfC+TDYg5V-KOzkMoK-0Q9BQ@mail.gmail.com>
 <CAN49p6oLwz7K=rvvgJcUwQFhVJnDb8a3qccXQa=Qvie3A-SCnw@mail.gmail.com>
Message-ID: <CAN49p6q=hYCQmAthzbiasUn1+KGZnRf1AvgChZaPygrZeO=0VA@mail.gmail.com>

In CLDR team discussion today we settled on a more obvious, less "ugly"
naming convention, using a two-part type that turns into two language
subtags.

In CLDR data:
    <collation type="private-kana">
        <cr>...

    <collation type="standard">
        <import source="ja" type="private-kana"/>
        <cr>...

which in ICU would turn into
    [import ja-u-co-private-kana]
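
The mapping from the <import> attributes to the rule syntax can be sketched
in Python as follows. The function name is hypothetical and this is not
ICU's actual implementation; the output for an import without a type is an
assumption.

```python
# Sketch of the mapping shown above. The function name is hypothetical and
# this is not ICU's actual implementation.
PRIVATE_PREFIX = "private-"


def import_rule(source: str, collation_type: str = "") -> str:
    """Build an [import ...] string from the source and type attributes."""
    tag = source
    if collation_type:
        # A two-part type such as "private-kana" becomes two subtags.
        tag += "-u-co-" + collation_type
    return "[import " + tag + "]"


def is_private(collation_type: str) -> bool:
    """True for types that follow the 'private-' naming convention."""
    return collation_type.startswith(PRIVATE_PREFIX)


print(import_rule("ja", "private-kana"))
print(import_rule("hr", "search"))
print(is_private("private-kana"), is_private("standard"))
```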

Multi-part keyword values are already used for ca (calendar type e.g.
islamic-tbla), kr (script reordering), vt (deprecated variableTop) and maybe
more.

Thanks for the feedback!
markus