From markus.icu at gmail.com Thu Apr 3 12:01:02 2014
From: markus.icu at gmail.com (Markus Scherer)
Date: Thu, 3 Apr 2014 10:01:02 -0700
Subject: CLDR proposal: Unicode algorithms should fall back to root, not to unrelated default locale
Message-ID:

Dear CLDR team & users,

We have consensus in the ICU team for a modified fallback policy for when data is requested for a service based on a Unicode algorithm. Assuming that such a policy is appropriate for the LDML spec (I have not looked whether the spec currently mentions fallbacks in the absence of data), I propose that we add the following:

When requesting a specific locale for collation, break iteration, or case mapping, when we do not have any data for even the locale's base language, then we should fall back to the root locale rather than the default locale.

Note: This will not change behavior for languages for which we do have specific data for the service, even if it is an empty data file.

Each of these services tailors a Unicode algorithm which is explicitly designed to provide reasonable default behavior when no language-specific behavior is known or available.

For example, in 2013/ICU 52m1, we had an "environment test" failure (ticket:10277) that was caused by requesting Basque (eu) collation and AlphabeticIndex when the default locale was Azerbaijani (az), Lithuanian, or Estonian (et) (and maybe more languages); in Azerbaijani, x sorts between h and i; this is undesirable when the request was for Basque. In the absence of specific Basque data, we should assume that the all-Unicode root sort order is appropriate.

Similarly, it is undesirable to fall back from French to Turkish case mappings, or from Italian to Finnish line breaking.

By contrast, for UI languages, display names, and formatting, the root locale is not useful: No UI messages, ISO codes instead of display names, minimal patterns.
By falling back to a default locale, the user gets strings in what is hopefully a language they understand, even if not the language they requested.

Sincerely,
markus
--
Google Internationalization Engineering

From richard.wordingham at ntlworld.com Thu Apr 3 15:21:09 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 3 Apr 2014 21:21:09 +0100
Subject: CLDR proposal: Unicode algorithms should fall back to root, not to unrelated default locale
In-Reply-To:
References:
Message-ID: <20140403212109.33833276@JRWUBU2>

On Thu, 3 Apr 2014 10:01:02 -0700
Markus Scherer wrote:

> When requesting a specific locale for collation, break iteration, or
> case mapping, when we do not have any data for even the locale's base
> language, then we should fall back to the root locale rather than the
> default locale.

Would language matching data take preference over either? I can see deserving use cases where the default language is the national language and the selected locale is for a minority language.

How are break iteration rules meant to interact with dictionary-based word and line-breakers?

> Note: This will not change behavior for languages for which we do have
> specific data for the service, even if it is an empty data file.

Richard.

From richard.wordingham at ntlworld.com Thu Apr 3 16:30:36 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 3 Apr 2014 22:30:36 +0100
Subject: Non-primary Weights of U+FFFE
In-Reply-To:
References: <20140330132445.43398a4e@JRWUBU2>
Message-ID: <20140403223036.3ab46070@JRWUBU2>

On Sun, 30 Mar 2014 09:17:44 -0700
Markus Scherer wrote:

> On Sun, Mar 30, 2014 at 5:24 AM, Richard Wordingham <
> richard.wordingham at ntlworld.com> wrote:
>
> > Is there any reason that a CLDR-compliant collation algorithm should
> > particularly care about the non-primary weights of U+FFFE?
> > So long
> > as they satisfy the well-formedness conditions, all I can see is
> > that having unique values *may* simplify sort key formation for
> > reversed levels.
> >
>
> The non-primary weights need to be greater than the level
> separator(s)

Guaranteed by WF1 and S3.2

> and less than the weights of CEs that are ignorable on
> previous levels.

Guaranteed by WF2 plus case-related rules, even if U+FFFE is not treated as a special case.

> It is also important to generate the special weights
> on primary to tertiary levels for shifted CEs, so that
> alternate=shifted works properly.

Can you expand on this, because I don't see any such need at the primary to tertiary levels. From your comment on ICU below, I can now see that you are specifying a behaviour for the quaternary level.

Now, in full strength comparisons, we have, whatever the alternate setting,

"op" < "?p"
"o p" < "op"

Now, "o\uFFFE p" < "o\uFFFEp" < "o \uFFFEp" for alternate=non-ignorable. However, if the quaternary level weight of \uFFFE was calculated by the Unicode Collation Algorithm using allkeys_CLDR.txt as its collation element table, we would have

"o \uFFFEp" < "o\uFFFE p" < "o\uFFFEp" for alternate=non-ignorable

To get the same ordering for these strings as for alternate=non-ignorable, one needs U+FFFE to have a minimal quaternary weight. I don't see a test for this in CollationTest_CLDR_SHIFTED.txt.

It seems that the UCA should be adjusted (in Section 3.6, variable weighting) so that L4 weights for L1 non-variable but less than a variable weight is 'as L1', rather than FFFF. If I formally report this, should it be via a CLDR ticket or through the general Unicode mechanism?

> In ICU, we have test code that expects the same sort keys generated
> from concatenating two strings with U+FFFE vs. calling
> ucol_mergeSortkeys() on the two separate sort keys.
> The latter merges
> sort keys by copying each level (separated by byte 01) from each sort
> key and inserting a byte 02 between the bytes from different sort
> keys. (see
> ucol.h )

So is the reason for unique weights at the secondary to tertiary levels simply that you don't want to have to unpick ICU's run-length compression for your test?

Richard.

From markus.icu at gmail.com Thu Apr 3 22:01:40 2014
From: markus.icu at gmail.com (Markus Scherer)
Date: Thu, 3 Apr 2014 20:01:40 -0700
Subject: CLDR proposal: Unicode algorithms should fall back to root, not to unrelated default locale
In-Reply-To: <20140403212109.33833276@JRWUBU2>
References: <20140403212109.33833276@JRWUBU2>
Message-ID:

On Thu, Apr 3, 2014 at 1:21 PM, Richard Wordingham <
richard.wordingham at ntlworld.com> wrote:

> Would language matching data take preference over either?
>

Language matching should happen earlier. You would match a desired language against the list of known available languages. Then when you open a service object there with the resulting language, you don't get into this situation.

> How are break iteration rules meant to interact with dictionary-based
> word and line-breakers?
>

In CLDR and ICU, the rules specify the set of characters that need dictionary support. (It's triggered by script, not by language.)

I expect that there will generally be data for language-specific exceptions, overrides and such for more languages than character-level segmentation rules. Those low-level rules should always fall back to root when there is no language-specific data. I think the higher-level exceptions should probably also avoid going through some default language.

markus
From markus.icu at gmail.com Thu Apr 3 23:17:10 2014
From: markus.icu at gmail.com (Markus Scherer)
Date: Thu, 3 Apr 2014 21:17:10 -0700
Subject: Non-primary Weights of U+FFFE
In-Reply-To: <20140403223036.3ab46070@JRWUBU2>
References: <20140330132445.43398a4e@JRWUBU2> <20140403223036.3ab46070@JRWUBU2>
Message-ID:

On Thu, Apr 3, 2014 at 2:30 PM, Richard Wordingham <
richard.wordingham at ntlworld.com> wrote:

> > It is also important to generate the special weights
> > on primary to tertiary levels for shifted CEs, so that
> > alternate=shifted works properly.
>
> Can you expand on this, because I don't see any such need at the
> primary to tertiary levels.
>

I think I confused myself. Please ignore this sentence and instead read what I put into the spec:

1.1.1 U+FFFE

U+FFFE maps to a CE with special minimal weights on all levels, including case, quaternary and identical levels - which may require special code for those levels. Its primary weight is not "variable": U+FFFE must not become ignorable in alternate handling.

> From your comment on ICU below, I can now see that you are specifying
> a behaviour for the quaternary level.

"all levels" includes quaternary and identical.

> Now, in full strength
> comparisons, we have, whatever the alternate setting,
>
> "op" < "?p"
> "o p" < "op"
>
> Now, "o\uFFFE p" < "o\uFFFEp" < "o \uFFFEp" for alternate=non-ignorable.
> However, if the quaternary level weight of \uFFFE was calculated by
> the Unicode Collation Algorithm using allkeys_CLDR.txt as its collation
> element table, we would have
>
> "o \uFFFEp" < "o\uFFFE p" < "o\uFFFEp" for alternate=non-ignorable
>
> To get the same ordering for these strings as for
> alternate=non-ignorable, one needs U+FFFE to have a minimal quaternary
> weight. I don't see a test for this in CollationTest_CLDR_SHIFTED.txt.
>
> It seems that the UCA should be adjusted (in Section 3.6, variable
> weighting) so that L4 weights for L1 non-variable but less than a
> variable weight is 'as L1', rather than FFFF. If I formally report
> this, should it be via a CLDR ticket or through the general Unicode
> mechanism?
>

I am not sure what you mean. The special mapping and behavior exist in CLDR but not in the UCA, so none of this applies to UTS #10.

With ICU 53 which implements this, I get

<1 o\uFFFE p    45 02 47 , 05 02 05 , 05 02 05 , 1C 02 04 1C .
<4 o\uFFFEp     45 02 47 , 05 02 05 , 05 02 05 , 1C 02 1C .
<4 o \uFFFEp    45 02 47 , 05 02 05 , 05 02 05 , 1C 04 02 1C .

(http://demo.icu-project.org/icu-bin/collation.html with strength=quaternary, alternate=shifted, sort keys=on, and your input strings)

> > In ICU, we have test code that expects the same sort keys generated
> > from concatenating two strings with U+FFFE vs. calling
> > ucol_mergeSortkeys() on the two separate sort keys. The latter merges
> > sort keys by copying each level (separated by byte 01) from each sort
> > key and inserting a byte 02 between the bytes from different sort
> > keys. (see
> > ucol.h )
>
> So is the reason for unique weights at the secondary to tertiary levels
> simply that you don't want to have to unpick ICU's run-length
> compression for your test?
>

For ICU, we use weights and code to make U+FFFE behave exactly like the function that works on finished sort keys. It makes it easy to test that it works right.

This behavior might not otherwise be necessary. It might even work if you give U+FFFE "common" non-primary weights and apply the run-length compression across it. At least I can't find a reason why it would not work. If this is true, then we could weaken the spec and turn some of the current requirement into a recommendation.

markus
From markus.icu at gmail.com Fri Apr 4 10:49:27 2014
From: markus.icu at gmail.com (Markus Scherer)
Date: Fri, 4 Apr 2014 08:49:27 -0700
Subject: Non-primary Weights of U+FFFE
In-Reply-To:
References: <20140330132445.43398a4e@JRWUBU2> <20140403223036.3ab46070@JRWUBU2>
Message-ID:

Now I know: U+FFFE needs special low weights on all levels because we have always done it that way!

Just kidding. I submitted http://unicode.org/cldr/trac/ticket/7202

markus

From richard.wordingham at ntlworld.com Fri Apr 4 14:36:42 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 4 Apr 2014 20:36:42 +0100
Subject: Non-primary Weights of U+FFFE
In-Reply-To:
References: <20140330132445.43398a4e@JRWUBU2> <20140403223036.3ab46070@JRWUBU2>
Message-ID: <20140404203642.611149db@JRWUBU2>

On Thu, 3 Apr 2014 21:17:10 -0700
Markus Scherer wrote:

> On Thu, Apr 3, 2014 at 2:30 PM, Richard Wordingham <
> richard.wordingham at ntlworld.com> wrote:

> > Now, in full strength
> > comparisons, we have, whatever the alternate setting,
> >
> > "op" < "?p"
> > "o p" < "op"
> >
> > Now, "o\uFFFE p" < "o\uFFFEp" < "o \uFFFEp" for
> > alternate=non-ignorable. However, if the quaternary level weight of
> > \uFFFE was calculated by the Unicode Collation Algorithm using
> > allkeys_CLDR.txt as its collation element table, we would have
> >
> > "o \uFFFEp" < "o\uFFFE p" < "o\uFFFEp" for alternate=non-ignorable

Sorry, I meant to write

"o \uFFFEp" < "o\uFFFE p" < "o\uFFFEp" for alternate=shifted

> > To get the same ordering for these strings as for
> > alternate=non-ignorable, one needs U+FFFE to have a minimal
> > quaternary weight. I don't see a test for this in
> > CollationTest_CLDR_SHIFTED.txt.

The problem here is that the collation test is passed whether one uses the UCA or the CLDR collation algorithm, whereas these currently define different orders for these three strings with alternate=shifted.
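To make the disagreement concrete, the three ICU 53 sort keys that Markus quoted for alternate=shifted can be compared bytewise. This is only a sketch: the helper name parse_demo_key is made up for illustration, and it assumes that in the demo's printed form "," stands for the level separator byte 01 and "." for the terminator byte 00.

```python
def parse_demo_key(text: str) -> bytes:
    """Turn the demo's printed sort key into raw bytes.

    Assumption (not confirmed by the thread): "," is the level
    separator 01 and "." is the terminator 00.
    """
    out = bytearray()
    for token in text.split():
        if token == ",":
            out.append(0x01)
        elif token == ".":
            out.append(0x00)
        else:
            out.append(int(token, 16))
    return bytes(out)


# Sort keys as quoted from the ICU 53 demo (strength=quaternary,
# alternate=shifted). The labels are the escaped input strings.
KEYS = {
    "o\\uFFFE p": parse_demo_key("45 02 47 , 05 02 05 , 05 02 05 , 1C 02 04 1C ."),
    "o\\uFFFEp": parse_demo_key("45 02 47 , 05 02 05 , 05 02 05 , 1C 02 1C ."),
    "o \\uFFFEp": parse_demo_key("45 02 47 , 05 02 05 , 05 02 05 , 1C 04 02 1C ."),
}


def shifted_order() -> list:
    """Return the labels ordered by plain byte comparison of their keys."""
    return [label for label, _ in sorted(KEYS.items(), key=lambda kv: kv[1])]


if __name__ == "__main__":
    print(" < ".join(shifted_order()))
```

Under these assumptions the keys differ only on the quaternary level, and the resulting order is the same as the one given for alternate=non-ignorable rather than the order the UCA would produce from allkeys_CLDR.txt.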
> > It seems that the UCA should be adjusted (in Section 3.6, variable
> > weighting) so that L4 weights for L1 non-variable but less than a
> > variable weight is 'as L1', rather than FFFF. If I formally report
> > this, should it be via a CLDR ticket or through the general Unicode
> > mechanism?

> I am not sure what you mean. The special mapping and behavior exist
> in CLDR but not in the UCA, so none of this applies to UTS #10.

Non-variable primary weights less than variable primary weights exist in the UCA, and are established by allkeys_CLDR.txt. It so happens that there aren't any such weights in *DUCET* - just as there aren't any tertiary collation elements.

Returning to the LDML specification, Markus pointed out that in the account of U+FFFE,

> "all levels" includes quaternary and identical.

The concept of a collation element does not really apply at the identical level - its formation does not respect the division of a string into collating elements. For example, has collating elements and , but the identical level contribution to the sort key is 0443, 0308, 0334. Now the concept of U+FFFE requires that at the 'identical' level, "a\u0000\uFFFE" sort after "a\uFFFE". At its simplest, this requires that U+FFFE be transformed to a negative scalar value!

Now, as I understand it, the identical level is not intended to address any cultural concepts of ordering, but simply as a convenience in handling inequivalent strings, so that (a) distinct strings need not compare as equal, and (b) canonically equivalent strings are ordered together.

However, there are cases where changing the ordering of indecomposable codepoints might have benefits - non-spacing Hebrew accents (all ignorable) and kashida (U+0640 ARABIC TATWEEL) come to mind.

The simplest mechanism I can see is for the UCA to allow a tailoring to permute scalar values for the purposes of the identical level. Thus, for CLDR root, we would have the permutation (U+0000 ..
U+FFFE), and for CLDR we would require that U+FFFE be permuted to U+0000. (For collation, a permutation of all scalar values is equivalent to a permutation of all indecomposable scalar values, and allowing a formal permutation of all scalar values is simpler.) It is not necessary for CLDR to support any other permutations - it has no mechanisms for tailoring casing for collation and only limited mechanisms for creating extra levels.

Richard.

From markus.icu at gmail.com Fri Apr 4 18:55:42 2014
From: markus.icu at gmail.com (Markus Scherer)
Date: Fri, 4 Apr 2014 16:55:42 -0700
Subject: Non-primary Weights of U+FFFE
In-Reply-To: <20140404203642.611149db@JRWUBU2>
References: <20140330132445.43398a4e@JRWUBU2> <20140403223036.3ab46070@JRWUBU2> <20140404203642.611149db@JRWUBU2>
Message-ID:

On Fri, Apr 4, 2014 at 12:36 PM, Richard Wordingham <
richard.wordingham at ntlworld.com> wrote:

> Non-variable primary weights less than variable primary weights exist
> in the UCA, and are established by allkeys_CLDR.txt.

Only for U+FFFE.

> Returning to the LDML specification, Markus pointed out that in the
> account of U+FFFE,
> > "all levels" includes quaternary and identical.
>
> The concept of a collation element does not really apply at the
> identical level - its formation does not respect the division of a
> string into collating elements. For example, LETTER U, U+0308 COMBINING DIAERESIS, U+0334 COMBINING TILDE> has
> collating elements and , but the identical
> level contribution to the sort key is 0443, 0308, 0334. Now the
> concept of U+FFFE requires that at the 'identical' level,
> "a\u0000\uFFFE" sort after "a\uFFFE".

Right. With ICU 53:

<1 a\uFFFE    29 02 , 05 02 , 05 02 , 02 , 92 02 .

> At its simplest, this requires
> that U+FFFE be transformed to a negative scalar value!
>

That depends on how you encode the identical level.
In the UCA as written, you could do a transformation like this:

FFFE->0000
0000->0001 0001
0001->0001 0002

In ICU, we use a simple "compression" scheme (a delta encoding) that preserves binary order, and we reserved byte values 00 (terminator), 01 (level separator), 02 (for U+FFFE).

> Now, as I understand it, the identical level is not intended to address
> any cultural concepts of ordering, but simply as a convenience in
> handling inequivalent strings, so that (a) distinct strings need not
> compare as equal, and (b) canonically equivalent strings are ordered
> together.

Yes. It's mostly a semi-arbitrary tie-breaker, except that in the CLDR Japanese tailoring it provides the distinctions of JIS X 4061 level 5 (compatibility forms of Japanese characters sort after their regular forms).

markus

From richard.wordingham at ntlworld.com Sat Apr 5 11:30:31 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sat, 5 Apr 2014 17:30:31 +0100
Subject: CLDR proposal: Unicode algorithms should fall back to root, not to unrelated default locale
In-Reply-To:
References: <20140403212109.33833276@JRWUBU2>
Message-ID: <20140405173031.4d4eb558@JRWUBU2>

On Thu, 3 Apr 2014 20:01:40 -0700
Markus Scherer wrote:

> On Thu, Apr 3, 2014 at 1:21 PM, Richard Wordingham <
> richard.wordingham at ntlworld.com> wrote:

>> How are break iteration rules meant to interact with
>> dictionary-based word and line-breakers?

> In CLDR and ICU, the rules specify the set of characters that need
> dictionary support. (It's triggered by script, not by language.)

In CLDR, which rules are these? I can't find them. All I can find is statements outside CLDR such as "For Thai, Lao, Khmer, Myanmar, and other scripts that do not typically use spaces between words, a good implementation should not depend on the default word boundary specification" in UAX#29 'Unicode Text Segmentation'.
Now, some minority languages in these scripts use spaces between words, as can be seen in the Northern Khmer bible (e.g. at http://www.amazon.com/Bible-Northern-Khmer-Black-Cover/dp/9749141083). While Thai might be a good fallback language for kxm-Thai-TH (there is some usage of kxm-Khmr-TH), a Thai dictionary-based break iterator would be a disaster. On the other hand, I would hope for tolerable breaking performance from a Thai dictionary-based break iterator for North-Eastern Thai (tts-Thai-TH), which does not separate words. By contrast, I would describe the performance for phonetically written Northern Thai, as revealed by the Thai spell-checker in LibreOffice, as unsurprisingly poor.

> I expect that there will generally be data for language-specific
> exceptions, overrides and such for more languages than character-level
> segmentation rules. Those low-level rules should always fall back to
> root when there is no language-specific data. I think the higher-level
> exceptions should probably also avoid going through some default
> language.

If breakers just ignore the segmentation rules, then it should always help to define rough and ready segmentation rules for every language that uses a mainland SE Asian script as identified by Line_Break=SA. Syllable breaking is generally a good approximation to word and line-breaking, and in the visually ordered scripts, the preposed vowels start syllables. One needs a good reason to default the segmentation rules to root for such languages.

Turning to collation, is the way to provide defaulting for collation tag in collation/root.xml to list all languages as valid sublocales?

I am a bit confused as to the point of having the file collation/en.xml. What does it achieve? Does it exist purely for the sake of its comment?

Richard.
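For readers following the fallback discussion, the policy proposed at the start of this thread (including the note that an empty data file still counts as data) can be sketched roughly as follows. Everything here is hypothetical: the AVAILABLE table, the service names and the function names are invented for illustration and are not ICU or CLDR APIs.

```python
# Hypothetical inventory of which locales have a data file per service.
# An empty file (such as collation/en.xml) still counts as "has data".
AVAILABLE = {
    "collation": {"root", "az", "lt", "et", "en"},
    "messages": {"root", "az", "en"},
}

# Services that tailor a Unicode algorithm and therefore have a usable root.
ALGORITHM_SERVICES = {"collation", "break", "casemap"}


def parent_chain(locale: str):
    """Yield the locale and its truncations, e.g. eu_ES, then eu."""
    parts = locale.split("_")
    while parts:
        yield "_".join(parts)
        parts.pop()


def resolve(service: str, requested: str, default: str) -> str:
    """Pick the locale whose data would be used for a request."""
    available = AVAILABLE.get(service, {"root"})
    for candidate in parent_chain(requested):
        if candidate in available:
            return candidate
    # Only non-algorithm services (UI strings, display names, formats)
    # go through the default locale before reaching root.
    if service not in ALGORITHM_SERVICES:
        for candidate in parent_chain(default):
            if candidate in available:
                return candidate
    return "root"


if __name__ == "__main__":
    print(resolve("collation", "eu_ES", "az"))  # Basque collation request
    print(resolve("messages", "eu_ES", "az"))   # Basque UI strings request
```

In this sketch a Basque collation request with an Azerbaijani default resolves to root, while a Basque UI-message request still resolves to the default locale, and an English collation request stops at the (empty) en entry.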
From markus.icu at gmail.com Sat Apr 5 12:12:10 2014
From: markus.icu at gmail.com (Markus Scherer)
Date: Sat, 5 Apr 2014 10:12:10 -0700
Subject: CLDR proposal: Unicode algorithms should fall back to root, not to unrelated default locale
In-Reply-To: <20140405173031.4d4eb558@JRWUBU2>
References: <20140403212109.33833276@JRWUBU2> <20140405173031.4d4eb558@JRWUBU2>
Message-ID:

On Sat, Apr 5, 2014 at 9:30 AM, Richard Wordingham <
richard.wordingham at ntlworld.com> wrote:

> > In CLDR and ICU, the rules specify the set of characters that need
> > dictionary support. (It's triggered by script, not by language.)
>
> In CLDR, which rules are these?

I think it's
\p{Line_Break=Complex_Context}
which you can find in the line-break rules in
http://unicode.org/cldr/trac/browser/trunk/common/segments/root.xml

Also, as far as I know, the ICU rule syntax is different enough from the CLDR syntax that the conversion is manual. The ICU dictionary support might need a manual addition.

(Others know a lot more about segmentation than I do.)

> Turning to collation, is the way to provide defaulting for collation
> tag in collation/root.xml to list all languages as valid sublocales?

The validSubLocales data was removed from CLDR. Instead, we have some empty base-language collation files to document that the root order is known to be appropriate; as opposed to the absence of a base-language collation file which basically means "don't know".

> I am a bit confused as to the point of having the file collation/en.xml.
> What does it achieve? Does it exist purely for the sake of its comment?
>

Yes.

In addition, in the current ICU implementation (I am not sure about the LDML spec), an empty base-language file means we find something and don't go through the default locale.

When we agree that collation should go directly to root, rather than to the default locale, then we could remove the empty resource bundles from ICU (although they are very small).
We would keep the empty CLDR files for documentation.

markus

From richard.wordingham at ntlworld.com Mon Apr 7 18:39:32 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 8 Apr 2014 00:39:32 +0100
Subject: CLDR proposal: Unicode algorithms should fall back to root, not to unrelated default locale
In-Reply-To:
References: <20140403212109.33833276@JRWUBU2> <20140405173031.4d4eb558@JRWUBU2>
Message-ID: <20140408003932.66fb779c@JRWUBU2>

On Sat, 5 Apr 2014 10:12:10 -0700
Markus Scherer wrote:

> On Sat, Apr 5, 2014 at 9:30 AM, Richard Wordingham <
> richard.wordingham at ntlworld.com> wrote:
>
> > > In CLDR and ICU, the rules specify the set of characters that need
> > > dictionary support. (It's triggered by script, not by language.)
> >
> > In CLDR, which rules are these?

> I think it's
> \p{Line_Break=Complex_Context}
> which you can find in the line-break rules in
> http://unicode.org/cldr/trac/browser/trunk/common/segments/root.xml

If the dictionary is chosen only by script and not by language, then the design of ICU is currently broken as far as minority languages are concerned. I can't see how a Thai dictionary and a Northern or NE Thai dictionary can co-exist. (The usual script for writing these languages is the Thai script, despite attempts to reinvigorate old regional scripts.)

Going back to the CLDR level, there's another complexity. Good Thai typography inserts a space before U+0E46 THAI CHARACTER MAIYAMOK, and does not break lines before the U+0E46. It may be possible to fix the line breaking by a rule something like "× \u0e46". The sequence should usually be considered the end of a word - the truth of Line_Break=Complex_Context can vary within a word. (There are a few dictionary entries where occurs within the non-compound lexical item - U+0E46 is then also followed by a space.)

I haven't yet experimented with these rules in ICU.
Might these tweaks work? Would tailoring Thai characters not to be Line_Break=Complex_Context succeed in disabling the use of the Thai dictionary for a locale? The following rule in root.xml diminishes hope:

[$AI $AL $XX $SA $SG]

In all the examples of Pali I've seen in the Thai script, words are separated by spaces.

I think U+0E46 should be Line_Break=Exclamation. Now some people get round the problem by omitting the space but starting the glyph of mai yamok with a space. ICU does this with words that end in mai yamok - there is no preceding space character. When looking at serials in Thai magazines, I've noticed that spaces are omitted before question and exclamation marks when there is a risk of justification moving them onto the next line. I suspect the rule "× EX" is often not implemented. It is possible that changing the line break property of mai yamok could inconvenience these people - removing from the end of a word in the (Thai) Royal Institute Dictionary does not always yield a word.

The immediate consequence of all this is that changing the inheritance rules for segmentation would only be depriving certain people of a benefit they probably don't yet have.

> In addition, in the current ICU implementation (I am not sure about
> the LDML spec), an empty base-language file means we find something
> and don't go through the default locale.

Formally, that looks like a non-compliance!

Richard.

From rxaviers at gmail.com Thu Apr 17 06:41:39 2014
From: rxaviers at gmail.com (Rafael Xavier)
Date: Thu, 17 Apr 2014 08:41:39 -0300
Subject: CLDR JSON CDN?
Message-ID:

Hello fellows,

Unicode hosts a copy of the latest CLDR JSONs at its repository trunk http://www.unicode.org/repos/cldr-aux/json/25/main/. I guess this URL is meant for download, not for direct usage (i.e. as a CDN), right?

Is there any official CDN for CLDR JSONs?
Thanks
--
+55 (16) 8138-1583, skype: rxaviers
http://rafael.xavier.blog.br

From emmo at us.ibm.com Thu Apr 17 08:32:16 2014
From: emmo at us.ibm.com (John Emmons)
Date: Thu, 17 Apr 2014 08:32:16 -0500
Subject: CLDR JSON CDN?
In-Reply-To:
References:
Message-ID:

No, there isn't, and we don't really plan to provide one.

Regards,

John C. Emmons
Globalization Architect & Unicode CLDR TC Chairman
IBM Software Group
Internet: emmo at us.ibm.com

From markus.icu at gmail.com Fri Apr 18 16:41:28 2014
From: markus.icu at gmail.com (Markus Scherer)
Date: Fri, 18 Apr 2014 14:41:28 -0700
Subject: CLDR/ICU proposal: collation rules for import only
Message-ID:

Dear CLDR & ICU teams & users,

Summary: I propose that we distinguish for-import-only rules from create-a-sort-order rules via a naming convention rather than flags in the data.

Details:

In collation rules, we can "import" the rules of another tailoring. For example, common/collation/bs.xml has .

We want to extend this by writing partial rules that are not intended as their own sort orders but only for import into other rules.
See http://cldr.unicode.org/development/development-process/design-proposals/collation-additions#TOC-Collation-Import and http://unicode.org/cldr/trac/ticket/3949

The idea was to use in CLDR, and I see that that attribute exists in common/dtd/ldml.dtd but it is marked as deprecated, and it is not documented in the LDML collation spec. In ICU we would turn it into something like NoBinary{""} (http://bugs.icu-project.org/trac/ticket/8082).

However, we also want to suppress such for-import-only rules from the lists of "available" keyword values and collators (http://bugs.icu-project.org/trac/ticket/8983). If we did this via a data flag, then we would have to load the data before we can find out that we want to exclude it from the list.

In addition, collation types are normally added to the common/bcp47/collation.xml file. This is undesirable for what are really internal identifiers. We don't want to advertise them as available, *we don't want to collect display names for them*, and we don't want to have to keep them stable.

I have a simpler proposal:

- I propose that we use a naming convention to distinguish for-import-only rules.
- I propose that the first character of the collation type be digit '0' if and only if the rules are only to be used for import, not for establishing complete sort orders nor creating collators.
- We would not need an XML attribute, nor an ICU resource bundle entry, nor would we add such types into bcp47/collation.xml.

For example, we might create a type="0kana" tailoring that would be imported into the Japanese standard and unihan tailorings; and we might create a type="0pinyin" tailoring that would be imported into the Chinese pinyin and unihan tailorings.

Please let me know if you disagree.

Sincerely,
markus
--
Google Internationalization Engineering
From richard.wordingham at ntlworld.com Mon Apr 21 04:23:13 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 21 Apr 2014 10:23:13 +0100
Subject: More Plural Categories?
Message-ID: <20140421102313.45e047dc@JRWUBU2>

I fear I've found a need for more plural categories. I was running my own English language data exploration program and came across the following grammatical error in my output:

'... is a 11-element table.'

This fragment should, of course, have been

'... is an 11-element table.'

I'd not noticed this issue before; perhaps I'd been sensitised by pondering the production of the Latin locale.

Does the 'other' category need to have a category extracted for numbers that start with vowels? These numbers would be something like i in 11, 18, 80..89, 800..899, 1100..1199, 1800..1899, 8000..8999, 11000..11999, 18000..18999, 80000..89999, 800000..899999

I don't see a nice way of carrying it on beyond a million. There may well be national variation in the validity of the 1100..1199 and 1800..1899 ranges.

This complication will extend to quite a few languages.

Are negative numbers supposed to be supported? Negative numbers belong to the 'other' category in English, but CLDR seems to put -1 in the 'one' category for English. There seems to be a subtle dependency on whether the word 'minus' denotes a relative value or an absolute value.

The Welsh numbers are complicated enough for natural numbers. They deviate from taking the unmutated singular noun as follows:

zero: plural form for nouns
one: Soft mutation for feminine nouns
two: Soft mutation for all nouns
few (i.e. 3): Spirant mutation for masculine nouns
many (i.e. 6): Spirant mutation for all nouns
other: No mutation

However, it is not quite as simple as that, even ignoring the argument that Welsh ought to be localised. The complication arises with the numerative forms of _blwyddyn_ 'year', namely _blynedd_ 'years' and _blwydd_ 'years old'.
While in general they unusually take the nasal mutation for 'other' (yielding _mlynedd_ and _mlwydd_), the standard form for '4 years' is 'pedair blynedd', with no mutation! 'Pedair blwydd' is the standard form for '4 years old', though 'pedair mlwydd' is quite common. This makes a seventh category, for '4', but only significant with _blynedd_ and, less so, _blwydd_, and archaic diction with _diwrnod_ 'day'.

Welsh may precede numbers by the definite article as English does, so there is variation between _y_ and _yr_ depending on whether the following number starts with a vowel or not. This splits 'other' much as in English, with the complication that Welsh has both vigesimal and decimal systems - see http://en.wikipedia.org/wiki/Welsh_numerals for a quick summary. The RBNF rules have gone for the decimal system. Apparently the choice between the two systems is affected by what is being counted.

Possibly the words for 'year' should be special-cased - it seems to have exceptional usage with numbers in several languages. For example, in Thai, the ages of children should be expressed using ขวบ (tr. 'khuap') instead of ปี (tr. 'pi') as the word for 'year'.

Talking of Thai, although usage seems quite variable, there is a rule that the number for 'one' should follow the classifier rather than precede it like other numbers. Does this justify Thai having a separate category 'one'? (At present, it just has the sole category 'other'.) Possibly this is covered by the advice to consider special-casing 0 and 1 anyway. There are several cases in Thai where the numeral '1' normally disappears in speech, e.g. times of the day. I am also wondering if the existence of what are translated as plural forms of the demonstrative adjectives calls for a separate category 'one' in Thai. Possibly one can just avoid using these plural forms when the number of items (one v. more than one) is not known beforehand.

Richard.
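
The ranges listed in the message above can be written as a small predicate. The following is only a sketch in Python, not anything CLDR provides: the function names and the read_as_hundreds flag are hypothetical, the digit 8 on its own is added here although the message does not list it, and the sketch stops below one million, as the message does.

```python
def number_takes_an(n: int, read_as_hundreds: bool = True) -> bool:
    """Return True if the spoken English form of n starts with a vowel sound.

    Hypothetical helper covering 0 <= n < 1_000_000 only.
    read_as_hundreds controls whether 1100..1199 and 1800..1899 are read as
    "eleven hundred ..." / "eighteen hundred ..." (the national variation
    mentioned in the message) rather than "one thousand ...".
    """
    if not 0 <= n < 1_000_000:
        raise ValueError("this sketch only covers 0..999999")
    digits = str(n)
    # "eight", "eighty", "eight hundred", "eight thousand", "eighty thousand", ...
    if digits[0] == "8":
        return True
    # "eleven"/"eighteen" (2 digits) and "eleven/eighteen thousand" (5 digits),
    # plus "eleven/eighteen hundred" (4 digits) when that reading is accepted.
    lengths = (2, 4, 5) if read_as_hundreds else (2, 5)
    return digits[:2] in ("11", "18") and len(digits) in lengths


def article_for(n: int, read_as_hundreds: bool = True) -> str:
    """Pick the English indefinite article to put before the numeral n."""
    return "an" if number_takes_an(n, read_as_hundreds) else "a"


print(f"... is {article_for(11)} 11-element table.")
print(f"... is {article_for(12)} 12-element table.")
```

Whether a distinction like this belongs in the plural categories or in a separate mechanism is exactly the question raised in the thread; the sketch only shows that the rule is mechanical within the stated range.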

From verdy_p at wanadoo.fr Mon Apr 21 05:21:56 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 21 Apr 2014 12:21:56 +0200
Subject: More Plural Categories?
In-Reply-To: <20140421102313.45e047dc@JRWUBU2>
References: <20140421102313.45e047dc@JRWUBU2>
Message-ID:

This is not a question of determining the plural form; it is completely orthogonal. It is a phonological mutation that can apply to lots of word pairs, and it is sometimes (not always) extended to the orthography. The rules are extremely complex but do not depend on plurals, for example:

* In English you have "an egg" vs. "a chicken" (before a noun starting with a vowel), "a year" or "a yellow car" ("y" starting a noun or adjective is considered a consonant here)
* In French the mutation of the nasal to a denasalized vowel + /n/ consonant in "un enfant" occurs before a vowel (or a mute "h") starting the next noun or adjective, but does not influence the orthography. There are also cases of mutation by elision of a final mute "e", replaced by an apostrophe (also in Italian), before a noun, adjective or verb starting with a vowel or mute "h", but there are exceptions ("un enfant de onze ans" and usually not "d'onze ans", but "un enfant d'un an" and usually not "un enfant de un an").
* many examples in many languages much more complex than English or French

Such phonological and sometimes orthographic/grammatical mutations are not suitable for inclusion in plural rules; they do not depend (only) on the value of numbers when they are present.

2014-04-21 11:23 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>:

> I fear I've found a need for more plural categories. I was
> running my own English language data exploration program and came across
> the following grammatical error in my output:
>
> '... is a 11-element table.'
>
> This fragment should, of course, have been
>
> '... is an 11-element table.'
>
> I'd not noticed this issue before; perhaps I'd been sensitised by
> pondering the production of the Latin locale.
>
> Does the 'other' category need to have a category extracted for
> numbers that start with vowels? These numbers would be something like
>
> i in 11, 18, 80..89, 800..899,
> 1100..1199, 1800..1899, 8000..8999, 11000..11999, 18000..18999,
> 80000..89999, 800000..899999
>
> I don't see a nice way of carrying it on beyond a million. There may
> well be national variation in the validity of the 1100..1199 and
> 1800..1899 ranges.
>
> This complication will extend to quite a few languages.
>
> Are negative numbers supposed to be supported? Negative numbers belong
> to the 'other' category in English, but CLDR seems to put -1 in the
> 'one' category for English. There seems to be a subtle dependency on
> whether the word 'minus' denotes a relative value or an absolute value.
>
> The Welsh numbers are complicated enough for natural numbers. They
> deviate from taking the unmutated singular noun as follows:
>
> zero: plural form for nouns
> one: Soft mutation for feminine nouns
> two: Soft mutation for all nouns
> few (i.e. 3): Spirant mutation for masculine nouns
> many (i.e. 6): Spirant mutation for all nouns
> other: No mutation
>
> However, it is not quite as simple as that, even ignoring the argument
> that Welsh ought to be localised. The complication arises with the
> numerative forms of _blwyddyn_ 'year', namely _blynedd_ 'years' and
> _blwydd_ 'years old'. While in general they unusually take the nasal
> mutation for 'other' (yielding _mlynedd_ and _mlwydd_), the standard
> form for '4 years' is 'pedair blynedd', with no mutation! 'Pedair
> blwydd' is the standard form for '4 years old', though 'pedair mlwydd'
> is quite common. This makes a seventh category, for '4', but only
> significant with _blynedd_ and, less so, _blwydd_, and archaic diction
> with _diwrnod_ 'day'.
>
> Welsh may precede numbers by the definite article as English does, so
> there is variation between _y_ and _yr_ depending on whether the
> following number starts with a vowel or not. This splits 'other' much
> as in English, with the complication that Welsh has both vigesimal and
> decimal systems - see http://en.wikipedia.org/wiki/Welsh_numerals for a
> quick summary. The RBNF rules have gone for the decimal system.
> Apparently the choice between the two systems is affected by what is
> being counted.
>
> Possibly the words for 'year' should be special-cased - it seems to
> have exceptional usage with numbers in several languages. For example,
> in Thai, the ages of children should be expressed using ขวบ (tr. 'khuap')
> instead of ปี (tr. 'pi') as the word for 'year'.
>
> Talking of Thai, although usage seems quite variable, there is a rule
> that the number for 'one' should follow the classifier rather than
> precede it like other numbers. Does this justify Thai having a
> separate category 'one'? (At present, it just has the sole
> category 'other'.) Possibly this is covered by the advice to consider
> special-casing 0 and 1 anyway. There are several cases in Thai where
> the numeral '1' normally disappears in speech, e.g. times of the day.
> I am also wondering if the existence of what are translated as plural
> forms of the demonstrative adjectives calls for a separate category
> 'one' in Thai. Possibly one can just avoid using these plural forms
> when the number of items (one v. more than one) is not known beforehand.
>
> Richard.
>
> _______________________________________________
> CLDR-Users mailing list
> CLDR-Users at unicode.org
> http://unicode.org/mailman/listinfo/cldr-users
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From richard.wordingham at ntlworld.com Mon Apr 21 08:27:49 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 21 Apr 2014 14:27:49 +0100
Subject: More Plural Categories?
In-Reply-To:
References: <20140421102313.45e047dc@JRWUBU2>
Message-ID: <20140421142749.0fc6db13@JRWUBU2>

On Mon, 21 Apr 2014 12:21:56 +0200 Philippe Verdy wrote:

> This is not a question for determining the plural form, it's
> completely orthogonal and is a phonologic mutation that can apply to
> lots of words pairs; sometimes (not always) extended to the
> orthography.

What do you think the origin of the Welsh categories is? The distinction between the Welsh categories two/few/many/other is in origin a phonological distinction, as is most of the distinction in numeric forms between 'one' and the others. For 'one' v. the other four, there are also the effects of the singular v. plural distinction, for example on accompanying demonstratives and referring pronouns.

> The rules are extremely complex but do not depend on
> plurals, for example:
> * In English you have "an egg" vs. "a chicken" (before a noun
> starting by a vowel), "a year" or "a yellow car" ("y" starting a noun
> or adjective is considered a consonant here)

The idea is that a program slotting these words into a frame would select a set of associated forms to be placed in the various positions. For English, the set would be at least the noun and the indefinite article. With numbers, there is the potential problem that the number of such sets is unbounded. The concept of the plural categories is that the number then selects one of no more than say six sets. For example, the general form of a question may be, 'You have selected 6 files; delete them?'. Based on the number, one has to select in English not only between 'files' and 'file' but also 'them' and 'it'. In some languages, there might be a 3-way choice of pronouns, and in some languages the value of the number may affect the various verbs.
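
The selection described in the paragraph above, where the number picks one complete set of forms for the frame, might look like this in Python. It is a minimal sketch for non-negative integers only; english_category, FRAMES and render are hypothetical names, and the two-way one/other split is a simplification of the English rule (negative numbers and fractions, discussed earlier in the thread, are deliberately left out).

```python
def english_category(n: int) -> str:
    """Map a non-negative integer to a plural category (simplified English rule)."""
    if n < 0:
        raise ValueError("negative numbers are outside this sketch")
    return "one" if n == 1 else "other"


# One whole frame per category, so the noun and the pronoun change together.
FRAMES = {
    "one": "You have selected {n} file; delete it?",
    "other": "You have selected {n} files; delete them?",
}


def render(n: int) -> str:
    """Fill the frame whose forms match the plural category of n."""
    return FRAMES[english_category(n)].format(n=n)


print(render(1))
print(render(6))
```

Keeping a full sentence per category, rather than splicing 'file'/'files' and 'it'/'them' separately, is what keeps the number of sets bounded by the number of categories.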

> * In French ... "un enfant de onze ans" and usually not "d'onze ans",
> but "un enfant d'un an" and usually not "un enfant de un an"...

Should not this be captured by CLDR?

> Such phonological and sometimes orthographic/grammatical mutations
> are not suitable for inclusion in plural rules, they do not depend
> (only) on the value of numbers when they are present.

One can select the form from the number. The only question is whether it would be better to apply a phonological rule to the composed form. If that were the decision, then CLDR ought to contain the transformation. However, in your example, it does not just involve a simple phonological rule; there is the difficult decision of whether to apply it.

Now, spelt out numbers in Sanskrit might be a good case for the mechanical application of sandhi.

Richard.

From emmo at us.ibm.com Mon Apr 21 10:51:34 2014
From: emmo at us.ibm.com (John Emmons)
Date: Mon, 21 Apr 2014 10:51:34 -0500
Subject: CLDR/ICU proposal: collation rules for import only
In-Reply-To:
References:
Message-ID:

I would prefer that we have an attribute for it, so that it is crystal clear to everyone exactly what is going on. I really don't like the idea of "0" + ruleset naming convention.

We have a similar situation in the RBNF rules. There we use:

I would think that the most logical thing would be to extend the use of the access attribute, such that we have:

Regards,

John C. Emmons
Globalization Architect & Unicode CLDR TC Chairman
IBM Software Group
Internet: emmo at us.ibm.com

From: Markus Scherer
To: "cldr-users at unicode.org" , icu-design ,
Date: 04/18/2014 04:44 PM
Subject: CLDR/ICU proposal: collation rules for import only
Sent by: "CLDR-Users"

Dear CLDR & ICU teams & users,

Summary: I propose that we distinguish for-import-only rules from create-a-sort-order rules via a naming convention rather than flags in the data.

Details:

In collation rules, we can "import" the rules of another tailoring. For example, common/collation/bs.xml has .
We want to extend this by writing partial rules that are not intended as their own sort orders but only for import into other rules. See http://cldr.unicode.org/development/development-process/design-proposals/collation-additions#TOC-Collation-Import and http://unicode.org/cldr/trac/ticket/3949

The idea was to use in CLDR, and I see that that attribute exists in common/dtd/ldml.dtd but it is marked as deprecated, and it is not documented in the LDML collation spec. In ICU we would turn it into something like NoBinary{""} ( http://bugs.icu-project.org/trac/ticket/8082).

However, we also want to suppress such for-import-only rules from the lists of "available" keyword values and collators ( http://bugs.icu-project.org/trac/ticket/8983). If we did this via a data flag, then we would have to load the data before we can find out that we want to exclude it from the list.

In addition, collation types are normally added to the common/bcp47/collation.xml file. This is undesirable for what are really internal identifiers. We don't want to advertise them as available, we don't want to collect display names for them, and we don't want to have to keep them stable.

I have a simpler proposal:
- I propose that we use a naming convention to distinguish for-import-only rules.
- I propose that the first character of the collation type be digit '0' if and only if the rules are only to be used for import, not for establishing complete sort orders nor creating collators.
- We would not need an XML attribute, nor an ICU resource bundle entry, nor would we add such types into bcp47/collation.xml.

For example, we might create a type="0kana" tailoring that would be imported into the Japanese standard and unihan tailorings; and we might create a type="0pinyin" tailoring that would be imported into the Chinese pinyin and unihan tailorings.

Please let me know if you disagree.

Sincerely,
markus
--
Google Internationalization Engineering
_______________________________________________
CLDR-Users mailing list
CLDR-Users at unicode.org
http://unicode.org/mailman/listinfo/cldr-users
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: graycol.gif
Type: image/gif
Size: 105 bytes
Desc: not available
URL:

From markus.icu at gmail.com Mon Apr 21 12:14:41 2014
From: markus.icu at gmail.com (Markus Scherer)
Date: Mon, 21 Apr 2014 10:14:41 -0700
Subject: CLDR/ICU proposal: collation rules for import only
In-Reply-To:
References:
Message-ID:

On Mon, Apr 21, 2014 at 8:51 AM, John Emmons wrote:

> I would prefer that we have an attribute for it, so that it is crystal
> clear to everyone exactly what is going on. I really don't like the idea
> of "0" + ruleset naming convention.
>

Well, the attribute approach has problems, as I said:
- I don't want to have to load the data just to find out if it's "available".
- I want it to be clear which collation types we add to bcp47/collation.xml and which we don't.
- I want it to be clear for which collation types to collect display names.

If the CLDR committee feels strongly, then maybe we can use both an attribute and a naming convention, and make sure that they are used together (both or neither).

> We have a similar situation in the RBNF rules. There we use:
>
>
>
> I would think that the most logical thing would be to extend the use of
> the access attribute, such that we have:
>
>
>

Well, is deprecated and not used any more.

The design doc says

However, if it's an attribute, then it should really be on the element -- and I don't care if it's or .

Or maybe it should be an element, to avoid adding a non-distinguishing attribute .

(All change collation behavior, but "private" is something totally different.)

markus
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mark at macchiato.com Mon Apr 21 14:19:08 2014
From: mark at macchiato.com (Mark Davis ☕️)
Date: Mon, 21 Apr 2014 22:19:08 +0300
Subject: [icu-design] CLDR/ICU proposal: collation rules for import only
In-Reply-To:
References:
Message-ID:

I agree with John that 0kana is obscure. I also prefer the private attribute.

On the other hand, I can also see a convention that makes it easier to know that something is private. That would make some of the RBNF rules clearer, for example.

What I suggest is both the attribute and naming convention (and a test to ensure they match). But 0 is way too ugly. My suggestions would be along the following lines:

<foo type="_foobar" access="private">

That is, _x signals private. This follows the convention that some people follow for _x being a local variable.

<foo type="private_foobar" access="private">

This convention would make it *very* clear what was expected to be private!

foo would be rbnf, collation, transliteration, etc.

{phone}

On Apr 21, 2014 8:15 PM, "Markus Scherer" wrote:

> On Mon, Apr 21, 2014 at 8:51 AM, John Emmons wrote:
>
>> I would prefer that we have an attribute for it, so that it is crystal
>> clear to everyone exactly what is going on. I really don't like the idea
>> of "0" + ruleset naming convention.
>>
> Well, the attribute approach has problems, as I said:
> - I don't want to have to load the data just to find out if it's
> "available".
> - I want it to be clear which collation types we add to
> bcp47/collation.xml and which we don't.
> - I want it to be clear for which collation types to collect display names.
>
> If the CLDR committee feels strongly, then maybe we can use both an
> attribute and a naming convention, and make sure that they are used
> together (both or neither).
>
>> We have a similar situation in the RBNF rules.
There we use:
>>
>>
>>
>> I would think that the most logical thing would be to extend the use of
>> the access attribute, such that we have:
>>
>>
>>
> Well, is deprecated and not used any more.
>
> The design doc says
>
> However, if it's an attribute, then it should really be on the
> element -- and I don't care if it's
> or .
>
> Or maybe it should be an element, to avoid adding a non-distinguishing
> attribute
> .
>
> (All change collation behavior, but "private" is something
> totally different.)
>
> markus
>
> _______________________________________________
> icu-design mailing list
> icu-design at lists.sourceforge.net
> To Un/Subscribe: https://lists.sourceforge.net/lists/listinfo/icu-design
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From markus.icu at gmail.com Mon Apr 21 17:08:04 2014
From: markus.icu at gmail.com (Markus Scherer)
Date: Mon, 21 Apr 2014 15:08:04 -0700
Subject: [icu-design] CLDR/ICU proposal: collation rules for import only
In-Reply-To:
References:
Message-ID:

On Mon, Apr 21, 2014 at 12:19 PM, Mark Davis ☕️ wrote:

> What I suggest is both the attribute and naming convention (and a test to
> ensure they match). But 0 is way too ugly. My suggestions would be along
> the following lines:
>

I picked a prefix '0' because I assume that even these internal types need to be valid in language tags. At least in ICU we assemble something like sr_Latn's into syntax with a language tag like [import hr-u-co-search]. Therefore, the type needs to be a valid subtag, with [a-z0-9] and at most 8 characters.
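
The constraint just stated can be checked mechanically. The sketch below is in Python and takes the rule exactly as worded in this message (characters from [a-z0-9], at most 8 of them); the minimum length of 1 is an assumption of the sketch, and the function names are hypothetical rather than any ICU or CLDR API. It also shows why a '0' prefix lets import-only types be filtered out by name alone, without loading any data.

```python
import re

# Subtag shape as worded in the message: [a-z0-9], at most 8 characters.
# The lower bound of 1 is an assumption made for this sketch.
_TYPE_SUBTAG = re.compile(r"[a-z0-9]{1,8}")


def is_valid_type_subtag(collation_type: str) -> bool:
    """True if the type could be used as a single language-tag subtag."""
    return _TYPE_SUBTAG.fullmatch(collation_type) is not None


def is_import_only(collation_type: str) -> bool:
    """Naming convention from the proposal: a leading '0' marks import-only rules."""
    return collation_type.startswith("0")


def available_types(all_types: list[str]) -> list[str]:
    """List the types to advertise, deciding from the names only."""
    return [t for t in all_types if not is_import_only(t)]


for candidate in ("0kana", "_foobar", "private_foobar"):
    print(candidate, is_valid_type_subtag(candidate))
print(available_types(["standard", "0kana", "unihan", "0pinyin"]))
```

Under this check "0kana" passes, while "_foobar" and "private_foobar" fail because of the underscore (and, for the latter, the length), which is the objection raised in the message.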
I agree that '0' is not pretty, but it seemed like the best possible prefix given the constraints, and given that none of the existing types begins with a digit.

Also, as I said in my previous email, if y'all do want more than a naming convention, then it should probably be an element, not a non-distinguishing attribute.

markus
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From markus.icu at gmail.com Wed Apr 23 11:01:40 2014
From: markus.icu at gmail.com (Markus Scherer)
Date: Wed, 23 Apr 2014 09:01:40 -0700
Subject: [icu-design] CLDR/ICU proposal: collation rules for import only
In-Reply-To:
References:
Message-ID:

In CLDR team discussion today we settled on a more obvious, less "ugly" naming convention, using a two-part type that turns into two language subtags.

In CLDR data:

... ...

which in ICU would turn into

[import ja-u-co-private-kana]

Multi-part keyword values are already used for ca (calendar type e.g. islamic-tbla), kr (script reordering), vt (deprecated variableTop) and maybe more.

Thanks for the feedback!
markus
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
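
As an illustration of the two-part convention settled on above, the following Python sketch builds the import syntax from a language and a collation type and recognises private types by their first subtag. The helper names are hypothetical, the per-subtag check reuses the "[a-z0-9], at most 8 characters" wording from earlier in the thread, and nothing here is taken from the ICU sources.

```python
import re

# Per-subtag shape as worded earlier in the thread; minimum length 1 is assumed.
_SUBTAG = re.compile(r"[a-z0-9]{1,8}")


def is_private_type(collation_type: str) -> bool:
    """True for two-part types such as 'private-kana' (first subtag 'private')."""
    parts = collation_type.split("-")
    return len(parts) > 1 and parts[0] == "private"


def import_rule(language: str, collation_type: str) -> str:
    """Build import syntax such as '[import ja-u-co-private-kana]'."""
    for part in collation_type.split("-"):
        if _SUBTAG.fullmatch(part) is None:
            raise ValueError(f"not a valid subtag: {part!r}")
    return f"[import {language}-u-co-{collation_type}]"


print(import_rule("ja", "private-kana"))
print(import_rule("hr", "search"))
print(is_private_type("private-kana"), is_private_type("search"))
```

Because each part is a valid subtag on its own, "private-kana" stays legal in a language tag while remaining recognisable by name, which addresses both the readability objection and the need to filter without loading data.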