From cldr-users at unicode.org Wed Nov 1 10:30:48 2017
From: cldr-users at unicode.org (Yoshito Umaoka via CLDR-Users)
Date: Wed, 1 Nov 2017 10:30:48 -0500
Subject: Metazone timestamps
In-Reply-To:
References:
Message-ID:

> In http://www.unicode.org/reports/tr35/tr35-dates.html#Metazone_Names
> there is an example:
>
> mzone="America_Eastern"/>
>
> It also states: "Note that the dates and times are specified in UTC,
> not local time."
>

Correct.

> As currently presented, they appear to be local time. Is there a
> reason these aren't just expressed as ISO 8601 timestamps with a
> trailing Z? In other words: "1991-10-27T07:00Z"
>
> Note that ISO 8601 allows for a space instead of a T, so these
> timestamps are ISO 8601 compliant, but also that in absence of a Z
> or an offset they are required to be interpreted as local time.

I guess the original author/designer of this data did not really care
about ISO 8601 conformance.

> I've certainly made this mistake in interpreting them before. I'm
> sure others have as well. Would it be possible to change this in a
> future version?
>

I actually felt the same when I started maintaining the data.
Personally, I'm fine to change the date/time string to be ISO UTC time
format (and use the default date/time separator - 'T'), then update the
spec to explain it's ISO 8601 UTC date/time format explicitly. But, this
change may break existing code utilizing this data, so we probably need
to discuss this in CLDR TC.

Anyway, can you file a CLDR ticket?

> Thanks,
> Matt Johnson
> Microsoft
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From cldr-users at unicode.org Wed Nov 1 14:47:40 2017
From: cldr-users at unicode.org (Matt Johnson (AZURE) via CLDR-Users)
Date: Wed, 1 Nov 2017 19:47:40 +0000
Subject: Metazone timestamps
In-Reply-To:
References:
Message-ID:

Ticket filed here. Thanks.
https://unicode.org/cldr/trac/ticket/10742
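
For anyone consuming the data as it is published today, the point made in this thread (the values are UTC even though they carry no "Z") can be sketched as follows. This is only an illustration in Python: the function names are hypothetical and not part of CLDR or any library, and the sample value is the timestamp quoted in the thread.

```python
from datetime import datetime, timezone

def parse_metazone_timestamp(value: str) -> datetime:
    """Read a metazone from/to value such as "1991-10-27 07:00".

    TR35 states these values are UTC, so the result is tagged as UTC
    rather than being treated as local time.
    """
    naive = datetime.strptime(value, "%Y-%m-%d %H:%M")
    return naive.replace(tzinfo=timezone.utc)

def to_iso_utc(moment: datetime) -> str:
    """Format an aware datetime in the ISO 8601 form proposed in the
    thread: 'T' separator and a trailing 'Z'."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%MZ")

if __name__ == "__main__":
    print(to_iso_utc(parse_metazone_timestamp("1991-10-27 07:00")))
```

Attaching the UTC zone at parse time avoids the local-time misreading described above, whichever textual form the data ends up using.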

From cldr-users at unicode.org Wed Nov 8 04:27:43 2017
From: cldr-users at unicode.org (Elsebeth Flarup via CLDR-Users)
Date: Wed, 08 Nov 2017 05:27:43 -0500
Subject: Comparison table with pre-CLDR locales
Message-ID:

I am preparing a CLDR training session, and as part of that I would like
to include a few specific examples of the differences that existed
between locale formats on various vendor-specific platforms prior to the
widespread adoption of CLDR.

I am fairly sure I remember an online table listing the locale formats
from AIX, Sun Solaris, etc. that were collected at the start of the CLDR
project, but I have been unable to find that table now. Does anybody
remember the table, and where it was located? Alternatively, does
anybody know of any other source that would have that kind of historical
data?

Thanks,
Elsebeth

From cldr-users at unicode.org Thu Nov 9 05:15:10 2017
From: cldr-users at unicode.org (Mark Davis ☕️ via CLDR-Users)
Date: Thu, 9 Nov 2017 12:15:10 +0100
Subject: Comparison table with pre-CLDR locales
In-Reply-To:
References:
Message-ID:

I vaguely remember something like that, but I think those were just when
we were getting started, and I don't find anything like that in our
repository.

Mark

On Wed, Nov 8, 2017 at 11:27 AM, Elsebeth Flarup via CLDR-Users
<cldr-users at unicode.org> wrote:

> I am preparing a CLDR training session, and as part of that I would like
> to include a few specific examples of the differences that existed between
> locale formats on various vendor-specific platforms prior to the widespread
> adoption of CLDR.
>
> I am fairly sure I remember an online table listing the locale formats
> from AIX, Sun Solaris, etc. that were collected at the start of the CLDR
> project, but I have been unable to find that table now. Does anybody
> remember the table, and where it was located?
> Alternatively, does anybody know of any other source that would have
> that kind of historical data?
>
> Thanks,
> Elsebeth
>
> _______________________________________________
> CLDR-Users mailing list
> CLDR-Users at unicode.org
> http://unicode.org/mailman/listinfo/cldr-users

From cldr-users at unicode.org Thu Nov 9 11:12:13 2017
From: cldr-users at unicode.org (Jan Lana via CLDR-Users)
Date: Thu, 09 Nov 2017 18:12:13 +0100
Subject: Comparison table with pre-CLDR locales
In-Reply-To:
References:
Message-ID: <151024753325.808.7567282285130647465@votice>

Quoting Elsebeth Flarup via CLDR-Users (2017-11-08 11:27:43)
> I am preparing a CLDR training session, and as part of that I would like to
> include a few specific examples of the differences that existed between locale
> formats on various vendor-specific platforms prior to the widespread adoption
> of CLDR.
>
> I am fairly sure I remember an online table listing the locale formats from
> AIX, Sun Solaris, etc. that were collected at the start of the CLDR project,
> but I have been unable to find that table now. Does anybody remember the table,
> and where it was located? Alternatively, does anybody know of any other source
> that would have that kind of historical data?

Solaris migrated most of locales to CLDR in Solaris 10 Update Release 4
in 2007
(https://docs.oracle.com/cd/E19957-01/820-2714/6nea26qkb/index.html#gevhv)
But there are no public comparison reports from the time as far I know.

Is there any specific information you try to find?

regards,
- Jan Lana

From cldr-users at unicode.org Fri Nov 10 04:47:26 2017
From: cldr-users at unicode.org (Elsebeth Flarup via CLDR-Users)
Date: Fri, 10 Nov 2017 05:47:26 -0500
Subject: Comparison table with pre-CLDR locales
In-Reply-To: <151024753325.808.7567282285130647465@votice>
References: <151024753325.808.7567282285130647465@votice>
Message-ID:

Thanks!
I am not looking for any specific locale category or vendor for that
matter. I just wanted to include a few of the most glaring differences
between platforms that existed before CLDR as an illustration of the
fragmentation at that time.

I am fairly certain that the CLDR project started by collecting a
snapshot of the data from as many platforms as possible, and then went
through a process of converging on the most common formats for each
locale at the time. I believe the comparison table I remember was either
used during that process, or was at least a side product of it. I think
it was up for several years (I remember referring to it a number of
times). Unfortunately I don't even remember the URL used at the
beginning of the CLDR project, otherwise the Wayback Machine might be
able to help.

Thanks,
Elsebeth

> -------- Original Message --------
> Subject: Re: Comparison table with pre-CLDR locales
> Local Time: November 9, 2017 6:12 PM
> UTC Time: November 9, 2017 5:12 PM
> From: jan.lana at oracle.com
> To: cldr-users at unicode.org, Elsebeth Flarup
>
> Quoting Elsebeth Flarup via CLDR-Users (2017-11-08 11:27:43)
>
>> I am preparing a CLDR training session, and as part of that I would like to
>> include a few specific examples of the differences that existed between locale
>> formats on various vendor-specific platforms prior to the widespread adoption
>> of CLDR.
>> I am fairly sure I remember an online table listing the locale formats from
>> AIX, Sun Solaris, etc. that were collected at the start of the CLDR project,
>> but I have been unable to find that table now. Does anybody remember the table,
>> and where it was located? Alternatively, does anybody know of any other source
>> that would have that kind of historical data?
>>
>> Solaris migrated most of locales to CLDR in Solaris 10 Update Release 4
>> in 2007
>> (https://docs.oracle.com/cd/E19957-01/820-2714/6nea26qkb/index.html#gevhv)
>> But there are no public comparison reports from the time as far I know.
>>
>> Is there any specific information you try to find?
>>
>> regards,
>
> - Jan Lana

From cldr-users at unicode.org Fri Nov 10 10:05:29 2017
From: cldr-users at unicode.org (Steven R. Loomis via CLDR-Users)
Date: Fri, 10 Nov 2017 08:05:29 -0800
Subject: Comparison table with pre-CLDR locales
In-Reply-To:
References: <151024753325.808.7567282285130647465@votice>
Message-ID:

It's kind of a 'retrocomputing' project to get this to work again, but
here is a snippet of Arabic diffs from about cldr 1.6. Probably false
positives and negatives here due to breakage. Some interesting things
around date formats.

https://www.dropbox.com/s/q1i51n0bebvdbn0/cldr-16-diff-ar.zip?dl=0

On Fri, Nov 10, 2017 at 2:47 AM, Elsebeth Flarup via CLDR-Users
<cldr-users at unicode.org> wrote:

> Thanks!
> I am not looking for any specific locale category or vendor for that
> matter. I just wanted to include a few of the most glaring differences
> between platforms that existed before CLDR as an illustration of the
> fragmentation at that time.
>
> I am fairly certain that the CLDR project started by collecting a snapshot
> of the data from as many platforms as possible, and then went through a
> process of converging on the most common formats for each locale at the
> time. I believe the comparison table I remember was either used during that
> process, or was at least a side product of it. I think it was up for
> several years (I remember referring to it a number of times). Unfortunately
> I don't even remember the URL used at the beginning of the CLDR project,
> otherwise the Wayback Machine might be able to help.
>
> Thanks,
> Elsebeth
>
> -------- Original Message --------
> Subject: Re: Comparison table with pre-CLDR locales
> Local Time: November 9, 2017 6:12 PM
> UTC Time: November 9, 2017 5:12 PM
> From: jan.lana at oracle.com
> To: cldr-users at unicode.org, Elsebeth Flarup <eflarup at protonmail.ch>
>
> Quoting Elsebeth Flarup via CLDR-Users (2017-11-08 11:27:43)
>
> I am preparing a CLDR training session, and as part of that I would like to
> include a few specific examples of the differences that existed between
> locale formats on various vendor-specific platforms prior to the widespread
> adoption of CLDR.
> I am fairly sure I remember an online table listing the locale formats from
> AIX, Sun Solaris, etc. that were collected at the start of the CLDR project,
> but I have been unable to find that table now. Does anybody remember the
> table, and where it was located? Alternatively, does anybody know of any
> other source that would have that kind of historical data?
>
> Solaris migrated most of locales to CLDR in Solaris 10 Update Release 4
> in 2007
> (https://docs.oracle.com/cd/E19957-01/820-2714/6nea26qkb/index.html#gevhv)
> But there are no public comparison reports from the time as far I know.
>
> Is there any specific information you try to find?
>
> regards,
>
> - Jan Lana

From cldr-users at unicode.org Sat Nov 18 11:48:38 2017
From: cldr-users at unicode.org (Patrick Andries via CLDR-Users)
Date: Sat, 18 Nov 2017 12:48:38 -0500
Subject: Question on BCP 47
In-Reply-To: <5A08C8BD.9010002@i18n.ca>
References: <5A08C8BD.9010002@i18n.ca>
Message-ID: <865e38df-516d-2af5-8f0e-ad3caefea5ef@xcential.com>

In Windows, certain cultures have an "alternate sort" order (collation).
This list, according to the LCID reference, is:

LCID      Language tag   Collation type
0x1007F   x-IV-mathan    Math alphanumeric
0x10407   de-DE_phoneb   Phonebook
0x1040E   hu-HU_tchncl   Technical
0x10437   ka-GE_modern   Modern
0x20804   zh-CN_stroke   Stroke count
0x21404   zh-MO_stroke   Stroke count
0x21004   zh-SG_stroke   Stroke count
0x30404   zh-TW_pronun   Pronunciation
0x40404   zh-TW_radstr   Radical/stroke
0x40411   ja-JP_radstr   Radical/stroke
0x40C04   zh-HK_radstr   Radical/stroke
0x41404   zh-MO_radstr   Radical/stroke

Some of these "collations" have equivalent entries among the collation
identifiers defined in the "Unicode locale extension"
(see unicode.org/repos/cldr/tags/latest/common/bcp47/collation.xml):

Identifier  Description
big5han     Pinyin ordering for Latin, big5 charset ordering for CJK characters (used in Chinese)
compat      A previous version of the ordering, for compatibility
dict        Dictionary style ordering
ducet       The default Unicode collation element table order
emoji       Recommended ordering for emoji characters
eor         European ordering rules
gb2312      Pinyin ordering for Latin, gb2312han charset ordering for CJK characters (used in Chinese)
phonebk     Phonebook style ordering (such as in German)
phonetic    Phonetic ordering (sorting based on pronunciation)
pinyin      Pinyin ordering for Latin and for CJK characters (used in Chinese)
reformed    Reformed ordering (such as in Swedish)
search      Special collation type for string search
searchjl    Special collation type for Korean initial consonant search
standard    Default ordering for each language
stroke      Pinyin ordering for Latin, stroke order for CJK characters (used in Chinese)
trad        Traditional style ordering (such as in Spanish)
unihan      Pinyin ordering for Latin, Unihan radical-stroke ordering for CJK characters (used in Chinese)
zhuyin      Pinyin ordering for Latin, zhuyin order for Bopomofo and CJK characters (used in Chinese)

The question is thus the following: if one wants to create a BCP 47
string representing the locale and the options a user has chosen in a
Windows environment, one should be able to represent the "Windows
alternate sorts" in BCP 47 syntax.
Some such as "phoneb" have equivalent entries ("phonebk") but some
don't apparently.

If some Windows alternate sorts do not have equivalent entries, should
we request for these to be added to the CLDR, or rather use a "variant
tag", or yet use a "private use" tag in the BCP 47 format?

Patrick Andries

From cldr-users at unicode.org Sat Nov 18 12:26:57 2017
From: cldr-users at unicode.org (Philippe Verdy via CLDR-Users)
Date: Sat, 18 Nov 2017 19:26:57 +0100
Subject: Question on BCP 47
In-Reply-To: <865e38df-516d-2af5-8f0e-ad3caefea5ef@xcential.com>
References: <5A08C8BD.9010002@i18n.ca> <865e38df-516d-2af5-8f0e-ad3caefea5ef@xcential.com>
Message-ID:

Where are these "BCP47"-like codes documented on Windows (not really
conforming as collation subtags are encoded as if they were language
variant codes) as equivalent to the Windows internal LCID?

Maybe it's up to Microsoft to clean up its MSDN documentation and add
notes about legacy codes that should no longer be used in applications
and replaced by the registered BCP47 collation subtags. In BCP47,
collation codes should use the locale extension subtags.

Same question about legacy locale codes used in Unix/Linux (also using
non-conforming extensions such as "@charset").

2017-11-18 18:48 GMT+01:00 Patrick Andries via CLDR-Users
<cldr-users at unicode.org>:

>
> In Windows, certain cultures have an "alternate sort" order (collation).
> This list, according to the LCID reference, is:
>
> LCID      Language tag   Collation type
> 0x1007F   x-IV-mathan    Math alphanumeric
> 0x10407   de-DE_phoneb   Phonebook
> 0x1040E   hu-HU_tchncl   Technical
> 0x10437   ka-GE_modern   Modern
> 0x20804   zh-CN_stroke   Stroke count
> 0x21404   zh-MO_stroke   Stroke count
> 0x21004   zh-SG_stroke   Stroke count
> 0x30404   zh-TW_pronun   Pronunciation
> 0x40404   zh-TW_radstr   Radical/stroke
> 0x40411   ja-JP_radstr   Radical/stroke
> 0x40C04   zh-HK_radstr   Radical/stroke
> 0x41404   zh-MO_radstr   Radical/stroke
>
> Some of these "collation"s have equivalent entries among the collation
> identifiers defined in the "Unicode locale extension"
> (See: unicode.org/repos/cldr/tags/latest/common/bcp47/collation.xml):
>
> Identifier  Description
> big5han     Pinyin ordering for Latin, big5 charset ordering for CJK characters (used in Chinese)
> compat      A previous version of the ordering, for compatibility
> dict        Dictionary style ordering
> ducet       The default Unicode collation element table order
> emoji       Recommended ordering for emoji characters
> eor         European ordering rules
> gb2312      Pinyin ordering for Latin, gb2312han charset ordering for CJK characters (used in Chinese)
> phonebk     Phonebook style ordering (such as in German)
> phonetic    Phonetic ordering (sorting based on pronunciation)
> pinyin      Pinyin ordering for Latin and for CJK characters (used in Chinese)
> reformed    Reformed ordering (such as in Swedish)
> search      Special collation type for string search
> searchjl    Special collation type for Korean initial consonant search
> standard    Default ordering for each language
> stroke      Pinyin ordering for Latin, stroke order for CJK characters (used in Chinese)
> trad        Traditional style ordering (such as in Spanish)
> unihan      Pinyin ordering for Latin, Unihan radical-stroke ordering for
> CJK characters (used in Chinese)
> zhuyin      Pinyin ordering for Latin, zhuyin order for Bopomofo and CJK characters (used in Chinese)
>
> The question is thus the following: if one wants to create a BCP 47 string
> representing the locale and the options a user has chosen in a Windows
> environment, one should be able to represent the "Windows alternate sorts"
> in BCP 47 syntax. Some such as "phoneb" have equivalent entries ("phonebk")
> but some don't apparently.
>
> If some Windows alternate sorts do not have equivalent entries, should we
> request for these to be added to the CLDR, or rather use a "variant tag",
> or yet use a "private use" tag in the BCP 47 format?
>
> Patrick Andries

From cldr-users at unicode.org Sat Nov 18 13:43:43 2017
From: cldr-users at unicode.org (Markus Scherer via CLDR-Users)
Date: Sat, 18 Nov 2017 11:43:43 -0800
Subject: Question on BCP 47
In-Reply-To:
References: <5A08C8BD.9010002@i18n.ca> <865e38df-516d-2af5-8f0e-ad3caefea5ef@xcential.com>
Message-ID:

There is a mapping between Windows LCIDs and ICU locale IDs in the ICU
code:
http://bugs.icu-project.org/trac/browser/trunk/icu4c/source/common/locmap.cpp
Look for "@collation" there.

If there is something missing, then please submit an ICU ticket.

Best regards,
markus
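
To make the question concrete, here is a rough Python sketch of turning a Windows alternate-sort name into a BCP 47 tag with the Unicode locale extension key "co". The mapping table is an assumption drawn only from the two lists quoted above ("phoneb" against "phonebk", "stroke" against "stroke"); it is not taken from Windows or ICU, and the function name is hypothetical. The ICU locmap.cpp file referenced above remains the place to check real mappings.

```python
from typing import Optional

# Assumed correspondences, read off the two tables in this thread.
# Everything else is deliberately left unmapped.
ASSUMED_COLLATION_TYPES = {
    "phoneb": "phonebk",
    "stroke": "stroke",
}

def windows_sort_to_bcp47(name: str) -> Optional[str]:
    """Convert e.g. "de-DE_phoneb" to "de-DE-u-co-phonebk".

    Returns None when the alternate sort has no assumed equivalent, so
    the caller can decide how to handle it (the open question here).
    Names without an underscore are returned unchanged.
    """
    base, separator, sort_name = name.partition("_")
    if not separator:
        return name
    collation_type = ASSUMED_COLLATION_TYPES.get(sort_name)
    if collation_type is None:
        return None
    return f"{base}-u-co-{collation_type}"

if __name__ == "__main__":
    for sample in ("de-DE_phoneb", "zh-CN_stroke", "hu-HU_tchncl"):
        print(sample, "->", windows_sort_to_bcp47(sample))
```

Returning None for unmapped sorts keeps the sketch neutral on whether such cases should become new CLDR collation types or private-use subtags.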

From cldr-users at unicode.org Sat Nov 18 15:40:35 2017
From: cldr-users at unicode.org (Philippe Verdy via CLDR-Users)
Date: Sat, 18 Nov 2017 22:40:35 +0100
Subject: Question on BCP 47
In-Reply-To:
References: <5A08C8BD.9010002@i18n.ca> <865e38df-516d-2af5-8f0e-ad3caefea5ef@xcential.com>
Message-ID:

Line 375: {0x0491, "gd_GB"} // should have been "ga_GB"
Line 380: {0x0491, "gd_GB"} // OK

2017-11-18 20:43 GMT+01:00 Markus Scherer :

> There is a mapping between Windows LCIDs and ICU locale IDs in the ICU
> code:
> http://bugs.icu-project.org/trac/browser/trunk/icu4c/source/common/locmap.cpp
> Look for "@collation" there.
>
> If there is something missing, then please submit an ICU ticket.
>
> Best regards,
> markus

From cldr-users at unicode.org Sat Nov 18 18:35:21 2017
From: cldr-users at unicode.org (Doug Ewell via CLDR-Users)
Date: Sat, 18 Nov 2017 17:35:21 -0700
Subject: Question on BCP 47
In-Reply-To:
References:
Message-ID:

Philippe Verdy wrote:

> Line 375: {0x0491, "gd_GB"} // should have been "ga_GB"
> Line 380: {0x0491, "gd_GB"} // OK

Recte: line 375 should be {0x043c, "ga_GB"} // not 0x0491

--
Doug Ewell | Thornton, CO, US | ewellic.org

From cldr-users at unicode.org Sat Nov 25 19:07:35 2017
From: cldr-users at unicode.org (Kip Cole via CLDR-Users)
Date: Sun, 26 Nov 2017 09:07:35 +0800
Subject: Collation / Fractional UCA / Implicit Weights Questions
In-Reply-To:
References:
Message-ID: <0985440F-E0A6-460B-898D-0EFA21F957E4@gmail.com>

As part of my efforts to implement CLDR support for the Elixir language
I've now started work on collations and working my way through TR10 and
the relevant parts of TR35.

I have some questions on implicit weight calculation I'm unable to
resolve and would appreciate any help or pointers on:

(1) Unified Ideograph vs Radical

Is there a preferred or intended strategy - to use the Unified Ideograph
or radical definitions?

(2) Calculating implicit weights for radical definitions

TR10/TR35 seem quiet on the topic - my working assumption is to use the
[fixed first implicit byte E0] and [fixed last implicit byte E4] in
FractionalUCA.txt to generate implicit weights that respect the radical
order (left to right, top to bottom). Is that a reasonable working
principle?

(3) Implicit weight calculations in general

TR10 at https://www.unicode.org/reports/tr10/#Implicit_Weights will
generate weights with a top byte of 0xFB which would seem in conflict
with the [fixed first implicit byte E0] and [fixed last implicit byte
E4] indicators. My working assumption is to use the algorithm in TR10 to
calculate implicit weights except for radical definitions which would
use the [fixed first] and [fixed last]

This would seem to align with TR35 which says:

"Note: The particular primary lead bytes for Hani vs. IMPLICIT vs.
TRAILING are only an example" suggesting that Hani is calculated with
leading bytes 0xFB per TR10 and the [fixed first implicit] can be used
to generate weights for radicals (and other non specified code points)

Thanks in advance,
-Kip

From cldr-users at unicode.org Sun Nov 26 12:28:13 2017
From: cldr-users at unicode.org (Markus Scherer via CLDR-Users)
Date: Sun, 26 Nov 2017 10:28:13 -0800
Subject: Collation / Fractional UCA / Implicit Weights Questions
In-Reply-To: <0985440F-E0A6-460B-898D-0EFA21F957E4@gmail.com>
References: <0985440F-E0A6-460B-898D-0EFA21F957E4@gmail.com>
Message-ID:

On Sat, Nov 25, 2017 at 5:07 PM, Kip Cole via CLDR-Users
<cldr-users at unicode.org> wrote:

> As part of my efforts to implement CLDR support for the Elixir language
> I've now started work on collations and working my way through TR10 and the
> relevant parts of TR35.

Have you considered calling an existing library (e.g., ICU) from your
language runtime, rather than do this from scratch?

> I have some questions on implicit weight calculation I'm unable to resolve
> and would appreciate any help or pointers on:
>
> (1) Unified Ideograph vs Radical
>
> Is there a preferred or intended strategy - to use the Unified Ideograph
> or radical definitions?

This is a default, to be used when we don't know the language or desired
sort order. When one of the CJK languages is selected, the tailoring
provides a specific Han character order.

As such, you have a choice between the DUCET order, which can be
implemented with very minimal data, or the radical-stroke order, which
is a bit more meaningful but large (because it's a permutation of all of
the Han characters).

Each Han allocation block in Unicode, including the original one which
has almost all of the commonly used characters, is intended to have its
share of Han characters in radical-stroke order (although the allocation
is fixed, so mistakes cannot be corrected). That is, for most of the
common Han characters (those in the original part of the original
block), there should be little difference in the order. However, for
characters outside the original Unihan block, the DUCET order is not
useful.

> (2) Calculating implicit weights for radical definitions
>
> TR10/TR35 seem quiet on the topic - my working assumption is to use
> the [fixed first implicit byte E0] and [fixed last implicit byte E4] in
> FractionalUCA.txt to generate implicit weights that respect the radical
> order (left to right, top to bottom). Is that a reasonable working
> principle?

Yes, the radical-stroke data is intended to provide an order as listed.

We kept the E0..E4 lead byte range in FractionalUCA.txt as is for
stability. You can use more or fewer lead bytes. For ICU, I move the
implicit-weight lead bytes much higher, to make more room for large Han
tailorings.
You can choose your implicit-weight allocation freely because I changed
the primary weights of Han compatibility characters to refer to the Han
code points rather than hardcode their weights. (This is also why the
Han radical-stroke data comes first -- you can use a single-pass parser,
establish the Han order, and then look up their weights by code point.)
You just have to also move one or two "high" primary weights
accordingly, such as for U+FFFD.

> (3) Implicit weight calculations in general
>
> TR10 at https://www.unicode.org/reports/tr10/#Implicit_Weights will
> generate weights with a top byte of 0xFB which would seem in conflict with
> the [fixed first implicit byte E0] and [fixed last implicit byte E4]
> indicators. My working assumption is to use the algorithm in TR10 to
> calculate implicit weights except for radical definitions which would use
> the [fixed first] and [fixed last]

No, careful. The DUCET is published with 16-bit primary weights (and
some weights are pairs of 16-bit values). CLDR FractionalUCA.txt uses
primary weights of 1, 2, or 3 *bytes*. (ICU uses 4-byte weights for
unassigned-implicit weights, and in tailorings if needed.) They are
unrelated values, although they provide the same sort order (except for
the intentional CLDR reshufflings of some numerical symbols and such).

Conformance to the algorithms requires you to get the same order, but
does not require you to get the same sort keys.

> This would seem to align with TR35 which says:
>
> "Note: The particular primary lead bytes for Hani vs. IMPLICIT vs.
> TRAILING are only an example" suggesting that Hani is calculated with
> leading bytes 0xFB per TR10 and the [fixed first implicit] can be used to
> generate weights for radicals (and other non specified code points)

No, it refers to your freedom of choice of range and bit-distribution
algorithm, as for ICU as I said above.

> Thanks in advance, -Kip

Best regards,
markus

From cldr-users at unicode.org Sun Nov 26 20:02:04 2017
From: cldr-users at unicode.org (Kip Cole via CLDR-Users)
Date: Mon, 27 Nov 2017 13:02:04 +1100
Subject: Collation / Fractional UCA / Implicit Weights Questions
In-Reply-To:
References: <0985440F-E0A6-460B-898D-0EFA21F957E4@gmail.com>
Message-ID: <09B55748-A847-4F55-98AE-99D94E3FDB97@gmail.com>

Markus, thank you for your response. I will admit that a large part of
my motivation is to learn more about collation. The peculiarities of the
Erlang VM (upon which Elixir runs) make access to native libs
challenging but not impossible. Of course leveraging your work is the
canonical approach and it may be where I end up.

So now I understand better about the application of the radical data and
I need to decide where to place them. You note: "For ICU, I move the
implicit-weight lead bytes much higher, to make more room for large Han
tailorings. You can choose your implicit-weight allocation freely"

Where do you place them? (I know, I should read the code and I will but
the learning curve is steep!)

Regards,
-Kip

> On 27 Nov 2017, at 5:28 am, Markus Scherer wrote:
>
> On Sat, Nov 25, 2017 at 5:07 PM, Kip Cole via CLDR-Users wrote:
> As part of my efforts to implement CLDR support for the Elixir language
> I've now started work on collations and working my way through TR10 and
> the relevant parts of TR35.
>
> Have you considered calling an existing library (e.g., ICU) from your
> language runtime, rather than do this from scratch?
>
> I have some questions on implicit weight calculation I'm unable to
> resolve and would appreciate any help or pointers on:
>
> (1) Unified Ideograph vs Radical
>
> Is there a preferred or intended strategy - to use the Unified Ideograph
> or radical definitions?
>
> This is a default, to be used when we don't know the language or desired sort order.
> When one of the CJK languages is selected, the tailoring provides a
> specific Han character order.
>
> As such, you have a choice between the DUCET order, which can be
> implemented with very minimal data, or the radical-stroke order, which
> is a bit more meaningful but large (because it's a permutation of all of
> the Han characters).
>
> Each Han allocation block in Unicode, including the original one which
> has almost all of the commonly used characters, is intended to have its
> share of Han characters in radical-stroke order (although the allocation
> is fixed, so mistakes cannot be corrected). That is, for most of the
> common Han characters (those in the original part of the original
> block), there should be little difference in the order. However, for
> characters outside the original Unihan block, the DUCET order is not
> useful.
>
> (2) Calculating implicit weights for radical definitions
>
> TR10/TR35 seem quiet on the topic - my working assumption is to use the
> [fixed first implicit byte E0] and [fixed last implicit byte E4] in
> FractionalUCA.txt to generate implicit weights that respect the radical
> order (left to right, top to bottom). Is that a reasonable working
> principle?
>
> Yes, the radical-stroke data is intended to provide an order as listed.
>
> We kept the E0..E4 lead byte range in FractionalUCA.txt as is for
> stability. You can use more or fewer lead bytes. For ICU, I move the
> implicit-weight lead bytes much higher, to make more room for large Han
> tailorings. You can choose your implicit-weight allocation freely
> because I changed the primary weights of Han compatibility characters to
> refer to the Han code points rather than hardcode their weights. (This
> is also why the Han radical-stroke data comes first -- you can use a
> single-pass parser, establish the Han order, and then look up their
> weights by code point.) You just have to also move one or two "high"
> primary weights accordingly, such as for U+FFFD.
>
> (3) Implicit weight calculations in general
>
> TR10 at https://www.unicode.org/reports/tr10/#Implicit_Weights will
> generate weights with a top byte of 0xFB which would seem in conflict
> with the [fixed first implicit byte E0] and [fixed last implicit byte
> E4] indicators. My working assumption is to use the algorithm in TR10 to
> calculate implicit weights except for radical definitions which would
> use the [fixed first] and [fixed last]
>
> No, careful. The DUCET is published with 16-bit primary weights (and
> some weights are pairs of 16-bit values). CLDR FractionalUCA.txt uses
> primary weights of 1, 2, or 3 *bytes*. (ICU uses 4-byte weights for
> unassigned-implicit weights, and in tailorings if needed.) They are
> unrelated values, although they provide the same sort order (except for
> the intentional CLDR reshufflings of some numerical symbols and such).
>
> Conformance to the algorithms requires you to get the same order, but
> does not require you to get the same sort keys.
>
> This would seem to align with TR35 which says:
>
> "Note: The particular primary lead bytes for Hani vs. IMPLICIT vs.
> TRAILING are only an example" suggesting that Hani is calculated with
> leading bytes 0xFB per TR10 and the [fixed first implicit] can be used
> to generate weights for radicals (and other non specified code points)
>
> No, it refers to your freedom of choice of range and bit-distribution
> algorithm, as for ICU as I said above.
>
> Thanks in advance, -Kip
>
> Best regards,
> markus
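
As background to question (3), the 0xFB top byte comes from the way UTS #10 derives a pair of 16-bit implicit primary weights from a code point. A minimal Python sketch of that derivation follows. The base constants are the ones given in the UTS #10 implicit-weights section; deciding which base applies to a given code point needs Unicode property data and is left to the caller here, and the function name is hypothetical. These 16-bit values are the DUCET-style weights, not the byte-based FractionalUCA.txt weights discussed above.

```python
# Bases from UTS #10 (implicit weights); which one applies depends on
# Unicode properties that this sketch does not look up.
BASE_CORE_HAN = 0xFB40    # Unified_Ideograph in the CJK Unified / Compatibility Ideographs blocks
BASE_OTHER_HAN = 0xFB80   # Unified_Ideograph in other blocks
BASE_UNASSIGNED = 0xFBC0  # any other code point without an explicit weight

def implicit_primary_weights(code_point: int, base: int) -> tuple:
    """Return the (AAAA, BBBB) pair of 16-bit primary weights.

    AAAA carries the base plus the high bits of the code point;
    BBBB carries the low 15 bits with the top bit set.
    """
    aaaa = base + (code_point >> 15)
    bbbb = (code_point & 0x7FFF) | 0x8000
    return aaaa, bbbb

if __name__ == "__main__":
    for cp, base in ((0x4E00, BASE_CORE_HAN), (0x20000, BASE_OTHER_HAN)):
        aaaa, bbbb = implicit_primary_weights(cp, base)
        print(f"U+{cp:04X} -> [{aaaa:04X}] [{bbbb:04X}]")
```

Because only the resulting order is required for conformance, an implementation using FractionalUCA.txt can place its implicit range elsewhere as long as the relative order is preserved.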

From cldr-users at unicode.org Sun Nov 26 23:50:06 2017
From: cldr-users at unicode.org (Markus Scherer via CLDR-Users)
Date: Sun, 26 Nov 2017 21:50:06 -0800
Subject: Collation / Fractional UCA / Implicit Weights Questions
In-Reply-To: <09B55748-A847-4F55-98AE-99D94E3FDB97@gmail.com>
References: <0985440F-E0A6-460B-898D-0EFA21F957E4@gmail.com> <09B55748-A847-4F55-98AE-99D94E3FDB97@gmail.com>
Message-ID:

On Sun, Nov 26, 2017 at 6:02 PM, Kip Cole wrote:

> So now I understand better about the application of the radical data and I
> need to decide where to place them. You note: "For ICU, I move the
> implicit-weight lead bytes much higher, to make more room for large Han
> tailorings. You can choose your implicit-weight allocation freely"
>
> Where do you place them? (I know, I should read the code and I will but
> the learning curve is steep!)

I have a piece of code in the ICU "genuca" tool (not one of the
installed ICU tools) that takes the number of Han characters for which
we need implicit primaries (from one of the early lines in
FractionalUCA.txt) and calculates the number of lead bytes for 3-byte
weights with a certain gap size (for tailoring between Han characters).

Given the current gap size, it uses three lead bytes FB..FD. FE is for
4-byte unassigned-implicit primaries, and FF is for "trailing weights"
where there are currently only a couple including for U+FFFD and U+FFFF.

See https://sites.google.com/site/icusite/design/collation/bytes

These may move in the future when there are more Han characters, we
decide on a different gap size, leave more room for trailing weights,
etc.

The primary lead bytes from somewhere near 80 to currently FA are used
for large CJK tailorings, so that we get a decent number of two-byte
weights.

markus
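
The kind of calculation described in this last message can be approximated as below. This is a hedged Python sketch, not the genuca code: the function name, the choice of 254 usable values for each of the two trailing bytes, and the reading of "gap" as the number of unused weights left after each character are all assumptions made for illustration, and the sample counts are invented.

```python
def lead_bytes_needed(han_count: int, gap: int, values_per_byte: int = 254) -> int:
    """Estimate how many lead bytes a run of 3-byte primaries needs.

    Each lead byte is assumed to provide values_per_byte ** 2 three-byte
    weights; each character is assumed to consume gap + 1 of them.
    """
    if han_count < 0 or gap < 0 or values_per_byte <= 0:
        raise ValueError("arguments must be non-negative, with a positive byte range")
    weights_per_lead_byte = values_per_byte * values_per_byte
    slots_required = han_count * (gap + 1)
    # Ceiling division using integers only.
    return -(-slots_required // weights_per_lead_byte)

if __name__ == "__main__":
    # Invented inputs, only to show how count and gap interact.
    for count, gap in ((64516, 0), (64517, 0), (90000, 1)):
        print(count, gap, lead_bytes_needed(count, gap))
```

A larger gap leaves more room for tailoring between Han characters at the cost of more lead bytes, which is the trade-off mentioned in the message.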