From pedberg at apple.com Fri Sep 9 16:38:50 2016
From: pedberg at apple.com (Peter Edberg)
Date: Fri, 09 Sep 2016 14:38:50 -0700
Subject: CLDR Version 30 alpha available

Dear CLDR Users,

The alpha draft version of Unicode CLDR v30 is available for testing. The main improvements include:

• New format and preference structure has been added to support week designations such as "the week of August 10" or "week 3 of March".
• New data items have been added to support relative times such as "3 Fridays ago" or "this hour".
• New data can be used to generate labels for groups of related characters in character pickers.
• The structure for emoji annotations has been revised, and the data has been significantly updated.
• Unicode support is updated to 9.0, including updated Unihan readings for the pinyin collation and Han-Latin transforms, and support for new script codes and number systems. Support is also added for region codes EZ, UN.
• The set of language codes for translation has been updated, with a significant increase in the total number of translated language names.
• The CLDR 30 Survey Tool data collection and additional bug fixing resulted in a net increase in data items of about 8.6%, with an additional 5.6% of items changed.

Draft release note: http://cldr.unicode.org/index/downloads/cldr-30
Draft charts: http://www.unicode.org/cldr/charts/dev/
Draft data tag: http://www.unicode.org/repos/cldr/tags/release-30-d01

The final release of CLDR 30 is targeted for the end of September. Please provide any feedback on the alpha draft version by filing a ticket as described here: http://cldr.unicode.org/index/bug-reports

Best regards,
Peter Edberg for the CLDR Project

From mr at eibbor.co.uk Tue Sep 13 14:33:51 2016
From: mr at eibbor.co.uk (Mark Robbie)
Date: Tue, 13 Sep 2016 20:33:51 +0100
Subject: Chén , Shěn and 沈 pinyin confusion

Hi,

We are using ICU and CLDR with SQLite.
I am not a software developer but a user of the output.

We have had some comments from Chinese colleagues on name sorting, and I am unsure whether what we have is correct or whether our development team is expected to use the tools in a different way. We are currently sorting the phonebook by pinyin, and one comment we have had concerns "沈", which ends up being sorted as Chen, but our China team are saying it should be Shen.

I am trying to figure out whether the utilities should come up with the generally accepted match out of the box, whether "沈" really does map to 2 pinyin equivalents, or whether our dev team is supposed to override the default rule to make Chen a Shen. I did notice in CLDR 24 that zh.xml has an additional section called compounds, which says "Here 沈 collates as shěn/7stk/rad85, between ? 7/stk/rad57, ? 8stk/rad40". I have not a clue how to interpret this, but I am wondering whether it overrides the mapping to chén earlier in the table, and whether this was something learned in CLDR for v24 onwards?

Not being able to read Chinese, I am unsure whether there will be loads of these examples or only a few. I believe our dev team have a similar problem too and are relying on the default collations.

Any advice is very much appreciated.

PS: I did visit some other sites like Chinese tools, and on searching for "沈" I was offered Chén, Shěn and T?n as pinyin equivalents, so I guess there is more than one. I am just wondering whether for names (which in our case is a phonebook) there is some common knowledge that it can only be Shěn.

I also managed to pin down a passing Chinese work colleague, but all he could say was that 沈 is only Shěn and that Chén is a "suggestion" rather than an actual match (and then exited stage left in haste). Is that correct?

Kind regards,

Mark Robbie
From kazede at google.com Tue Sep 13 15:06:11 2016
From: kazede at google.com (kz)
Date: Tue, 13 Sep 2016 13:06:11 -0700
Subject: Re: Chén , Shěn and 沈 pinyin confusion

Hi Mark,

Commenting as a Chinese speaker (and not a dev).

Quite a few characters in Chinese have more than one pronunciation. In contexts such as people's names, it often comes down to which pronunciation their parents preferred while naming them. CLDR might have data on all the possible pronunciations of a character, but a phonebook application should allow users to override the inferred pronunciation of a name.

There's just a caveat for collating though. Collation is usually done on surnames in Chinese. Surnames in China (and other Chinese-speaking regions) follow a strict convention, so in the context of a surname, 沈 is 99% likely to be shěn rather than chén. Similar examples off the top of my head: ? (usually hu?, as a surname hu?) and ? (usually d?n, as a surname sh?n). One should also take care of compound surnames (rare but not that rare).

I'm not certain how much support CLDR provides for this use case.

Thanks
k

From pedberg at apple.com Tue Sep 13 15:28:17 2016
From: pedberg at apple.com (Peter Edberg)
Date: Tue, 13 Sep 2016 13:28:17 -0700
Subject: Re: Chén , Shěn and 沈 pinyin confusion

CLDR transforms do make this distinction. CLDR has a Names variant of the Han-Latin transform that is specifically intended for surnames; this does in fact transform ?, ?, and ?
using the name readings given below, as well as doing the same for a number of other characters.

We do not currently have a collation variant that sorts by surname readings. However, one could emulate that by first transforming to pinyin using the Han-Latin/Names transform, and then sorting using the pinyin result.

- Peter E

> On Sep 13, 2016, at 1:06 PM, kz wrote:
>
> There's just a caveat for collating though. Collations are usually done on
> surnames in Chinese. Surnames in China (and other Chinese-speaking regions)
> follow a strict convention, so in the context of a surname, 沈 is 99% likely
> to be shěn rather than chén. Similar examples out of the top of my head: ?
> (usually hu?, as a surname hu?) and ? (usually d?n, as a surname sh?n). One
> should also take care of compound surnames (rare but not that rare).
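
Peter's suggestion (transform each name to pinyin first, then sort on the result) can be sketched without ICU as follows. This is only an illustration under stated assumptions: the tiny hand-written table stands in for the real Han-Latin/Names transform and is not CLDR data, and `SURNAME_READINGS`, `strip_tones` and `sort_key` are hypothetical names. The optional per-contact reading reflects kz's point that users should be able to override an inferred pronunciation.

```python
import unicodedata

# Hypothetical stand-in for the Han-Latin/Names transform. NOT CLDR data:
# it holds only the surname reading discussed in this thread, plus one
# other common surname for comparison.
SURNAME_READINGS = {
    "沈": "shěn",  # surname reading per this thread (general reading: chén)
    "陈": "chén",
}

def strip_tones(pinyin):
    """Drop tone marks so that 'shěn' compares as 'shen'."""
    decomposed = unicodedata.normalize("NFD", pinyin)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def sort_key(contact):
    """Build a sort key, preferring a user-supplied reading if present."""
    name, override = contact
    # Fall back to the name itself when the first character is not in the table.
    reading = override or SURNAME_READINGS.get(name[0], name)
    return (strip_tones(reading).lower(), reading, name)

# (name, optional user-supplied reading); the names are made up.
contacts = [("沈括", None), ("陈平", None), ("沈明", "chén")]
ordered = sorted(contacts, key=sort_key)
# Names in sorted order: 沈明 (user chose chén), 陈平, 沈括 (table gives shěn)
```

In a real application the table lookup would be replaced by whatever the transliteration library returns for the Names variant; the displayed text stays Chinese and only the key is Latin.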

From Mr at eibbor.co.uk Tue Sep 13 15:41:26 2016
From: Mr at eibbor.co.uk (Work)
Date: Tue, 13 Sep 2016 21:41:26 +0100
Subject: Re: Chén , Shěn and 沈 pinyin confusion
Message-ID: <5FBBB156-3359-4287-B13E-89EF7DB9D018@eibbor.co.uk>

K,

Thanks for the response.

With respect to the strict surname convention in China: is there an authoritative online reference for this that I can use to independently check our basic CLDR and ICU mapping implementation, and hence our sorting of names, and to identify where our implementation will attract criticism and so needs tailoring? I would rather not wait for my colleagues in China to drip-feed me complaints as they stumble over them, which I fear may end up happening.

Sent from my iPhone

From Mr at eibbor.co.uk Tue Sep 13 15:52:20 2016
From: Mr at eibbor.co.uk (Work)
Date: Tue, 13 Sep 2016 21:52:20 +0100
Subject: Re: Chén , Shěn and 沈 pinyin confusion
Message-ID: <6D31EC84-23B7-43C8-B291-BF72FDC50BF0@eibbor.co.uk>

Peter,

Thank you for your response.

That in essence is what our application is trying to do: transform to pinyin, then sort as pinyin (but display the Chinese text). But somehow we may be using the utilities at our disposal incorrectly?

If it makes any difference, I think we are using CLDR 22, which is a little old, so I am not sure of the limitations here or whether the Names variant was around then too.

Is it possible to tell from the zh.xml file alone how names would resolve to the most likely result, or is it trickier than that?

Also, can you advise how to invoke the Han-Latin Names variant?

From kazede at google.com Tue Sep 13 16:09:39 2016
From: kazede at google.com (kz)
Date: Tue, 13 Sep 2016 14:09:39 -0700
Subject: Re: Chén , Shěn and 沈 pinyin confusion
In-Reply-To: <6D31EC84-23B7-43C8-B291-BF72FDC50BF0@eibbor.co.uk>
References: <6D31EC84-23B7-43C8-B291-BF72FDC50BF0@eibbor.co.uk>

Hi Mark,

I don't know of an authoritative list of Chinese surname pronunciations, and a cursory Google search didn't reveal anything interesting.

From what Peter's saying though, it sounds like CLDR has decent data on this, so we might not need a second list.

Thanks
k

From markus.icu at gmail.com Tue Sep 13 16:47:02 2016
From: markus.icu at gmail.com (Markus Scherer)
Date: Tue, 13 Sep 2016 14:47:02 -0700
Subject: Re: Chén , Shěn and 沈 pinyin confusion
References: <6D31EC84-23B7-43C8-B291-BF72FDC50BF0@eibbor.co.uk>

The Names variant of the Han-Latin transform (e.g., via ICU Transliterator) should do this -- as a preprocessing step.

The CLDR/ICU Collator does not currently offer a tailoring that would do this automatically just while sorting. Adding such a variant would add at least a couple of 100kB to the data size.

For Chinese and Japanese, I suggest you add a pronunciation field (pinyin for zh-CN, Hiragana for ja); prefill it via the Transliterator, make it visible to the user, let them fix it; sort by that.

markus

From duerst at it.aoyama.ac.jp Tue Sep 13 21:57:44 2016
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Wed, 14 Sep 2016 11:57:44 +0900
Subject: Re: Chén , Shěn and 沈 pinyin confusion
References: <6D31EC84-23B7-43C8-B291-BF72FDC50BF0@eibbor.co.uk>

On 2016/09/14 06:47, Markus Scherer wrote:
> The Names variant of the Han-Latin transform (e.g., via ICU Transliterator)
> should do this -- as a preprocessing step.
>
> The CLDR/ICU Collator does not currently offer a tailoring that would do
> this automatically just while sorting. Adding such a variant would add at
> least a couple of 100kB to the data size.
>
> For Chinese and Japanese, I suggest you add a pronunciation field (pinyin
> for zh-CN, Hiragana for ja);

Both hiragana and katakana work. But in my experience, Katakana is way more frequent. Please make sure to accept both half-width and full-width Katakana; getting a message like "only full-width Katakana accepted" is very annoying when this can be done automatically.
Same for Hiragana->Katakana conversion.

> prefill it via the Transliterator,

This at first sight sounds like a neat idea for Japanese. However, I have never seen it (and living in Japan, I would have had ample occasion to see it). There is always a "pronunciation" (reading/yomi) field, but it's never pre-filled. My guess is that the reason for this is that there are just too many variations in Japanese names. For Chinese, there's usually just one reading, and occasionally (as discussed in this thread) two or more, but for Japanese, the percentages are different.

Regards, Martin.

--
Martin J. Dürst
Department of Intelligent Information Technology
College of Science and Engineering
Aoyama Gakuin University
Fuchinobe 5-1-10, Chuo-ku, Sagamihara 252-5258 Japan

From Mr at eibbor.co.uk Tue Sep 13 23:27:37 2016
From: Mr at eibbor.co.uk (Work)
Date: Wed, 14 Sep 2016 05:27:37 +0100
Subject: Re: Chén , Shěn and 沈 pinyin confusion
References: <6D31EC84-23B7-43C8-B291-BF72FDC50BF0@eibbor.co.uk>

From what has been said earlier by Markus and Peter, does anyone know whether 沈 transforms/transliterates to Shěn if the Names variant of the Han-Latin transform is invoked?

I think Peter's reply was saying it would, but I was not sure.

I will talk to the dev team about invoking the Names variant, and have a chat with the guys about the pronunciation field as a catch-all fallback.

At the minute the subject field mapping, when viewed as a sorted list, seems to be the big groan coming back at me, so maybe invoking the Names variant of the Han-Latin transform is a quick win while we look into the pronunciation suggestion.

Thanks again.
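
Martin's point that a reading field can accept any kana form and convert it automatically can be sketched with the Python standard library alone. This is a minimal sketch, not what any particular application does; the function name `to_fullwidth_katakana` is hypothetical. It relies on two facts: NFKC normalisation folds half-width Katakana into full-width forms, and the Hiragana block maps onto Katakana at a fixed code point offset.

```python
import unicodedata

def to_fullwidth_katakana(reading):
    """Normalise a kana reading to full-width Katakana."""
    # NFKC folds half-width Katakana (U+FF66..U+FF9F) into full-width forms
    # and composes a base kana with a following half-width voicing mark.
    text = unicodedata.normalize("NFKC", reading)
    out = []
    for ch in text:
        code = ord(ch)
        # Hiragana U+3041..U+3096 maps to Katakana at an offset of 0x60.
        if 0x3041 <= code <= 0x3096:
            ch = chr(code + 0x60)
        out.append(ch)
    return "".join(out)

# The same reading typed three ways yields one stored form.
from_hiragana = to_fullwidth_katakana("やまだ")
from_halfwidth = to_fullwidth_katakana("\uff94\uff8f\uff80\uff9e")  # half-width kana
from_fullwidth = to_fullwidth_katakana("ヤマダ")
```

Storing one canonical form means the sort key is the same however the user typed the reading, so no input needs to be rejected.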

From Mr at eibbor.co.uk Wed Sep 14 01:55:57 2016
From: Mr at eibbor.co.uk (Work)
Date: Wed, 14 Sep 2016 07:55:57 +0100
Subject: Re: Chén , Shěn and 沈 pinyin confusion
References: <6D31EC84-23B7-43C8-B291-BF72FDC50BF0@eibbor.co.uk>
Message-ID: <466BCF91-4715-48F6-8DF8-EA47FB5B5E1F@eibbor.co.uk>

For info, I tried using the transformation demo, selected Names and Names (Variant), pasted 沈 into the input and got Chén at the output.

Does this mean 沈 will never transform to Shěn, or is there some manual addition I need to make to the 'Compound 1' text box contents?
> > Sent from my iPhone > >> On 13 Sep 2016, at 22:47, Markus Scherer wrote: >> >> The Names variant of the Han-Latin transform (e.g., via ICU Transliterator) should do this -- as a preprocessing step. >> >> The CLDR/ICU Collator does not currently offer a tailoring that would do this automatically just while sorting. Adding such a variant would add at least a couple of 100kB to the data size. >> >> For Chinese and Japanese, I suggest you add a pronunciation field (pinyin for zh-CN, Hiragana for ja); prefill it via the Transliterator, make it visible to the user, let them fix it; sort by that. >> >> markus From Mr at eibbor.co.uk Wed Sep 14 05:44:26 2016 From: Mr at eibbor.co.uk (Work) Date: Wed, 14 Sep 2016 11:44:26 +0100 Subject: =?utf-8?Q?Re:_Ch=C3=A9n_,_Sh=C4=9Bn_and_=E6=B2=88_pinyin_confusi?= =?utf-8?Q?on?= In-Reply-To: <466BCF91-4715-48F6-8DF8-EA47FB5B5E1F@eibbor.co.uk> References: <6D31EC84-23B7-43C8-B291-BF72FDC50BF0@eibbor.co.uk> <466BCF91-4715-48F6-8DF8-EA47FB5B5E1F@eibbor.co.uk> Message-ID: I found a way - not sure I fully understand what I have typed in.... If I edit the Names Compound1 text box contents to say "Han-Latin/Names; Latin; Title" then I get a Sh?n in lieu of Ch?n. Although I still do not understand all of the fields meanings in the box yet - but will start reading docs more in this area. Sent from my iPhone > On 14 Sep 2016, at 07:55, Work wrote: > > For info I tried using the transformation demo and selected Names and Names (Variant) and pasted ? to the input and got Ch?n at the output. > > > Does this mean ? will never transform to Sh?n or there is some manual addition I need to make to the 'Compound 1' text box contents? > > Sent from my iPhone > >> On 14 Sep 2016, at 05:27, Work wrote: >> >> From what has been said earlier by Markus and Peter does anyone know if ? transforms/transliterates to Sh?n if the Names variant of Han-Latin transform is invoked ? >> >> I think Peter's reply was saying it would, but I was not sure. 
>> >> I will talk to Dev team about invoking the names variant and have a chat with guys about the pronunciation field as a catch all fall back. >> >> At the minute the subject field mapping when views as a sorted list seems to be the big groan coming back at me, so maybe the invoking the Names variant of Han-Latin transform is a quick win while we look into the pronunciation suggestion. >> >> Thanks again. >> >> Sent from my iPhone >> >>> On 13 Sep 2016, at 22:47, Markus Scherer wrote: >>> >>> The Names variant of the Han-Latin transform (e.g., via ICU Transliterator) should do this -- as a preprocessing step. >>> >>> The CLDR/ICU Collator does not currently offer a tailoring that would do this automatically just while sorting. Adding such a variant would add at least a couple of 100kB to the data size. >>> >>> For Chinese and Japanese, I suggest you add a pronunciation field (pinyin for zh-CN, Hiragana for ja); prefill it via the Transliterator, make it visible to the user, let them fix it; sort by that. >>> >>> markus > > > _______________________________________________ > CLDR-Users mailing list > CLDR-Users at unicode.org > http://unicode.org/mailman/listinfo/cldr-users From Mr at eibbor.co.uk Wed Sep 14 14:45:37 2016 From: Mr at eibbor.co.uk (Work) Date: Wed, 14 Sep 2016 20:45:37 +0100 Subject: Compound transforms/transliteration how to read? Message-ID: If I had a transform "Han-Latin/Names; Any-Latin; Latin-ASCII" How would I read this ? I think the above relies on anything which is not Han-Latin/Names 'falling through' from input to output unchanged and then the next transform is applied .... Is that correct and is it a how the transforms work ? I am also wondering if Any-Latin is not a recommended transform to use when considering other language systems ? (Else what's the point of the specific transforms like Hans-Latin) Finally what is the term for the /Names part of the first transform ? 
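
One way to read a compound ID such as "Han-Latin/Names; Any-Latin; Latin-ASCII" is as a pipeline: the ID is split on semicolons, each step runs over the output of the previous step, and text that a step has no rule for passes through unchanged. In an ID of the form Source-Target/Variant, the part after the slash is the variant. The Python sketch below is a toy model of that reading only. It does not call ICU, the lookup tables are hypothetical stand-ins for the real transform data, and real ICU output will differ in details such as spacing.

```python
import unicodedata

# Hypothetical stand-ins for the real transform data (toy tables only).
HAN_LATIN_NAMES = {"沈": "shěn"}              # surname reading
ANY_LATIN = {"沈": "chén", "伟": "wěi"}        # general readings


def table_step(table):
    """Build a step that maps the characters it knows and passes the rest through."""
    return lambda text: "".join(table.get(ch, ch) for ch in text)


def latin_ascii(text):
    """Toy Latin-ASCII: drop combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


STEPS = {
    "Han-Latin/Names": table_step(HAN_LATIN_NAMES),
    "Any-Latin": table_step(ANY_LATIN),
    "Latin-ASCII": latin_ascii,
}


def apply_compound(compound_id, text):
    """Apply each semicolon-separated step, left to right, to the running text."""
    for step_id in (part.strip() for part in compound_id.split(";")):
        text = STEPS[step_id](text)
    return text


result = apply_compound("Han-Latin/Names; Any-Latin; Latin-ASCII", "沈伟")
print(result)
```

In this toy model the first step converts only 沈, the second step picks up the remaining Han character and leaves the already-Latin text alone, and the last step removes the tone marks. Dropping the first step lets the general table decide the reading of 沈 instead.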

Kind Regards,
Mark

From Mr at eibbor.co.uk Wed Sep 14 14:51:58 2016
From: Mr at eibbor.co.uk (Work)
Date: Wed, 14 Sep 2016 20:51:58 +0100
Subject: Where do I find a full list of transforms and modifiers
Message-ID: <6C705C58-B78A-4718-9301-939E5DAB99C5@eibbor.co.uk>

I can see that transforms are mentioned in the transliteration user guide, but I am unsure where to find the full list, including the 'modifiers' (possibly an incorrect term) such as /Names and /UNGEGN, and what they are/mean.

Kind Regards,
Mark

From doug at ewellic.org Tue Sep 20 11:59:17 2016
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 20 Sep 2016 09:59:17 -0700
Subject: Where do I find a full list of transforms and modifiers
Message-ID: <20160920095917.665a7a7059d7ee80bb4d670165c8327d.9390808c8e.wbe@email03.godaddy.com>

Work wrote:

> I can see there are transforms mentioned in transliteration user
> guide, but unsure where to find the full list and whatever the
> 'modifier' (possibly incorrect term) such as /Names and /UNGEGN and
> what they are/mean.

Download this file:

http://unicode.org/Public/cldr/29/core.zip

and then extract this file from it:

common\bcp47\transform.xml

--
Doug Ewell | Thornton, CO, US | ewellic.org

From pedberg at apple.com Thu Sep 22 00:24:30 2016
From: pedberg at apple.com (Peter Edberg)
Date: Wed, 21 Sep 2016 22:24:30 -0700
Subject: CLDR Version 30 final candidate available
Message-ID: <555B65D5-D7DE-4DF1-8101-1FE5765C6D30@apple.com>

Dear CLDR Users,

The final candidate version of Unicode CLDR v30 is available for testing. The main improvements include:

• New format and preference structure has been added to support week designations such as “the week of August 10” or “week 3 of March”, though this structure may be refined in the future.
• New data items have been added to support relative times such as “3 Fridays ago” or “this hour”.
• New data can be used to generate labels for groups of related characters in character pickers.
• The structure for emoji annotations has been revised, and the data has been significantly updated.
• Unicode support is updated to 9.0, including updated Unihan readings for the pinyin collation and Han-Latin transforms, and support for new script codes and number systems. Some support is also added for region codes EZ, UN.
• The set of language codes for translation has been updated, with a significant increase in the total number of translated language names.
• The CLDR 30 Survey Tool data collection and additional bug fixing resulted in a net increase in data items of about 8.6%, with an additional 5.6% of items changed.

Draft release note: http://cldr.unicode.org/index/downloads/cldr-30
Draft charts: http://www.unicode.org/cldr/charts/dev/ (not yet updated for recent number symbol fixes)
Draft data tag: http://www.unicode.org/repos/cldr/tags/release-30-d03

The final release of CLDR 30 is targeted for the end of September. Please provide any feedback on the final candidate by filing a ticket as described here:
http://cldr.unicode.org/index/bug-reports

Best regards,
Peter Edberg for the CLDR Project
-------------- next part --------------
An HTML attachment was scrubbed...
URL: