Re: Chén , Shěn and 沈 pinyin confusion

Tue Sep 13 15:52:20 CDT 2016

Peter,

Thank you for your response.

That, in essence, is what our application is trying to do: transform to pinyin, then sort by the pinyin (while displaying the Chinese text). Perhaps we are using the utilities at our disposal incorrectly?
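
To make the approach concrete, here is a minimal sketch in Python of "transform to pinyin, sort by the pinyin, display the Chinese text". It does not call ICU or CLDR: the two reading tables are tiny hand-made stand-ins (an assumption, for illustration only) for what the Han-Latin and Han-Latin/Names transforms would supply, and the function names are hypothetical. It also assumes a single-character surname written first, so compound surnames are not handled.

```python
import unicodedata

# Hypothetical stand-in tables; real readings would come from the CLDR
# Han-Latin (general) and Han-Latin/Names (surname) transforms.
DEFAULT_READING = {"沈": "chén", "华": "huá", "单": "dān", "陈": "chén", "王": "wáng"}
SURNAME_READING = {"沈": "shěn", "华": "huà", "单": "shàn"}


def name_to_pinyin(name):
    """Use the surname reading for the first character, default readings after."""
    surname, given = name[0], name[1:]
    parts = [SURNAME_READING.get(surname, DEFAULT_READING[surname])]
    parts += [DEFAULT_READING[ch] for ch in given]
    return " ".join(parts)


def sort_key(name):
    """Sort on toneless pinyin first, then on the toned form as a tiebreaker."""
    pinyin = name_to_pinyin(name)
    decomposed = unicodedata.normalize("NFD", pinyin)
    toneless = "".join(c for c in decomposed if not unicodedata.combining(c))
    return (toneless, pinyin)


names = ["王单", "沈华", "单华", "陈华"]
# The Chinese text is what gets displayed; pinyin is only the sort key.
print(sorted(names, key=sort_key))
# ['陈华', '单华', '沈华', '王单']
```

With the surname table in place, 沈华 sorts under "shen" rather than next to 陈华 under "chen", which is the behaviour our China team expects.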

If it makes any difference, I think we are using CLDR 22, which is a little old, so I am not sure what limitations apply or whether the Names variant existed then.

Is it possible to tell from the zh.xml file alone how names would resolve to the most likely result, or is it trickier than that?

Also, can you advise how to invoke the Han-Latin Names variant?

> On 13 Sep 2016, at 21:28, Peter Edberg <pedberg at apple.com> wrote:
> 
> CLDR transforms do make this distinction. CLDR has a Names variant of the Han-Latin transform that is specifically intended for surnames; this does in fact transform 沈, 华, and 单 using the name readings given below, as well as doing the same for a number of other characters.
> 
> We do not currently have a collation variant that sorts by surname readings. However, one could emulate that by first transforming to pinyin using the Han-Latin/Names transform, and then sorting using the pinyin result.
> 
> - Peter E
> 
> 
>> On Sep 13, 2016, at 1:06 PM, kz <kazede at google.com> wrote:
>> 
>> Hi Mark,
>> 
>> Commenting as a Chinese speaker (and not a dev).
>> 
>> Quite a few characters in Chinese have more than one pronunciation. In contexts such as people's names, it often comes down to which pronunciation their parents preferred when naming them. CLDR might have data on all the possible pronunciations of a character, but a phonebook application should allow users to override the inferred pronunciation of a name.
>> 
>> There is one caveat for collation, though. Collation is usually done on surnames in Chinese. Surnames in China (and other Chinese-speaking regions) follow a strict convention, so in the context of a surname, 沈 is 99% likely to be shěn rather than chén. Similar examples off the top of my head: 华 (usually huá, as a surname huà) and 单 (usually dān, as a surname shàn). One should also take care of compound surnames (rare, but not that rare).
>> 
>> I'm not certain how much support CLDR provides for this use case.
>> 
>> 
>> Thanks
>> k
>> 
>>> On Tue, Sep 13, 2016 at 12:33 PM, Mark Robbie <mr at eibbor.co.uk> wrote:
>>> Hi,
>>> 
>>> We are using ICU and CLDR with SQLite. I am not a software developer but a user of the output.
>>> 
>>> We have had some comments from Chinese colleagues on name sorting, and I am unsure whether what we have is correct or whether our development team is expected to use the tools in a different way. We are currently sorting the phonebook by pinyin, and one example of a comment we have had concerns “沈”, which ends up being sorted as Chen, but our China team says it should be Shen.
>>> 
>>> I am trying to figure out whether the utilities should come up with the generally accepted match out of the box, whether “沈” really does map to two pinyin equivalents, or whether our dev team is supposed to override the default rule to turn Chen into Shen. I did notice that zh.xml in CLDR 24 has an additional section called compounds, which says “Here 沈 collates as shěn/7stk/rad85, between 弞 7/stk/rad57, 审 8stk/rad40”. I do not have a clue how to interpret this, but I am wondering whether it overrides the mapping to chén earlier in the table, and whether this was something introduced from CLDR 24 onwards?
>>> 
>>> Not being able to read Chinese, I am unsure whether there will be loads of these examples or only a few. I believe our dev team has a similar problem and is relying on the default collations.
>>> 
>>> Any advice is very much appreciated.
>>> 
>>> P.S. I did visit some other sites, such as Chinese Tools, and on searching for “沈” was offered Chén, Shěn and Tán as pinyin equivalents, so I guess there is more than one. I am just wondering whether, for names (ours is a phonebook), it is common knowledge that it can only be Shěn.
>>> 
>>> I also managed to pin down a passing Chinese work colleague, but all he could say was that it is only Shěn and that Chén is a ‘suggestion’ rather than an actual match (and then exited stage left in haste). Is that correct?
>>> 
>>> Kind regards,
>>> 
>>> Mark Robbie,
>>> 
>>> _______________________________________________
>>> CLDR-Users mailing list
>>> CLDR-Users at unicode.org
>>> http://unicode.org/mailman/listinfo/cldr-users