Re: Chén , Shěn and 沈 pinyin confusion

Tue Sep 13 15:06:11 CDT 2016

Hi Mark,

Commenting as a Chinese speaker (and not a dev).

Quite a few characters in Chinese have more than one pronunciation. In
contexts such as people's names, it often comes down to which pronunciation
their parents preferred while naming them. CLDR might have data on all the
possible pronunciations of a character, but a phonebook application should
allow users to override the inferred pronunciation of a name.
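
A rough sketch of that override idea, in Python. The class and field names
here are hypothetical (they are not part of ICU, CLDR or SQLite), and the
pinyin strings are just illustrative values:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Contact:
    """A phonebook entry; all names here are hypothetical."""
    name: str
    # Whatever reading the default lookup produced for this name.
    inferred_pinyin: str
    # Filled in only when the user corrects the reading by hand.
    user_pinyin: Optional[str] = None

    @property
    def sort_pinyin(self) -> str:
        # A user-supplied reading always wins over the inferred one.
        return self.user_pinyin or self.inferred_pinyin


contact = Contact(name="沈括", inferred_pinyin="chen kuo")
contact.user_pinyin = "shen kuo"  # the user fixes the reading
print(contact.sort_pinyin)  # shen kuo
```

The point is only that the sort string is stored per contact, so a correction
made by the user survives regardless of what the default data says.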

There is one caveat for collation, though. Collation is usually done on
surnames in Chinese. Surnames in China (and other Chinese-speaking regions)
follow a strict convention, so in the context of a surname, 沈 is 99% likely
to be shěn rather than chén. Similar examples off the top of my head: 华
(usually huá, as a surname huà) and 单 (usually dān, as a surname shàn). One
should also take care of compound surnames
<https://en.wikipedia.org/wiki/Chinese_compound_surname> (rare but not that
rare).
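
To make the surname point concrete, here is a minimal Python sketch. It
assumes names are written surname first, and it uses small hand-written
tables (hypothetical, not taken from CLDR) with tone-numbered pinyin. It is
not a replacement for a real collator, and it will misparse a single-character
surname that happens to be followed by the second character of a compound
surname:

```python
# Surname-specific readings; hand-written for illustration, not CLDR data.
SURNAME_READINGS = {
    "沈": "shen3",  # not chen2
    "华": "hua4",   # not hua2
    "单": "shan4",  # not dan1
}

# A couple of two-character (compound) surnames as examples.
COMPOUND_SURNAMES = {
    "欧阳": "ou1 yang2",
    "司马": "si1 ma3",
}


def split_surname(full_name):
    """Split a surname-first name, trying a compound surname first."""
    prefix = full_name[:2]
    if prefix in COMPOUND_SURNAMES:
        return prefix, full_name[2:]
    return full_name[:1], full_name[1:]


def surname_sort_key(full_name):
    """Sort by the surname reading, then by the given-name characters."""
    surname, given = split_surname(full_name)
    reading = COMPOUND_SURNAMES.get(surname) or SURNAME_READINGS.get(surname)
    # Surnames missing from both tables sort last, by their raw character.
    return (reading is None, reading or surname, given)


names = ["沈括", "华佗", "单雄信", "欧阳修"]
print(sorted(names, key=surname_sort_key))
# ['华佗', '欧阳修', '单雄信', '沈括']
```

In a real phonebook the tables would need to be far larger, and the final
ordering would still go through the normal pinyin collation; the sketch only
shows where a surname-specific reading would take priority.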

I'm not certain how much support CLDR provides for this use case.

Thanks
k

On Tue, Sep 13, 2016 at 12:33 PM, Mark Robbie <mr at eibbor.co.uk> wrote:

> Hi,
>
>
>
> We are using ICU and CLDR with SQLite. I am not a software developer but a
> user of the output.
>
>
>
> We have had some comments from Chinese colleagues on name sorting and I am
> unsure if what we have is correct or if our development team is expected
> to use the tools in a different way. We are currently sorting
> the phonebook by pinyin and an example of a comment we have had is
> regarding “沈” which ends up being sorted as Chen, but our China team are
> saying it should be Shen.
>
>
>
> I am trying to figure out if the utilities should come up with the
> generally accepted match out of the box, or if “沈” really does map to 2
> pinyin equivalents, or if our dev team is supposed to override the default
> rule to make Chen a Shen. I did notice in CLDR 24 for zh.xml that there is
> an additional section called compounds which then says “Here 沈 collates as
> shěn/7stk/rad85, between 弞 7/stk/rad57, 审 8stk/rad40”. I have not a clue
> how to interpret this but am wondering if this means to override the
> mapping to chén earlier in the table and if this was something learned in
> CLDR for v24 onwards?
>
>
>
> Not being able to read Chinese I am unsure if there will be loads of these
> examples or only a few and I believe our dev team have a similar problem
> too and are relying on the default collations.
>
>
>
> Any advice is very much appreciated.
>
>
>
> PS: I did visit some other sites like Chinese tools and on searching for “沈”
> was offered Chén, Shěn and Tán as pinyin equivalents so I guess there is
> more than 1. I am just wondering if for names (which in our case it is a
> phonebook) there is some common knowledge it can only be Shěn.
>
>
>
> I also managed to pin down a passing Chinese work colleague but all he
> could say was is only and Chén is a ‘suggestion’ rather than an actual match
> (and then exited stage left in haste) – is that correct?
>
>
>
> Kind regards,
>
>
>
> Mark Robbie,
>
>
>