CJK Ideograph Encoding Velocity (was: Re: Unicode Emoji 11.0 characters now ready for adoption!)

Tue Mar 6 02:09:49 CST 2018

Dear Ken,

the context of the question was how many characters in modern use are
being encoded. Part of the answer is that there are several thousand
Chinese characters used in names of people or places that are still to
be encoded. The limit of 1,000 characters per working set per member
was introduced for working set 2017; this is a new thing. If the same
per-member limit is applied to future working sets, then some of the
characters identified in 2017 will have to wait for years. Around 500
have been included in working set 2017. Some will be included in the
following working set, which will most likely be in 2020, and if there
is then also a limit of 1,000 characters per member, not all of them
would be included. That would mean some would have to wait until 2022
before they can be submitted to IRG, which means at least 2027 before
they are encoded. Names of people and places are not the only CJK
unified ideographs that need to be encoded, but they illustrate the
problem: if future working sets have a limit of 1,000 per member, with
submissions every 2 or 3 years, then the encoding of CJK unified
ideographs is delayed by years.
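
As a rough illustration of the delay described above, the sketch below
steps a backlog through successive working sets under a per-member cap.
The backlog of 2,500 characters, the 2025 working set, and the
assumption that the whole cap is available for these characters are
hypothetical values chosen for illustration, not IRG figures. The
five-year gap between submission and encoding is taken from the 2022
to 2027 estimate above.

```python
def schedule(backlog, set_years, cap_per_set, encoding_lag):
    """Step a backlog through working sets under a per-set cap.

    Returns (rows, remaining), where each row is
    (submission_year, earliest_encoding_year, characters_submitted).
    """
    rows = []
    for year in set_years:
        if backlog <= 0:
            break
        taken = min(backlog, cap_per_set)
        rows.append((year, year + encoding_lag, taken))
        backlog -= taken
    return rows, backlog


# Hypothetical inputs: 2,500 characters still waiting after working set
# 2017, and working sets in 2020, 2022 and (a guess) 2025.
rows, remaining = schedule(
    backlog=2500,
    set_years=[2020, 2022, 2025],
    cap_per_set=1000,   # limit stated for working set 2017
    encoding_lag=5,     # assumption: 2022 submission, 2027 encoding
)

for submitted, encoded, count in rows:
    print(f"submitted {submitted}, encoded no earlier than {encoded}: {count}")
print("still waiting:", remaining)
```

With a larger backlog or a steady inflow of newly identified
characters, the last rows simply move further into the future.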

On 06.03.2018 01:40, Ken Whistler via Unicode wrote:
> John,
>
> I think this may be giving the list a somewhat misleading picture of
> the actual statistics for encoding of CJK unified ideographs. The
> "500 characters a year" or "1000 characters a year" limits are
> administrative limits set by the IRG for national bodies (and others)
> submitting repertoire to the "working set" that the IRG then segments
> into chunks for processing to prepare new increments for actual
> encoding.
>

Here I was referring to the number of CJK unified ideographs that the
People's Republic of China can submit to IRG; the numbers are of course
different for CJK unified ideographs as a whole. A limit of 1,000 per
working set means that the number of CJK unified ideographs in the
People's Republic of China awaiting submission to IRG is most likely to
increase, not decrease, for decades to come. For other IRG members that
still have characters to submit, a limit of 1,000 per working set most
likely leads to a decrease in the number of CJK unified ideographs
awaiting submission over time. In short, the administrative limit of
1,000 works to a degree for most IRG members, but not for the People's
Republic of China.

> In point of fact, if we take 1991 as the base year, the *average*
> rate of encoding new CJK unified ideographs now stands at 3379 per
> annum (87,860 as of Unicode 10.0). By "encoding" here, I mean, final,
> finished publication of the encoded characters -- not the larger
> number of potentially unifiable submissions that eventually go into a
> publication increment. There is a gradual downward drift in that
> number over time, because of the impact on the stats of the "big
> bang" encoding of 42,711 ideographs for Extension B back in 2001, but
> recently, the numbers have been quite consistent with an average
> incremental rate of about 3000 new ideographs per year:
>

From 1991 to 2001, 70,207 were encoded, which is around seven thousand
a year. From 2002 to 2018, however, only 17,675 were encoded, so around
one thousand a year.
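
For anyone who wants to check the arithmetic, here is a minimal sketch
of the averages mentioned so far. It assumes the span is counted as
last year minus first year, and it takes 2017 (the release year of
Unicode 10.0) as the end year for the 87,860 figure; both are
assumptions about how the averages were computed, not something stated
in the thread.

```python
def per_year(count, first_year, last_year):
    """Average characters per year over last_year - first_year years."""
    return count / (last_year - first_year)


# Cumulative average with 1991 as the base year (87,860 as of Unicode 10.0).
overall = per_year(87_860, 1991, 2017)

# The same history split at the end of 2001.
early = per_year(70_207, 1991, 2001)
late = per_year(17_675, 2002, 2018)

print(f"1991-2017: {overall:.0f} per year")
print(f"1991-2001: {early:.0f} per year")
print(f"2002-2018: {late:.0f} per year")
```

Counting the spans inclusively (11 and 17 years) lowers the two split
figures somewhat but leaves the orders of magnitude the same.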

> 5762 added for Extension E in 2015
>

These 5762 were submitted to IRG in 2001, so 14 years from submission 
to encoding.

> 7463 added for Extension F in 2017
>
> ~ 4934 to be added for Extension G, probably to be published in 2020
>
> If you run the average calculation including Extension G, assuming
> 2020, you end up with a cumulative per annum rate of 3200, not much
> different than the calculation done as of today.
>
> And as for the implication that China, in particular, is somehow
> limited by these numbers, one should note that the vast majority of
> Extension G is associated with Chinese sources. Although a substantial
> chunk is formally labeled with a "UK" source this time around, almost
> all of those characters represent a roll-in of systematic
> simplifications, of various sorts, associated with PRC usage. (People
> who want to check can take a look at L2/17-366R in the UTC document
> registry.)
>

Extension G came before the limit of 1,000 characters per member.
Whatever the UK-submitted characters were, the largest single Chinese
source was in fact over one thousand Zhuang characters submitted by the
People's Republic of China, not "systematic simplifications". It would
certainly be incorrect to think that the vast majority of CJK unified
ideographs still to be encoded are "systematic simplifications".

Regards
John

> --Ken
>
>
> On 3/5/2018 7:13 AM, via Unicode wrote:
>> Dear All,
>>
>> to simplify discussion I have split the points.
>>
>>>
>>>>
>>>>
>>>> On 2018/03/01 12:31, via Unicode wrote:
>>>>
>>>>> Third, I cannot confirm or deny the "500 characters a year" limit,
>>>>> but I'm quite sure that if China (or Hong Kong, Taiwan,...) had a
>>>>> real need to encode more characters, everybody would find a way to
>>>>> handle these.
>>
>>
>> Chinese characters for Unicode first go to IRG (or ISO/IEC
>> JTC1/SC2/WG2/IRG). The limit of 500 a year for China is an average
>> based on the IRG #48 document regarding working set 2017,
>> http://appsrv.cse.cuhk.edu.hk/~irg/irg/irg48/IRGN2220_IRG48Recommends.pdf
>> which explicitly states "each submission shall not exceed 1,000
>> characters". The People's Republic of China, which hopefully we can
>> all agree has a population of over 1,000,000,000, is one member of
>> IRG and was therefore limited to submitting at most 1,000
>> characters. The earliest possible date for the next working set is
>> two or three years later, that is 2019 or 2020, so that is an
>> average limit of either 500 or 333 characters a year.
>>
>> Regards
>> John
>>
>>
>>
>>
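
The averaging in the quoted message above can be sketched as follows.
It spreads the cap of 1,000 characters per submission evenly over the
gap until the next working set; the function name and the even
spreading are illustrative assumptions, not an IRG rule.

```python
CAP_PER_WORKING_SET = 1000  # "each submission shall not exceed 1,000 characters"


def yearly_average(cap, years_between_sets):
    """Per-member cap spread evenly over the years between working sets."""
    return cap / years_between_sets


for gap in (2, 3):
    average = int(yearly_average(CAP_PER_WORKING_SET, gap))
    print(f"next working set after {gap} years: about {average} a year")
```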