Discrepancies between kTotalStrokes and kRSUnicode in the Unihan database - repost all ascii

Tue Sep 9 16:22:45 CDT 2014

Thanks for your long and detailed reply Richard.  (The full version came to
me directly so I could see it.)  It will take me some time to digest it, but
since you suggest I submit something to UTC, I want to clarify the extent
of my knowledge of and ongoing involvement with Han characters.

I started out life as a linguist but have worked in software for the past 20
years. My main work now involves web crawling and page and entity
identification focusing strongly on English language sources.  I ran into
the issue I've described on this mailing list while doing a personal project
involving correlating Sino-Japanese and Sino-Korean vocabulary.  I actually
am more interested in the readings of the characters (Japanese On and Korean
Eum) than the characters themselves, but I am trying to leverage the fact
that the two languages normally write cognate Sino-X words with the same
characters (plus or minus variations in form).

I wanted to have stroke counts of the form Total-Rad-Residual such as I am
used to from Jack Halpern's Japanese character dictionaries and was fully
confident that I could make them by simply combining the kTotalStrokes and
kRSUnicode fields in the manner I indicate in my message.  But I started
noticing occasional examples where the implied radical stroke counts seemed
too large or too small, and I modified a tool I already had to try to detect
cases of this algorithmically.
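
The combination described above can be sketched roughly as follows. This is
an illustration only, not the tool mentioned in this message. It assumes
that "Rad" means the radical stroke count implied by subtracting the
kRSUnicode residual from kTotalStrokes, and that only the first value of
each field is used. The names summarize, is_suspicious, and
KNOWN_RADICAL_STROKES are hypothetical, and the table holds just two sample
entries.

```python
import re
from typing import NamedTuple, Optional

# kRSUnicode values look like "85.7" or "120'.4": radical number, optional
# apostrophe(s) marking a simplified radical form, then the residual strokes.
RS_PATTERN = re.compile(r"^(\d+)('*)\.(-?\d+)$")

# Hypothetical, deliberately incomplete table of the stroke counts that the
# written forms of a radical can have. Two sample entries for illustration.
KNOWN_RADICAL_STROKES = {
    61: {3, 4},  # heart: 3-stroke left-side form, 4-stroke full form
    85: {3, 4},  # water: 3-stroke left-side form, 4-stroke full form
}


class StrokeSummary(NamedTuple):
    total: int
    radical_number: int
    residual: int
    implied_radical_strokes: int

    def label(self) -> str:
        # Total-Rad-Residual, with Rad taken as the implied radical strokes.
        return f"{self.total}-{self.implied_radical_strokes}-{self.residual}"


def summarize(k_total_strokes: str, k_rs_unicode: str) -> Optional[StrokeSummary]:
    """Combine the two field values; return None if either cannot be parsed."""
    totals = k_total_strokes.split()
    rs_values = k_rs_unicode.split()
    if not totals or not rs_values or not totals[0].isdigit():
        return None
    match = RS_PATTERN.match(rs_values[0])  # assumption: first value only
    if match is None:
        return None
    total = int(totals[0])
    residual = int(match.group(3))
    return StrokeSummary(total, int(match.group(1)), residual, total - residual)


def is_suspicious(summary: StrokeSummary) -> Optional[bool]:
    """True if the implied radical stroke count matches no known form.

    Returns None when the radical is missing from the sample table.
    """
    known = KNOWN_RADICAL_STROKES.get(summary.radical_number)
    if known is None:
        return None
    return summary.implied_radical_strokes not in known
```

With kTotalStrokes "10" and kRSUnicode "85.7" the label comes out as 10-3-7
and the entry is not flagged. With kTotalStrokes "12" and the same
kRSUnicode the implied radical count is 5, which the sample table does not
allow, so the entry is flagged.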

I don't know Chinese and I have a lot of trouble making out Chinese
characters when they are printed in normal size due to lack of familiarity made
worse by poor eyesight and a touch of dyslexia.  I am certainly willing to
give UTC a complete list of characters (through Extension D) and their
status as suspicious or not along with some stats that the tool uses to make
its decisions. In fact, I already have that list.

Beyond that I might be able to commit to submitting a list of kTotalStrokes
values that should be corrected to match the kRSUnicode values.  I definitely
do not have either the time or the knowledge to determine the correctness of
the kRSUnicode values or do anything with variants.

But I'm not sure I am the best person to do this.  Based on the information
about CDL I see on the Wenlin Institute website I sense you already have a
full compositional model and could use it to produce a list of corrections
that would be far more accurate than anything I could do.

In terms of changing or adding fields, while I think the original separation
of kTotalStrokes and kRSUnicode was a poor design choice (though maybe
unavoidable for historical reasons), I'm thinking more and more that it's
not worth making a change just to fix the issue I'm raising, and a better
next step would be to represent characters as specific formally recognized
radical variants (with fixed stroke counts) + residual stroke counts.  This
would be a first step towards a compositional model but could be done
without getting into all the complexity and difficulty of a full recursive
model.
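
One way to picture the radical variant + residual representation is the
minimal sketch below. It is a hypothetical data model, not an existing
Unihan field or a worked-out proposal. The class names and the two sample
variants are assumptions made for illustration. The point it shows is that
the total is derived from the other two numbers, so it cannot disagree with
them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RadicalVariant:
    """A specific written form of a radical, with a fixed stroke count."""
    radical_number: int
    name: str
    strokes: int


@dataclass(frozen=True)
class Decomposition:
    """A character described as one radical variant plus residual strokes."""
    variant: RadicalVariant
    residual: int

    @property
    def total(self) -> int:
        # Derived rather than stored, so it always matches the parts.
        return self.variant.strokes + self.residual

    def label(self) -> str:
        return f"{self.total}-{self.variant.strokes}-{self.residual}"


# Sample variants of radical 85 (water); names are illustrative only.
WATER_FULL = RadicalVariant(85, "water, full form", 4)
WATER_SIDE = RadicalVariant(85, "water, left-side form", 3)
```

For example, Decomposition(WATER_SIDE, 7) has a total of 10 and the label
10-3-7, while the same residual with WATER_FULL gives a total of 11.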

What do you think?  Feel free to respond off-list if you prefer.

-- John
