Discrepancies between kTotalStrokes and kRSUnicode in the Unihan database - repost all ascii
John Armstrong
john.armstrong.rn3 at gmail.com
Tue Sep 9 16:22:45 CDT 2014
Thanks for your long and detailed reply Richard. (The full version came to
me directly so I could see it.) It will take me some time to digest it, but
since you suggest I submit something to UTC, I want to clarify the extent
of my knowledge of and ongoing involvement with Han characters.
I started out life as a linguist but have worked in software for the past 20
years. My main work now involves web crawling and page and entity
identification focusing strongly on English language sources. I ran into
the issue I've described on this mailing list while doing a personal project
involving correlating Sino-Japanese and Sino-Korean vocabulary. I actually
am more interested in the readings of the characters (Japanese On and Korean
Eum) than the characters themselves, but I am trying to leverage the fact that
the two languages normally write cognate Sino-X words with the same
characters (plus or minus variations in form).
I wanted to have stroke counts of the form Total-Rad-Residual such as I am
used to from Jack Halpern's Japanese character dictionaries and was fully
confident that I could make them by simply combining the kTotalStrokes and
kRSUnicode fields in the manner I indicate in my message. But I started
noticing occasional examples where the implied radical stroke counts seemed
too large or too small, and I modified a tool I already had to try to detect
cases of this algorithmically.
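
For concreteness, the cross-check amounts to something like the sketch below
(Python). The tab-separated "U+XXXX<TAB>kField<TAB>value" layout is the
standard Unihan export format, but the file name and the small table of
Kangxi radical stroke counts are illustrative assumptions, not taken from my
actual tool.

from collections import defaultdict

# Stroke counts for a few Kangxi radicals (radical number -> strokes in the
# full Kangxi form); a real check would cover all 214, and would have to
# decide how to treat variant forms of a radical (the three-stroke left-side
# form of 'water', say), which is exactly where the ambiguity lies.
KANGXI_STROKES = {1: 1, 9: 2, 30: 3, 61: 4, 75: 4, 85: 4, 102: 5, 120: 6, 167: 8}

def load_field(path, field):
    """Collect all values of one Unihan field, keyed by code point."""
    values = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue
            cp, key, val = line.rstrip("\n").split("\t")
            if key == field:
                values[cp].append(val)
    return values

def suspicious(total_strokes, rs_unicode):
    """Flag code points whose implied radical stroke count looks wrong."""
    flagged = []
    for cp, totals in total_strokes.items():
        total = int(totals[0].split()[0])       # first kTotalStrokes value
        for rs in rs_unicode.get(cp, [""])[0].split():
            radical, residual = rs.split(".")
            if radical.endswith("'"):
                continue                        # simplified radical forms; different counts
            implied = total - int(residual)     # strokes left over for the radical itself
            expected = KANGXI_STROKES.get(int(radical))
            if expected is not None and implied != expected:
                flagged.append((cp, total, rs, implied, expected))
    return flagged

totals = load_field("Unihan_IRGSources.txt", "kTotalStrokes")
rs = load_field("Unihan_IRGSources.txt", "kRSUnicode")
for entry in suspicious(totals, rs):
    print(entry)
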
I don't know Chinese and I have a lot of trouble making out Chinese
characters when they are printed at normal size, due to a lack of familiarity made
worse by poor eyesight and a touch of dyslexia. I am certainly willing to
give UTC a complete list of characters (through Extension D) and their
status as suspicious or not along with some stats that the tool uses to make
its decisions. In fact, I already have that.
Beyond that, I might be able to commit to submitting a list of kTotalStrokes
values that should be corrected to match the kRSUnicode values. I definitely
do not have either the time or the knowledge to determine the correctness of
the kRSUnicode values or to do anything with variants.
But I'm not sure I am the best person to do this. Based on the information
about CDL I see on the Wenlin Institute website I sense you already have a
full compositional model and could use it to produce a list of corrections
that would be far more accurate than anything I could do.
In terms of changing or adding fields, while I think the original separation
of kTotalStrokes and kRSUnicode was a poor design choice (though maybe
unavoidable for historical reasons), I'm thinking more and more that it's
not worth making a change just to fix the issue I'm raising, and a better
next step would be to represent characters as specific formally recognized
radical variants (with fixed stroke counts) + residual stroke counts. This
would be a first step towards a compositional model but could be done
without getting into all the complexity and difficulty of a full recursive
model.
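
To make that concrete, the representation I have in mind would look roughly
like the following (Python; the class and field names and the sample data are
hypothetical, just to illustrate the shape of the model):

from dataclasses import dataclass

@dataclass(frozen=True)
class RadicalVariant:
    kangxi_number: int   # 1..214
    form: str            # the specific recognized form, e.g. full vs. left-side 'water'
    strokes: int         # fixed stroke count of this particular form

@dataclass(frozen=True)
class RadicalStroke:
    variant: RadicalVariant
    residual: int        # strokes outside the radical

    @property
    def total(self) -> int:
        return self.variant.strokes + self.residual   # derived, never stored separately

WATER_FULL = RadicalVariant(85, "water (full form)", 4)
WATER_SIDE = RadicalVariant(85, "water (left-side form)", 3)

# U+6D77 (sea): the three-stroke side form of the water radical plus seven residual strokes.
sea = RadicalStroke(WATER_SIDE, 7)
assert sea.total == 10

The point is that the total stroke count becomes a derived value, so the kind
of inconsistency I described above could not arise in the first place.
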
What do you think? Feel free to respond off-list if you prefer.
-- John