UAX #38: Unicode Han Database (Unihan)

kDYC Property Record (Proposed Draft)

Version	Unicode 14.0.0 (Proposed)
Source	Richard Cook 曲理查
Date	2020-01-28

Introductory Notes for Reviewers

This document presents the UAX #38 record for a proposed new Unihan property kDYC defining dictionary mappings for the head entries of an influential Qīng Dynasty (1815) revision of the Eastern Hàn dictionary Shuōwén (121 CE), identifying the traditional core of the Hàn writing system (10,706 characters).
Key Features of the kDYC property data:
- Maps each of ~40,000 Unihan characters (including many variants of various kinds) to one or more entries in this Chinese dictionary.
- Includes ~62,000 references total, using a simple variant typology (documented below).
- Covers ~47% of the Unihan character set (including derived mappings; Unicode 13.0 Unihan defines properties for 93,858 CJK Unified Ideographs, including simplified-only and CJK Compatibility characters).
- Applications to general ongoing development of: Unihan, CJK, encoding model and repertory, cross-locale CJK variant folding, spoofing, IDNA, and IME.
- Completes the UAX #45 “DYC” subset with the full superset of source mappings from which it derives (UAX #45 should be updated with a reference to kDYC).
- Provides a common CJK framework for developing the (non-CJK) Seal Script encoding model (including property and font data), for determining the repertory (identifying and unifying duplicates across the various sources), developing the Seal code charts, indexes, and IME.
- Derives from database tables in development since 1994, and revised since 1998 in conjunction with UCS development.
- Reflects ~20,000 or more hours of development time (over 26 years, 1994-2020).
In the record below the Syntax and Description fields derive from (Cook 2003); early drafts of the current document were discussed in UTC Script Ad Hoc (2017-2020), and might be used to draft a related Unicode Technical Note (UTN).
The record below is formatted for insertion into UAX #38 at the appropriate alphabetical location (between kDefinition and kEACC; and adding kDYC to the AlphabeticalListing).

Property	kDYC
Status	Provisional
Category	Dictionary Indices
Introduced	14.0
Delimiter	space
Syntax	[A-D][0-9A-Za-z]-[0-9]{3}\.[1-4][1-9AB][0-5]
Description	The page position (typologized variant class) of this character, in the framework of the Chinese dictionary Shuō Wén Jiě Zì – Zhù by Duàn Yùcái (DYC). The character references have the form “TO-PPP.QMV”, in which: “T” = Type of match (A-D); “O” = Offset within that Type (0-9A-Za-z); “PPP.QMV” identifies the Variant Class, in which: “PPP” = Page reference, 3-digit zero-padded (001..752); “Q” = Quadrant on the page (1-4); “M” = offset within Q of the Main Seal form (正文, lexical head entry); “V” > 0 indicates a Variant Seal form (重文, variant of M).
Type (A-D): “A” = best available match (CJKUI most closely matching M, stroke-for-stroke); “B” = proper match (non-best, proper variant, closely/directly related), interchangeable or synonymous usage (同用); “C” = questionable match (possible/probable, unclearly/indirectly related); “D” = improper match (confusable/confused old/modern forms, lexical-source separation, cross-reference, component-form, etc.), incorrect or common substitution (通用).
Offset (0-9A-Za-z): sequential alphanumeric index within a given Type; a lower index has higher priority for the Type (“B0-” bears more proper/direct relation to M than does “B1-”; a common `kBigFive` form not in Type=A will have the “B0-” prefix).
Variant Class (PPP.QMV): All members of a given Variant Class have the same value “PPP.QMV” (pointing to the dictionary head entry for a single Seal form; see below); all members of a given Seal Variant Class have the same value “PPP.QM” (pointing to the class of related Seal forms M and V; the head entries are lexically adjacent in the source).
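As an illustration of the “TO-PPP.QMV” structure described above, the following minimal Python sketch splits a reference into its parts. The regular expression is the one in the Syntax field with capture groups added; the names `DycReference`, `parse_reference`, and `parse_value` are hypothetical and not part of the proposal, and the 0-9, A-Z, a-z ordering of Offset symbols is an assumption based on the way the range “0-9A-Za-z” is written.

```python
import re
from typing import List, NamedTuple

# Pattern from the Syntax field, with capture groups added for each part.
KDYC_REF = re.compile(r"([A-D])([0-9A-Za-z])-([0-9]{3})\.([1-4])([1-9AB])([0-5])")

# Offset symbols in assumed priority order: 0-9, then A-Z, then a-z.
OFFSET_SYMBOLS = (
    "0123456789"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
)


class DycReference(NamedTuple):
    """One kDYC reference (hypothetical helper type, for illustration only)."""
    match_type: str  # "T": A, B, C or D
    offset: int      # "O", converted to a 0-based rank within the Type
    page: int        # "PPP"
    quadrant: int    # "Q"
    main: str        # "M", kept as the raw symbol (1-9, A, B)
    variant: int     # "V": 0 for the Main Seal form, >0 for a Variant Seal form

    @property
    def variant_class(self) -> str:
        """The 'PPP.QMV' value shared by all members of a Variant Class."""
        return f"{self.page:03d}.{self.quadrant}{self.main}{self.variant}"

    @property
    def seal_variant_class(self) -> str:
        """The 'PPP.QM' value shared by a Main Seal form and its variants."""
        return f"{self.page:03d}.{self.quadrant}{self.main}"


def parse_reference(text: str) -> DycReference:
    """Parse a single reference such as 'A0-093.330'."""
    m = KDYC_REF.fullmatch(text)
    if m is None:
        raise ValueError(f"not a kDYC reference: {text!r}")
    t, o, ppp, q, main, v = m.groups()
    return DycReference(t, OFFSET_SYMBOLS.index(o), int(ppp), int(q), main, int(v))


def parse_value(value: str) -> List[DycReference]:
    """Parse a full space-delimited kDYC property value."""
    return [parse_reference(part) for part in value.split()]
```

For example, `parse_value("A0-093.330 B4-405.210 D0-100.122")` yields three references whose `variant_class` values are “093.330”, “405.210”, and “100.122”; the last has `variant == 2`, marking a Variant Seal form.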
-- Examples and Discussion --
The `kDYC` property values derive from a source file with 10,706 lines like the following:
#　PPP.QMV ; Ａ ; Ｂ?Ｃ!Ｄ
　048.410 ; 八 ; 丷捌?扒趴!儿𠘧人入𠔁
　062.250 ; 㕣 ; ?䛇!𠮦𠮥召叴台合公谷𠔌允只兄
　062.251 ; 𧮲 ; !䜭容䆟
　089.330 ; 言 ; 訁讠𢍬䇾𢍗𧩁?𥩭㖖𠲩𠱫𠲗!占舌𢀛音咅㕻𧥛𧥜𧥿心信𧧙讞𠷗𧧑誩𧨟譶𧭛𧮦
　093.330 ; 說 ; 説𧧘悅?𢛹!稅脫哾𠱕𧭚詋𧧗
　224.110 ; 入 ; !人久亠宀𠂉𠆢八丫仌从𠓜
　365.110 ; 人 ; 亻儿𤯔𠔽𠂊𠂋!入勹𡰣尸几八卜匕𠤎刀𠆢𠂉冖饣𠚤从仌㐺众𠈌
　404.410 ; 儿 ; 人!八入几兀丌兒
　405.210 ; 兌 ; 兑兊𠫞𠫨說悅恱閱銳㙂!兄允充兗党兇兒𠒆𠒋皃𠏮㟋𡷋㕣
　527.420 ; 沇 ; 渷兗兖䆓?𠵷!充吮㳘況涗
　527.421 ; 㕣 ; 沿㳂𡵴!𠮦
In the above example the first line (#) shows the 3-column source file syntax: ⑴ PPP.QMV (Variant Class); ⑵ Ａ (Type A); ⑶ Ｂ?Ｃ!Ｄ (Types B,C,D, delimited: “?” marks the start of Type C, and “!” marks the start of Type D). Each CJK Unified Ideograph (CJKUI) in each line of the source file is a member of that Variant Class, and receives a `kDYC` property value with a prefix (“TO-”) indicating its Type (A,B,C,D) and the Offset within that type. Processing the source file outputs `kDYC` values like the following (subset of the CJKUI in the above example; the source file is reproducible from the full set of `kDYC` values):
　㕣　U+3562 　kDYC　A0-062.250 A0-527.421 C0-570.410 D0-049.310 D0-570.220
　　　　　D1-058.230 D1-556.120 D2-059.460 D2-222.430 D5-057.150
　　　　　DD-405.210
　兊　U+514A 　kDYC　B1-405.210 D2-405.130 DS-738.310
　兌　U+514C 　kDYC　A0-405.210 D0-384.120 D0-472.340 D2-405.220 D2-405.310
　兑　U+5151 　kDYC　B0-405.210 D3-405.310
　悅　U+6085 　kDYC　B2-093.330 B5-405.210 C0-151.310
　稅　U+7A05 　kDYC　A0-326.420 D0-093.330
　說　U+8AAA 　kDYC　A0-093.330 B4-405.210 D0-100.122
　説　U+8AAC 　kDYC　B0-093.330
　銳　U+92B3 　kDYC　A0-707.380 B2-710.350 B8-405.210 C5-039.350 D0-263.340
　鋭　U+92ED 　kDYC　B0-707.380
　𠫞　U+20ADE　kDYC　B2-405.210 C3-707.380
　𠫨　U+20AE8　kDYC　B3-405.210 C2-707.380 D1-583.240
　𢛹　U+226F9　kDYC　C0-093.330
　𧧘　U+279D8　kDYC　B1-093.330 D1-100.360
The reference “A0-093.330” for U+8AAA (說) identifies that CJKUI as the best available match (Type=A) out of the various Hànzì associated with the third Main Seal form in quadrant 3 on
page 93 (DYC); the Variant Class is “093.330” (PPP.QMV). The reference “B0-093.330” for U+8AAC (説) identifies that character as the next best match (Type=B) in the same Variant Class. The Type difference here (A vs. B) reflects the fact that the “ears-in” (八) component of U+514C (兌) more closely matches the Seal form than does the “ears-out” (丷) component of U+5151 (兑); see below. A query with either U+8AAA (說) or U+8AAC (説) finds the same DYC page location (the same Variant Class). A query with the `kSimplifiedVariant` form U+8BF4 (说) would require folding to the corresponding `kTraditionalVariant` code point (see below).
The Offset “0” (zero) after the Type (A,B) in the reference prefixes (“A0-” and “B0-”) indicates that this is the first reference of that Type within this Variant Class; subsequent references have higher offsets (0-9A-Za-z). The higher Offset indicates more distant/indirect (less clear, more uncertain) relation to the head entry (or to a character of that Type with lower Offset in the Variant Class). Other characters in the 說/説 Variant Class (093.330) have higher Type Offsets (and in some cases are also assigned to other Variant Classes, and other Types).
The reference “A0-405.210” for U+514C (兌) is of Type=A, vs. “B0-405.210” for U+5151 (兑) of Type=B; other CJKUI variants in this Variant Class (405.210) include old/rare related forms such as U+514A (兊), U+20ADE (𠫞), and U+20AE8 (𠫨). A query with any Variant Class member finds the Variant Class. The specific associated dictionary sources provide explanations of the historical semantic relations.
Each Variant Class has exactly one Type=A property value (though multiple values are allowed by the syntax). In the few cases where there is no “A0-” value, a value such as “A1-” (with offset “1” or higher) indicates the possible need for future encoding of a new CJKUI, or the need to update this data for newly encoded characters (on the basis of newly available dictionary mappings).
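The step from a source-file line to `kDYC` values, as described in the discussion above, can be sketched roughly as follows. This is a minimal illustration and not the author's actual tooling: the function name `expand_source_line` is hypothetical, Offsets are assumed to be assigned purely by position within each Type (which agrees with the sample values shown), and the special “A1-” case with no “A0-” member is not modelled.

```python
from typing import List, Tuple

# Offset symbols in assumed sequential order: 0-9, then A-Z, then a-z.
OFFSET_SYMBOLS = (
    "0123456789"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
)


def expand_source_line(line: str) -> List[Tuple[str, str]]:
    """Expand one 'PPP.QMV ; A ; B?C!D' line into (character, reference) pairs.

    Assumption: the Offset of a character is simply its position within
    its Type's run of characters on the line.
    """
    variant_class, type_a, rest = (field.strip() for field in line.split(";"))
    # In the third column, "!" starts Type D and "?" starts Type C.
    b_and_c, _, type_d = rest.partition("!")
    type_b, _, type_c = b_and_c.partition("?")
    pairs = []
    for type_letter, members in zip("ABCD", (type_a, type_b, type_c, type_d)):
        # Iterating a Python str yields whole code points, so
        # supplementary-plane ideographs count as one character each.
        for position, char in enumerate(members):
            prefix = f"{type_letter}{OFFSET_SYMBOLS[position]}"
            pairs.append((char, f"{prefix}-{variant_class}"))
    return pairs
```

Applied to the “405.210” line of the sample, this sketch gives “A0-405.210” for 兌, “B0-405.210” for 兑, “B4-405.210” for 說, and “DD-405.210” for 㕣, matching the values listed in the example.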
Assignment to and Offset within Types (B,C,D) can be rather free; Type D in particular sometimes includes cross-references (as for bound morphemes), or component usage examples (doubling, tripling, quadrupling, etc.).
Note that U+3562 (㕣) is assigned to “A0-” in two different Variant Classes (it is a Main form in its own right, and also a variant of the Main form 沇); for simplicity its Type D variants are given in full only at the first instance; likewise, 鋭 is not included with 銳 in “405.210” (but both are included in “707.380”); 税 (simple form) is omitted from the 稅 Variant Class (but is a derivable member of that Class, see below).
The character identifications and Variant Class assignments reflect traditional associations between (Sòng-style) CJKUI and Shuōwén (SW) Seal forms, with primary reference to DYC and the Chinese dictionaries `kHanYu` (often citing DYC) and `kSBGY` (often citing SW), supplemented in some cases by other dictionaries such as `kKangXi` (China), and `kMorohashi` (Japan). Where explicit dictionary mappings have not been available in published Unihan property data (or the print sources have been unavailable), the relations sometimes derive from other sources (such as ROC《異體字字典》, 《中文大辭典》), or are simply inferred from the character form.
The Variant Class assignments in this data are useful for queries on a wide variety of old and cross-locale texts. For example, U+8AAC (説) serves as the traditional form in both PRC (`kGB1`) and Japan (`kJis0`), but U+8AAA (說) is the `kTraditionalVariant` form in use in ROC (`kBigFive`); see below. Users in different locales will be able to use this data to find the appropriate DYC dictionary location, and to explore related characters. The Variant Class mappings in this data are useful to supplement and improve `kZVariant` data, and for input-method editor (IME) and spell-check development.
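A cross-locale query of the kind described above might be sketched as follows. This is a hedged illustration, not a reference implementation: `KDYC_SAMPLE` holds a few values copied from the example, `TRADITIONAL_SAMPLE` is a hand-entered stand-in for `kTraditionalVariant` data (not copied from Unihan), and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set

# A few kDYC values copied from the example above (subset only).
KDYC_SAMPLE: Dict[str, str] = {
    "說": "A0-093.330 B4-405.210 D0-100.122",
    "説": "B0-093.330",
    "悅": "B2-093.330 B5-405.210 C0-151.310",
    "兌": "A0-405.210 D0-384.120 D0-472.340 D2-405.220 D2-405.310",
    "兑": "B0-405.210 D3-405.310",
}

# Illustrative stand-in for kTraditionalVariant data (an assumption).
TRADITIONAL_SAMPLE: Dict[str, List[str]] = {"说": ["說"]}


def variant_classes(
    char: str,
    kdyc: Dict[str, str],
    traditional: Optional[Dict[str, List[str]]] = None,
) -> Set[str]:
    """Return the 'PPP.QMV' Variant Classes reachable from a character.

    A character without its own kDYC value is folded through the optional
    simplified-to-traditional mapping before lookup.
    """
    if char in kdyc:
        candidates = [char]
    else:
        candidates = (traditional or {}).get(char, [])
    classes = set()
    for candidate in candidates:
        for ref in kdyc.get(candidate, "").split():
            classes.add(ref.split("-", 1)[1])
    return classes


def build_class_index(kdyc: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each Variant Class to its members, best match first.

    The two-character prefix sorts correctly as a plain string because
    digits, uppercase and lowercase letters are already in that order.
    """
    members = defaultdict(list)
    for char, value in kdyc.items():
        for ref in value.split():
            prefix, variant_class = ref.split("-", 1)
            members[variant_class].append((prefix, char))
    return {vc: [c for _, c in sorted(pairs)] for vc, pairs in members.items()}
```

With this sample data, 說 and 説 share the class “093.330”, the simplified form 说 reaches the same class only when the folding table is supplied, and `build_class_index(KDYC_SAMPLE)["405.210"]` lists 兌 (Type A) ahead of 兑, 說, and 悅 (Type B).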
Queries with characters of one Type are often useful for locating characters of another Type; a user not knowing how to find the exact character in question may nevertheless find it by searching for a familiar, similar, related, or incorrect form. The properties of a common form provide access to uncommon forms. Users inputting a given text may not be aware that close/distinct encoded variants exist, and two users inputting that text might unwittingly or accidentally input different strings, reconcilable with this data.
There is a great variety of historical and locale-specific variation in CJK texts, sometimes non-distinctive for common or cross-locale purposes. The many encoded CJKUI variants reflect the development of the character set over thousands of years, with forms which originated for one purpose later being replaced by other forms, or reused or augmented for other purposes. This data is useful for digitizing old texts, for exploring the historical development of the Hàn writing system, and for studying this influential Qīng Dynasty commentary edition (DYC) of the Eastern Hàn Dynasty dictionary Shuōwén (121 CE).
This data directly maps ~40,000 CJKUI to exactly 10,706 DYC head entries associating Sòng-style Hànzì (宋體漢字) with the equivalent Seal forms (篆文); there are 10,706 unique references (Variant Classes) in the form “PPP.QMV”.
• Most mappings are many-to-one: multiple CJKUI map to each Seal form (multiple Hànzì belong to the same Variant Class); each CJKUI can occur in multiple Variant Classes (with different Types; there are ~61,800 references total, for the various Types).
• The Source Separation Rule (R1, obsolete since 1992; see Section 18.1, Han, in [Unicode]) accounts for the separate encoding of many characters which otherwise might have been unified (such as 兑/兌, 说/説/說).
• Mappings for simplified-only characters (CJKUI used only in PRC simple-form texts, and not occurring in traditional texts) are for the most part excluded from this data (for example, 说), since such mappings are derivable from other Unihan properties (`kSimplifiedVariant`, `kTraditionalVariant`). But where PRC traditional forms (as defined in `kHanYu` and 《汉语大词典》) differ from the ROC forms (`kBigFive`), both forms are included (for example, 兑/兌). Mappings for all `kCompatibility` characters are also derivable and so excluded.
• This data reflects some three decades of development (as of January 2020), with refinement and extension for ongoing CJKUI encoding Extensions; for further details see (Cook 2003).
-- Sources --
《說文解字‧注》 Shuō Wén Jiě Zì – Zhù 〔東漢〕許慎著〔清〕段玉裁注; 上海 (瑞金二路 272 號): 上海古籍出版社, 1981. [ 1988, 1989, 1998 (9th printing); ISBN: 7-5325-0487 5/H.6; corrected/pointed/indexed reproduction of the original text (經韻樓藏版, 1813-1815), in one volume (752 pages, + appendices); this edition inserts an equivalent Sòng-style character in the upper margin of each page quadrant, above the corresponding Seal form; these forms also appear in the Sòng-style body text and appended Radical-Stroke chart; for a version of the unpointed original text in 15 volumes (~3332 pages), see <http://www.wul.waseda.ac.jp/kotenseki/html/ho04/ho04_00026_0001/index.html> ]
《說文解字‧電子版》 Shuō Wén Jiě Zì – Diànzǐ Bǎn: Digital Recension of the Eastern Hàn Chinese Grammaticon; Cook, Richard S.; UC Berkeley, Dept. of Linguistics, 2003. [ 2009; STEDT Monograph #9, in 4 vols.; ISBN 0-944613-48-9; <http://linguistics.berkeley.edu/~rscook/html/writing.html#EHC> ]