UAX #38: Unicode Han Database (Unihan)

kDYC Property Record (Proposed Draft)

Version Unicode 14.0.0 (Proposed)
Source Richard Cook 曲理查
Date 2020-01-28

Introductory Notes for Reviewers

Property kDYC
Status Provisional
Category Dictionary Indices
Introduced 14.0
Delimiter space
Syntax [A-D][0-9A-Za-z]-[0-9]{3}\.[1-4][1-9AB][0-5]
Description The page position (typologized variant class) of this character, in the framework of the Chinese dictionary Shuō Wén Jiě Zì – Zhù by Duàn Yùcái (DYC).

The character references have the form “TO-PPP.QMV”, in which: “T” = Type of match (A-D); “O” = Offset within that Type (0-9A-Za-z); “PPP.QMV” identifies the Variant Class, in which: “PPP” = Page reference, 3-digit zero-padded (001..752); “Q” = Quadrant on the page (1-4); “M” = offset within Q of the Main Seal form (正文, lexical head entry); “V” > 0 indicates a Variant Seal form (重文, variant of M).
  • Type (A-D): “A” = best available match (CJKUI most closely matching M, stroke-for-stroke); “B” = proper match (non-best, proper variant, closely/directly related), interchangeable or synonymous usage (同用); “C” = questionable match (possible/probable, unclearly/indirectly related); “D” = improper match (confusable/confused old/modern forms, lexical-source separation, cross-reference, component-form, etc.), incorrect or common substitution (通用).
  • Offset (0-9A-Za-z): sequential alphanumeric index within a given Type; a lower index has higher priority for the Type (“B0-” bears more proper/direct relation to M than does “B1-”; a common kBigFive form not in Type=A will have the “B0-” prefix).
  • Variant Class (PPP.QMV): All members of a given Variant Class have the same value “PPP.QMV” (pointing to the dictionary head entry for a single Seal form; see below); all members of a given Seal Variant Class have the same value “PPP.QM” (pointing to the class of related Seal forms M and V; the head entries are lexically adjacent in the source).
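The field layout described above can be checked mechanically. The following is a minimal sketch, not part of the proposal: the names `DycRef` and `parse_kdyc` are hypothetical, the regular expression is the one given in the Syntax field, and the "M" field is kept as a character because the record does not spell out how "A" and "B" map to numbers.

```python
import re
from typing import NamedTuple

# Pattern copied from the Syntax field, with one capture group per sub-field.
KDYC_RE = re.compile(r"([A-D])([0-9A-Za-z])-([0-9]{3})\.([1-4])([1-9AB])([0-5])")


class DycRef(NamedTuple):
    """One kDYC reference of the form TO-PPP.QMV (hypothetical helper type)."""
    match_type: str  # T: A, B, C, or D
    offset: str      # O: 0-9A-Za-z, lower means higher priority within T
    page: int        # PPP: page number
    quadrant: int    # Q: 1-4
    main: str        # M: position of the Main Seal form within Q (kept as text)
    variant: int     # V: 0 for the Main form, > 0 for a Variant Seal form

    @property
    def variant_class(self) -> str:
        # "PPP.QMV", shared by all members of one Variant Class
        return f"{self.page:03d}.{self.quadrant}{self.main}{self.variant}"

    @property
    def seal_class(self) -> str:
        # "PPP.QM", shared by all members of one Seal Variant Class
        return self.variant_class[:-1]


def parse_kdyc(value: str) -> DycRef:
    """Split a single kDYC value into its fields, or raise ValueError."""
    m = KDYC_RE.fullmatch(value)
    if m is None:
        raise ValueError(f"not a kDYC value: {value!r}")
    t, o, page, quadrant, main, variant = m.groups()
    return DycRef(t, o, int(page), int(quadrant), main, int(variant))


ref = parse_kdyc("A0-093.330")
print(ref.variant_class, ref.seal_class, ref.variant == 0)  # 093.330 093.33 True
```

The pattern alone does not enforce the stated page range (001..752); a stricter validator would add that check separately.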
-- Examples and Discussion --

The kDYC property values derive from a source file with 10,706 lines like the following:
# PPP.QMV ; A ; B?C!D
  048.410 ; 八 ; 丷捌?扒趴!儿𠘧人入𠔁
  062.250 ; 㕣 ; ?䛇!𠮦𠮥召叴台合公谷𠔌允只兄
  062.251 ; 𧮲 ; !䜭容䆟
  089.330 ; 言 ; 訁讠𢍬䇾𢍗𧩁?𥩭㖖𠲩𠱫𠲗!占舌𢀛音咅㕻𧥛𧥜𧥿心信𧧙讞𠷗𧧑誩𧨟譶𧭛𧮦
  093.330 ; 說 ; 説𧧘悅?𢛹!稅脫哾𠱕𧭚詋𧧗
  224.110 ; 入 ; !人久亠宀𠂉𠆢八丫仌从𠓜
  365.110 ; 人 ; 亻儿𤯔𠔽𠂊𠂋!入勹𡰣尸几八卜匕𠤎刀𠆢𠂉冖饣𠚤从仌㐺众𠈌
  404.410 ; 儿 ; 人!八入几兀丌兒
  405.210 ; 兌 ; 兑兊𠫞𠫨說悅恱閱銳㙂!兄允充兗党兇兒𠒆𠒋皃𠏮㟋𡷋㕣
  527.420 ; 沇 ; 渷兗兖䆓?𠵷!充吮㳘況涗
  527.421 ; 㕣 ; 沿㳂𡵴!𠮦
In the above example the first line (#) shows the 3-column source file syntax: ⑴ PPP.QMV (Variant Class); ⑵ A (Type A); ⑶ B?C!D (Types B, C, D, delimited: “?” marks the start of Type C, and “!” marks the start of Type D). Each CJK Unified Ideograph (CJKUI) in each line of the source file is a member of that Variant Class, and receives a kDYC property value with a prefix (“TO-”) indicating its Type (A, B, C, D) and the Offset within that Type. Processing the source file outputs kDYC values like the following (a subset of the CJKUI in the above example; the source file is reproducible from the full set of kDYC values):
 㕣 U+3562  kDYC A0-062.250 A0-527.421 C0-570.410 D0-049.310 D0-570.220
                D1-058.230 D1-556.120 D2-059.460 D2-222.430 D5-057.150
                DD-405.210
 兊 U+514A  kDYC B1-405.210 D2-405.130 DS-738.310
 兌 U+514C  kDYC A0-405.210 D0-384.120 D0-472.340 D2-405.220 D2-405.310
 兑 U+5151  kDYC B0-405.210 D3-405.310
 悅 U+6085  kDYC B2-093.330 B5-405.210 C0-151.310
 稅 U+7A05  kDYC A0-326.420 D0-093.330
 說 U+8AAA  kDYC A0-093.330 B4-405.210 D0-100.122
 説 U+8AAC  kDYC B0-093.330
 銳 U+92B3  kDYC A0-707.380 B2-710.350 B8-405.210 C5-039.350 D0-263.340
 鋭 U+92ED  kDYC B0-707.380
 𠫞 U+20ADE kDYC B2-405.210 C3-707.380
 𠫨 U+20AE8 kDYC B3-405.210 C2-707.380 D1-583.240
 𢛹 U+226F9 kDYC C0-093.330
 𧧘 U+279D8 kDYC B1-093.330 D1-100.360
  • The reference “A0-093.330” for U+8AAA (說) identifies that CJKUI as the best available match (Type=A) out of the various Hànzì associated with the third Main Seal form in quadrant 3 on page 93 (DYC); the Variant Class is “093.330” (PPP.QMV).
  • The reference “B0-093.330” for U+8AAC (説) identifies that character as the next best match (Type=B) in the same Variant Class.
  • The Type difference here (A vs. B) reflects the fact that the “ears-in” (八) component form U+514C (兌), as in 說, more closely matches the Seal form than does the “ears-out” (丷) component form U+5151 (兑), as in 説; see below.
  • A query with either U+8AAA (說) or U+8AAC (説) finds the same DYC page location (the same Variant Class). A query with the kSimplifiedVariant form U+8BF4 (说) would require folding to the corresponding kTraditionalVariant code point (see below).
  • The Offset “0” (zero) after the Type (A,B) in the reference prefixes (“A0-” and “B0-”) indicates that this is the first reference of that Type within this Variant Class; subsequent references have higher offsets (0-9A-Za-z). The higher Offset indicates more distant/indirect (less clear, more uncertain) relation to the head entry (or to a character of that Type with lower Offset in the Variant Class). Other characters in the 說/説 Variant Class (093.330) have higher Type Offsets (and in some cases are also assigned to other Variant Classes, and other Types).
  • The reference “A0-405.210” for U+514C (兌) is of Type=A, vs. “B0-405.210” for U+5151 (兑) of Type=B; other CJKUI variants in this Variant Class (405.210) include old/rare related forms such as U+514A (兊), U+20ADE (𠫞), and U+20AE8 (𠫨). A query with any Variant Class member finds the Variant Class. The specific associated dictionary sources provide explanations of the historical semantic relations.
  • Each Variant Class has exactly one Type=A property value (though multiple values are allowed by the syntax). In the few cases where there is no “A0-” value, a value such as “A1-” (with offset “1” or higher) indicates the possible need for future encoding of a new CJKUI, or the need to update this data for newly encoded characters (on the basis of newly available dictionary mappings). Assignment to and Offset within Types (B,C,D) can be rather free; Type D in particular sometimes includes cross-references (as for bound morphemes), or component usage examples (doubling, tripling, quadrupling, etc.).
  • Note that U+3562 (㕣) is assigned to “A0-” in two different Variant Classes (it is a Main form in its own right, and also a variant of the Main form 沇); for simplicity its Type D variants are given in full only at the first instance; likewise, 鋭 is not included with 銳 in “405.210” (but both are included in “707.380”); 税 (simple form) is omitted from the 稅 Variant Class (but is a derivable member of that Class, see below).
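The step from a source-file line to kDYC values can be sketched as follows. This is an illustration under stated assumptions, not the author's actual tool: the function name `kdyc_values` is hypothetical, Offsets are assumed to be assigned strictly in order of appearance within each Type using the alphabet 0-9A-Za-z, and the special case of a Variant Class whose first Type A value is “A1-” is not handled.

```python
import string

# Offset alphabet in priority order: 0-9, then A-Z, then a-z (62 symbols).
OFFSETS = string.digits + string.ascii_uppercase + string.ascii_lowercase


def kdyc_values(line):
    """Yield (character, kDYC value) pairs for one 'PPP.QMV ; A ; B?C!D' line."""
    variant_class, type_a, rest = (field.strip() for field in line.split(";"))
    # "!" starts Type D; within what precedes it, "?" starts Type C.
    b_and_c, _, type_d = rest.partition("!")
    type_b, _, type_c = b_and_c.partition("?")
    for type_letter, members in zip("ABCD", (type_a, type_b, type_c, type_d)):
        # Iterating a Python str yields whole code points, so characters
        # outside the BMP are handled one at a time as well.
        for index, ch in enumerate(members):
            # Raises IndexError if a Type has more than 62 members.
            yield ch, f"{type_letter}{OFFSETS[index]}-{variant_class}"


EXAMPLE = "093.330 ; 說 ; 説𧧘悅?𢛹!稅脫哾𠱕𧭚詋𧧗"
for ch, value in kdyc_values(EXAMPLE):
    print(f"U+{ord(ch):04X}", value)
```

Under these assumptions the sketch reproduces the values listed above for this Variant Class, for example “A0-093.330” for U+8AAA and “B1-093.330” for U+279D8.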
The character identifications and Variant Class assignments reflect traditional associations between (Sòng-style) CJKUI and Shuōwén (SW) Seal forms, with primary reference to DYC and the Chinese dictionaries kHanYu (often citing DYC) and kSBGY (often citing SW), supplemented in some cases by other dictionaries such as kKangXi (China), and kMorohashi (Japan). Where explicit dictionary mappings have not been available in published Unihan property data (or the print sources have been unavailable), the relations sometimes derive from other sources (such as ROC《異體字字典》, 《中文大辭典》), or are simply inferred from the character form.

The Variant Class assignments in this data are useful for queries on a wide variety of old and cross-locale texts. For example, U+8AAC (説) serves as the traditional form in both PRC (kGB1) and Japan (kJis0), but U+8AAA (說) is the kTraditionalVariant form in use in ROC (kBigFive); see below. Users in different locales will be able to use this data to find the appropriate DYC dictionary location, and to explore related characters. The Variant Class mappings in this data are useful to supplement and improve kZVariant data, and for input-method editor (IME) and spell-check development. Queries with characters of one Type are often useful for locating characters of another Type; a user not knowing how to find the exact character in question may nevertheless find it by searching for a familiar, similar, related, or incorrect form. The properties of a common form provide access to uncommon forms. Users inputting a given text may not be aware that close/distinct encoded variants exist, and two users inputting that text might unwittingly or accidentally input different strings, reconcilable with this data. There is a great variety of historical and locale-specific variation in CJK texts, sometimes non-distinctive for common or cross-locale purposes. The many encoded CJKUI variants reflect the development of the character set over thousands of years, with forms which originated for one purpose later being replaced by other forms, or reused or augmented for other purposes. This data is useful for digitizing old texts, for exploring the historical development of the Hàn writing system, and for studying this influential Qīng Dynasty commentary edition (DYC) of the Eastern Hàn Dynasty dictionary Shuōwén (121 CE).
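A query of the kind described above, from a familiar form to its related forms, might look like the following sketch. The names `build_index` and `classmates` are hypothetical, and the small `KDYC` table holds only a few of the sample values shown earlier rather than the full data set.

```python
from collections import defaultdict

# A few sample kDYC values from the examples above (not the full data).
KDYC = {
    "說": "A0-093.330 B4-405.210 D0-100.122",
    "説": "B0-093.330",
    "悅": "B2-093.330 B5-405.210 C0-151.310",
    "稅": "A0-326.420 D0-093.330",
    "兌": "A0-405.210 D0-384.120 D0-472.340 D2-405.220 D2-405.310",
    "兑": "B0-405.210 D3-405.310",
}


def build_index(kdyc):
    """Map each Variant Class (PPP.QMV) to its (prefix, character) members."""
    index = defaultdict(list)
    for ch, field in kdyc.items():
        for value in field.split():
            prefix, variant_class = value.split("-")
            index[variant_class].append((prefix, ch))
    for members in index.values():
        # ASCII order (digits < A-Z < a-z) matches the Offset order 0-9A-Za-z,
        # so a plain sort ranks members by Type and then by Offset.
        members.sort()
    return index


def classmates(ch, kdyc, index, types="ABCD"):
    """List other characters sharing any Variant Class with ch, by Type."""
    found = []
    for value in kdyc.get(ch, "").split():
        variant_class = value.split("-")[1]
        for prefix, other in index[variant_class]:
            if other != ch and prefix[0] in types and other not in found:
                found.append(other)
    return found


index = build_index(KDYC)
print(index["093.330"])
print(classmates("説", KDYC, index, types="AB"))
```

Restricting `types` to "AB" keeps only the best and proper matches; including "D" would also surface confusable or commonly substituted forms, which is the behaviour a spell-checker or IME might want.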

This data directly maps ~40,000 CJKUI to exactly 10,706 DYC head entries associating Sòng-style Hànzì (宋體漢字) with the equivalent Seal forms (篆文); there are 10,706 unique references (Variant Classes) in the form “PPP.QMV”.
  • Most mappings are many-to-one: multiple CJKUI map to each Seal form (multiple Hànzì belong to the same Variant Class); each CJKUI can occur in multiple Variant Classes (with different Types; there are ~61,800 references total, for the various Types).
  • The Source Separation Rule (R1, obsolete since 1992; see Section 18.1, Han, in [Unicode]) accounts for the separate encoding of many characters which otherwise might have been unified (such as 兑/兌, 说/説/說).
  • Mappings for simplified-only characters (CJKUI used only in PRC simple-form texts, and not occurring in traditional texts) are for the most part excluded from this data (for example, 说), since such mappings are derivable from other Unihan properties (kSimplifiedVariant, kTraditionalVariant). But where PRC traditional forms (as defined in kHanYu and 《汉语大词典》) differ from the ROC forms (kBigFive), both forms are included (for example, 兑/兌). Mappings for all kCompatibility characters are also derivable and so excluded.
  • This data reflects some three decades of development (as of January, 2020), with refinement and extension for on-going CJKUI encoding Extensions; for further details see Cook(2003).
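Because simplified-only and compatibility characters are excluded, a lookup has to fold such input before consulting kDYC. The sketch below shows one way this might be done; the names `fold` and `lookup_kdyc` are hypothetical, the two tables are tiny illustrative samples rather than real Unihan extracts, and a single kTraditionalVariant value per character is assumed (the real property can list several).

```python
import unicodedata

# Illustrative samples only; real values would be read from the Unihan database.
K_TRADITIONAL_VARIANT = {"说": "說"}  # U+8BF4 -> U+8AAA
KDYC = {"說": "A0-093.330 B4-405.210 D0-100.122"}


def fold(ch):
    """Fold a character toward a form that may carry kDYC values."""
    # CJK compatibility ideographs have singleton canonical decompositions,
    # so NFC normalization replaces them with their unified ideographs.
    ch = unicodedata.normalize("NFC", ch)
    # Assumption: at most one kTraditionalVariant value per character.
    return K_TRADITIONAL_VARIANT.get(ch, ch)


def lookup_kdyc(ch):
    """Return the kDYC values for ch, folding only if ch has none itself."""
    if ch in KDYC:
        return KDYC[ch].split()
    return KDYC.get(fold(ch), "").split()


print(lookup_kdyc("说"))  # values found via the folded form U+8AAA
```

Trying the character itself before folding keeps forms such as 兑 and 兌, which both carry their own values, from being merged.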

-- Sources --
  • 《說文解字‧注》 Shuō Wén Jiě Zì – Zhù 〔東漢〕許慎著〔清〕段玉裁注; 上海 (瑞金二路 272 號): 上海古籍出版社, 1981. [ 1988, 1989, 1998 (9th printing); ISBN: 7-5325-0487 5/H.6; corrected/pointed/indexed reproduction of the original text (經韻樓藏版, 1813-1815), in one volume (752 pages, + appendices); this edition inserts an equivalent Sòng-style character in the upper margin of each page quadrant, above the corresponding Seal form; these forms also appear in the Sòng-style body text and appended Radical-Stroke chart; for a version of the unpointed original text in 15 volumes (~3332 pages), see <http://www.wul.waseda.ac.jp/kotenseki/html/ho04/ho04_00026_0001/index.html> ]
  • 《說文解字‧電子版》 Shuō Wén Jiě Zì – Diànzǐ Bǎn: Digital Recension of the Eastern Hàn Chinese Grammaticon; Cook, Richard S.; UC Berkeley, Dept. of Linguistics, 2003. [ 2009; STEDT Monograph #9, in 4 vols.; ISBN 0-944613-48-9; <http://linguistics.berkeley.edu/~rscook/html/writing.html#EHC> ]


























