Version | Unicode 14.0.0 (Proposed) |
Source | Richard Cook 曲理查 |
Date | 2020-01-28 |
Property | kDYC |
Status | Provisional |
Category | Dictionary Indices |
Introduced | 14.0 |
Delimiter | space |
Syntax | [A-D][0-9A-Za-z]-[0-9]{3}\.[1-4][1-9AB][0-5] |
Description | The page position (typologized variant class) of this character, in the framework of the Chinese dictionary Shuō Wén Jiě Zì – Zhù by Duàn Yùcái (DYC).
The character references have the form “TO-PPP.QMV”, in which:
• “T” = Type of match (A-D);
• “O” = Offset within that Type (0-9A-Za-z);
• “PPP.QMV” identifies the Variant Class, in which “PPP” = Page reference, 3-digit zero-padded (001..752); “Q” = Quadrant on the page (1-4); “M” = offset within Q of the Main Seal form (正文, lexical head entry; 1-9, A, B); “V” = variant index (0-5), where “V” > 0 indicates a Variant Seal form (重文, variant of M).
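The field layout above can be checked mechanically. The following is a minimal sketch, not part of the proposal: it reuses the pattern from the Syntax field with capture groups added, and the names `DycRef` and `parse_kdyc` are hypothetical. The page-range check assumes that 001..752 is a hard limit rather than just a description of the current data.

```python
import re
from typing import NamedTuple

# The pattern from the Syntax field, with one capture group per component.
KDYC_VALUE = re.compile(
    r"([A-D])([0-9A-Za-z])-([0-9]{3})\.([1-4])([1-9AB])([0-5])"
)


class DycRef(NamedTuple):
    """One kDYC reference, split into its components (hypothetical type)."""
    match_type: str  # "T": Type of match, A-D
    offset: str      # "O": Offset within that Type
    page: int        # "PPP": page reference
    quadrant: int    # "Q": quadrant on the page, 1-4
    main: str        # "M": offset of the Main Seal form within Q
    variant: int     # "V": greater than 0 for a Variant Seal form

    @property
    def variant_class(self) -> str:
        """Rebuild the "PPP.QMV" Variant Class reference."""
        return f"{self.page:03d}.{self.quadrant}{self.main}{self.variant}"


def parse_kdyc(value: str) -> DycRef:
    """Parse a single "TO-PPP.QMV" value, raising ValueError if malformed."""
    m = KDYC_VALUE.fullmatch(value)
    if m is None:
        raise ValueError(f"not a kDYC value: {value!r}")
    t, o, ppp, q, main, v = m.groups()
    page = int(ppp)
    # Assumption: pages outside 001..752 are treated as errors.
    if not 1 <= page <= 752:
        raise ValueError(f"page out of range in {value!r}")
    return DycRef(t, o, page, int(q), main, int(v))


if __name__ == "__main__":
    ref = parse_kdyc("D0-100.122")
    print(ref.variant_class, ref.variant > 0)  # prints: 100.122 True
```

Because the property's delimiter is a space, a full property value can be handled by splitting on whitespace and passing each piece to `parse_kdyc`.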
The kDYC property values derive from a source file with 10,706 lines like the following:

# PPP.QMV ; A ; B?C!D
048.410 ; 八 ; 丷捌?扒趴!儿𠘧人入𠔁
062.250 ; 㕣 ; ?䛇!𠮦𠮥召叴台合公谷𠔌允只兄
062.251 ; 𧮲 ; !䜭容䆟
089.330 ; 言 ; 訁讠𢍬䇾𢍗𧩁?𥩭㖖𠲩𠱫𠲗!占舌𢀛音咅㕻𧥛𧥜𧥿心信𧧙讞𠷗𧧑誩𧨟譶𧭛𧮦
093.330 ; 說 ; 説𧧘悅?𢛹!稅脫哾𠱕𧭚詋𧧗
224.110 ; 入 ; !人久亠宀𠂉𠆢八丫仌从𠓜
365.110 ; 人 ; 亻儿𤯔𠔽𠂊𠂋!入勹𡰣尸几八卜匕𠤎刀𠆢𠂉冖饣𠚤从仌㐺众𠈌
404.410 ; 儿 ; 人!八入几兀丌兒
405.210 ; 兌 ; 兑兊𠫞𠫨說悅恱閱銳㙂!兄允充兗党兇兒𠒆𠒋皃𠏮㟋𡷋㕣
527.420 ; 沇 ; 渷兗兖䆓?𠵷!充吮㳘況涗
527.421 ; 㕣 ; 沿㳂𡵴!𠮦

In the above example the first line (#) shows the 3-column source file syntax: ⑴ PPP.QMV (Variant Class); ⑵ A (Type A); ⑶ B?C!D (Types B, C, D, delimited: “?” marks the start of Type C, and “!” marks the start of Type D). Each CJK Unified Ideograph (CJKUI) in each line of the source file is a member of that Variant Class, and receives a kDYC property value with a prefix (“TO-”) indicating its Type (A, B, C, D) and the Offset within that Type. Processing the source file outputs kDYC values like the following (subset of the CJKUI in the above example; the source file is reproducible from the full set of kDYC values):

㕣 U+3562 kDYC A0-062.250 A0-527.421 C0-570.410 D0-049.310 D0-570.220 D1-058.230 D1-556.120 D2-059.460 D2-222.430 D5-057.150 DD-405.210
兊 U+514A kDYC B1-405.210 D2-405.130 DS-738.310
兌 U+514C kDYC A0-405.210 D0-384.120 D0-472.340 D2-405.220 D2-405.310
兑 U+5151 kDYC B0-405.210 D3-405.310
悅 U+6085 kDYC B2-093.330 B5-405.210 C0-151.310
稅 U+7A05 kDYC A0-326.420 D0-093.330
說 U+8AAA kDYC A0-093.330 B4-405.210 D0-100.122
説 U+8AAC kDYC B0-093.330
銳 U+92B3 kDYC A0-707.380 B2-710.350 B8-405.210 C5-039.350 D0-263.340
鋭 U+92ED kDYC B0-707.380
𠫞 U+20ADE kDYC B2-405.210 C3-707.380
𠫨 U+20AE8 kDYC B3-405.210 C2-707.380 D1-583.240
𢛹 U+226F9 kDYC C0-093.330
𧧘 U+279D8 kDYC B1-093.330 D1-100.360
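The processing step described above can be sketched as follows. This is an illustration under stated assumptions, not the author's actual tooling: the function names are hypothetical, and the Offset is assumed to be the zero-based position of the character within its Type, written with the digit sequence 0-9, A-Z, a-z. That ordering is inferred from the Syntax field and from sample values such as DD-405.210; the text does not state it explicitly.

```python
# Assumed digit order for the Offset ("O") component: 0-9, then A-Z, then a-z.
OFFSET_DIGITS = (
    "0123456789"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
)

# One line copied from the source-file example in this document.
EXAMPLE = "093.330 ; 說 ; 説𧧘悅?𢛹!稅脫哾𠱕𧭚詋𧧗"


def split_types(line):
    """Split a source line into its Variant Class and its Type A-D strings."""
    variant_class, type_a, rest = [field.strip() for field in line.split(";")]
    # "!" marks the start of Type D and "?" the start of Type C;
    # either marker may be absent, which leaves that Type empty.
    rest, _, type_d = rest.partition("!")
    type_b, _, type_c = rest.partition("?")
    return variant_class, {"A": type_a, "B": type_b, "C": type_c, "D": type_d}


def kdyc_values(line):
    """Return (character, "TO-PPP.QMV") pairs for one source line."""
    if line.lstrip().startswith("#"):
        return []  # header/comment line
    variant_class, types = split_types(line)
    pairs = []
    for type_letter, chars in types.items():
        # Iterating a Python str yields whole code points, so characters
        # outside the BMP are counted once each.
        for position, ch in enumerate(chars):
            offset = OFFSET_DIGITS[position]  # IndexError past 62 members
            pairs.append((ch, f"{type_letter}{offset}-{variant_class}"))
    return pairs


if __name__ == "__main__":
    for ch, value in kdyc_values(EXAMPLE):
        print(f"{ch} U+{ord(ch):04X} {value}")
```

Under these assumptions the sketch yields A0-093.330 for 說, B0-093.330 for 説, and C0-093.330 for 𢛹, which agrees with the sample output listed above. Collecting the pairs per character across all source lines would give the full space-delimited property value.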
The Variant Class assignments in this data are useful for queries on a wide variety of old and cross-locale texts. For example, U+8AAC (説) serves as the traditional form in both PRC (kGB1) and Japan (kJis0), but U+8AAA (說) is the kTraditionalVariant form in use in ROC (kBigFive); see below. Users in different locales will be able to use this data to find the appropriate DYC dictionary location, and to explore related characters.

The Variant Class mappings in this data are useful to supplement and improve kZVariant data, and for input-method editor (IME) and spell-check development. Queries with characters of one Type are often useful for locating characters of another Type: a user who does not know how to find the exact character in question may nevertheless find it by searching for a familiar, similar, related, or incorrect form. The properties of a common form thus provide access to uncommon forms. Users inputting a given text may not be aware that close but distinctly encoded variants exist, and two users inputting that text might unwittingly or accidentally input different strings, which this data can reconcile.

There is a great variety of historical and locale-specific variation in CJK texts, and this variation is sometimes non-distinctive for common or cross-locale purposes. The many encoded CJKUI variants reflect the development of the character set over thousands of years, with forms which originated for one purpose later being replaced by other forms, or reused or augmented for other purposes. This data is useful for digitizing old texts, for exploring the historical development of the Hàn writing system, and for studying this influential Qīng Dynasty commentary edition (DYC) of the Eastern Hàn Dynasty dictionary Shuōwén (121 CE).

This data directly maps ~40,000 CJKUI to exactly 10,706 DYC head entries, associating Sòng-style Hànzì (宋體漢字) with the equivalent Seal forms (篆文); there are 10,706 unique references (Variant Classes) in the form “PPP.QMV”.
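The kind of query described above, in which one form is used to locate related forms, can be sketched with a reverse index from Variant Class to member characters. This is a minimal illustration only: `SAMPLE` holds a handful of values copied from the sample output in this document, and `build_class_index` and `classmates` are hypothetical names, not part of any Unihan tooling.

```python
from collections import defaultdict

# A small subset of kDYC values, copied from the sample output in this document.
SAMPLE = {
    "說": ["A0-093.330", "B4-405.210", "D0-100.122"],
    "説": ["B0-093.330"],
    "悅": ["B2-093.330", "B5-405.210", "C0-151.310"],
    "兌": ["A0-405.210", "D0-384.120", "D0-472.340", "D2-405.220", "D2-405.310"],
    "兑": ["B0-405.210", "D3-405.310"],
}


def build_class_index(kdyc):
    """Map each Variant Class ("PPP.QMV") to {character: "TO" prefix}."""
    index = defaultdict(dict)
    for ch, values in kdyc.items():
        for value in values:
            prefix, variant_class = value.split("-", 1)
            index[variant_class][ch] = prefix
    return index


def classmates(ch, kdyc, index):
    """Return the other characters that share any Variant Class with `ch`."""
    found = set()
    for value in kdyc.get(ch, []):
        variant_class = value.split("-", 1)[1]
        found.update(index[variant_class])  # adds the member characters
    found.discard(ch)
    return found


if __name__ == "__main__":
    index = build_class_index(SAMPLE)
    # Members of one class, with the Type and Offset each was assigned.
    print(dict(index["093.330"]))
    # Forms reachable from U+8AAC through shared Variant Classes.
    print(classmates("説", SAMPLE, index))
```

With the full property data loaded in place of `SAMPLE`, two strings that differ only in such variants could be compared by checking whether the characters at each position are classmates. Whether all four Types should count as equivalent for that purpose is a policy choice this sketch leaves open.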
• Most mappings are many-to-one: multiple CJKUI map to each Seal form (multiple Hànzì belong to the same Variant Class); each CJKUI can occur in multiple Variant Classes (with different Types; there are ~61,800 references total, for the various Types).
• The Source Separation Rule (R1, obsolete since 1992; see Section 18.1, Han, in [Unicode]) accounts for the separate encoding of many characters which otherwise might have been unified (such as 兑/兌, 说/説/說).
• Mappings for simplified-only characters (CJKUI used only in PRC simple-form texts, and not occurring in traditional texts) are for the most part excluded from this data (for example, 说), since such mappings are derivable from other Unihan properties (kSimplifiedVariant, kTraditionalVariant). But where PRC traditional forms (as defined in kHanYu and 《汉语大词典》) differ from the ROC forms (kBigFive), both forms are included (for example, 兑/兌). Mappings for all kCompatibility characters are also derivable and so excluded.
• This data reflects some three decades of development (as of January 2020), with refinement and extension for on-going CJKUI encoding Extensions; for further details see Cook (2003).
-- Sources --
|