Unicode Character Database FAQ

Q: What is the Unicode Character Database?

It is a set of data files defining character properties and other information about Unicode characters. It is commonly known by the acronym "UCD".

Q: Where can I find the Unicode Character Database?

The latest version of the UCD is always found online at: https://www.unicode.org/Public/UCD/latest/.

This location also includes the large collection of data specifically for Unified ideographs, called the "Unihan" Database, which forms a separate subset of the Unicode character properties. Its structure and contents differ significantly from the other data files, so it is generally not included when talking about the "UCD".

Q: Where can I find general information about the Unicode Character Database?

See About the Unicode Character Database. (The Unihan database is documented in UAX #38: The Unicode Han Database (Unihan).)

Q: Where do I find detailed information about the data file structure and character properties?

Unicode Standard Annex #44, Unicode Character Database, provides the detailed documentation for the UCD. It covers the file formats, the specialized files (including test data files), and each character property defined in the UCD.

Q: Why are some properties duplicated in “Derived...” property files?

A derived property is one that is normatively specified elsewhere, whether explicitly or implicitly, but is either extracted or derived from a combination of properties and presented for convenience, usually in its own file. Some of the original data files, like UnicodeData.txt, LineBreak.txt, and so on, use specialized formats different from the other files. For properties in these files, the derived files provide them again in a format that doesn't require special parsing. For other properties, like the one listed in DerivedAge.txt, the underlying property is not explicitly listed elsewhere, but implicitly defined. Unicode Standard Annex #44, Unicode Character Database, provides additional documentation. [AF]
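As a rough illustration of what "doesn't require special parsing" means in practice, here is a minimal Python sketch that reads lines in the common semicolon-separated layout used by the derived files: a code point or a range written as first..last, then a value, then an optional comment after "#". The function name parse_ucd_lines and the two sample lines are illustrative assumptions modeled on the style of DerivedAge.txt. UAX #44 remains the authoritative description of the format.

```python
from typing import Iterable, Iterator, Tuple


def parse_ucd_lines(lines: Iterable[str]) -> Iterator[Tuple[int, int, str]]:
    """Yield (first, last, value) for each data line.

    Hypothetical helper: assumes the common "code point(s) ; value # comment"
    layout. Lines starting with "# @missing" are treated as ordinary comments
    here, so default values are not handled.
    """
    for raw in lines:
        # Everything after '#' is a comment; blank lines carry no data.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 2:
            continue  # not a data line in the assumed layout
        code_points, value = fields[0], fields[1]
        if ".." in code_points:
            first, last = code_points.split("..", 1)
        else:
            first = last = code_points
        yield int(first, 16), int(last, 16), value


# Illustrative excerpt written in the style of DerivedAge.txt.
SAMPLE = """\
# comment line
0000..001F    ; 1.1 #  [32] <control-0000>..<control-001F>
20AC          ; 2.1 #       EURO SIGN
"""

for first, last, value in parse_ucd_lines(SAMPLE.splitlines()):
    print(f"U+{first:04X}..U+{last:04X} -> {value}")
```

A file such as UnicodeData.txt would need additional handling (for example, for its paired First/Last range entries), which is the kind of special parsing the derived files avoid.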

Q: Is the UCD available in a format that can be parsed with standard tools?

Some of the original file formats in the UCD are pretty arcane and hard to parse. Starting with Unicode 5.1, the entire UCD is also available in XML format. The XML version is available in the versioned directory for each release. The latest version of the XML files can always be found at: https://www.unicode.org/Public/UCD/latest/ucdxml/.

Q: Where is the documentation for the XML data representation?

Start with the readme.txt, which explains what each of the zipped XML data files contains: https://www.unicode.org/Public/UCD/latest/ucdxml/.
The detailed specification of the attributes and other conventions used in the XML can be found in Unicode Standard Annex #42, Unicode Character Database in XML.
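For readers who want to see what consuming the XML might look like, the following Python sketch reads a tiny hand-written fragment with the standard library. It assumes the namespace and the attribute names (cp, first-cp, last-cp, gc, na) described in UAX #42. The fragment itself and the helper name iter_chars are illustrative, not an excerpt from the published files.

```python
import xml.etree.ElementTree as ET

# Namespace assumed from UAX #42.
UCD_NS = "http://www.unicode.org/ns/2003/ucd/1.0"

# Hand-written illustrative fragment; the real files carry many more attributes.
SAMPLE_XML = f"""<ucd xmlns="{UCD_NS}">
  <repertoire>
    <char cp="0041" na="LATIN CAPITAL LETTER A" gc="Lu"/>
    <char first-cp="3400" last-cp="4DBF" na="CJK UNIFIED IDEOGRAPH-#" gc="Lo"/>
  </repertoire>
</ucd>"""


def iter_chars(root):
    """Yield (first, last, general_category, name) for each <char> element."""
    for elem in root.iter(f"{{{UCD_NS}}}char"):
        attrs = elem.attrib
        if "cp" in attrs:
            # A single code point.
            first = last = int(attrs["cp"], 16)
        else:
            # A range of code points sharing the same property values.
            first = int(attrs["first-cp"], 16)
            last = int(attrs["last-cp"], 16)
        yield first, last, attrs.get("gc"), attrs.get("na")


root = ET.fromstring(SAMPLE_XML)
for first, last, gc, name in iter_chars(root):
    print(f"U+{first:04X}..U+{last:04X} gc={gc} na={name}")
```

The published XML files are large, so a streaming reader such as xml.etree.ElementTree.iterparse is likely a better fit than loading a whole file into memory.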

Q: Why don't you provide an XML Schema for the XML representation of the UCD?

We found that developing a RELAX NG schema (an ISO standard, by the way) is considerably simpler than developing a W3C XML Schema. Furthermore, there are tools to convert from RELAX NG to XML Schema (for example, trang, available from thaiopensource.com), should the need arise. It is also worth noting that an XML Schema is not required for the proper interpretation of the data for the Unicode Character Database, because there are no default values provided. [EM]

Q: Are there other FAQs which deal with Unicode character properties?

Yes. Questions about particular character properties might be answered in the FAQ on Character Properties, Case Mappings & Names.

Q: Where can I find information about Han ideographs?

The UCD data files contain only selected properties related to Han ideographs. Additional properties are found in the Unihan Database. The data files for the Unihan Database and the UCD are located together. [AF]

Q: Is the UCD the sole source of property data?

No. Specifications such as those on emoji, security, IDNA, or Unicode in mathematics define their own properties beyond those documented in the UCD.

Q: What about older versions of the Unicode Character Database?

All older versions of the UCD, which are formally a part of earlier releases of the Unicode Standard, are permanently archived on the Unicode web site. They can be found by following the links to component listings for specific versions at: https://www.unicode.org/versions/enumeratedversions.html.