Unicode Frequently Asked Questions

Basic Questions

Q: What is Unicode?

Unicode is the universal character encoding, maintained by the Unicode Consortium. This encoding standard provides the basis for processing, storage and interchange of text data in any language in all modern software and information technology protocols. See "What is Unicode?" for a short explanation of what Unicode is all about.

Q: What is the scope of Unicode?

Unicode covers all the characters for all the writing systems of the world, modern and ancient. It also includes technical symbols, punctuation marks, and many other characters used in writing text. The Unicode Standard is intended to support the needs of all types of users, whether in business or academia, using mainstream or minority scripts.

Q: How many characters are in Unicode?

The short answer is that as of Version 15.0, the Unicode Standard contains 149,186 characters. The long answer is rather more complicated, because of all the different kinds of characters that people might be interested in counting. To dive into this question in detail, see Unicode Statistics.
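As a rough illustration of why the count depends on what you include, here is a minimal Python sketch that tallies code points by General_Category using the standard `unicodedata` module. It assumes the headline figure counts graphic and format characters only, so it excludes controls (Cc), private use (Co), surrogates (Cs), and unassigned code points (Cn). The function name `count_encoded_characters` is made up for this example, and the result depends on the Unicode version bundled with your Python build, so it may not match the Version 15.0 figure.

```python
import sys
import unicodedata

# Categories left out of the tally (an assumption of this sketch):
# Cc = control, Co = private use, Cs = surrogate, Cn = unassigned/noncharacter.
EXCLUDED_CATEGORIES = {"Cc", "Co", "Cs", "Cn"}


def count_encoded_characters(code_points=None) -> int:
    """Count code points whose General_Category is not in EXCLUDED_CATEGORIES."""
    if code_points is None:
        code_points = range(sys.maxunicode + 1)
    return sum(
        1
        for cp in code_points
        if unicodedata.category(chr(cp)) not in EXCLUDED_CATEGORIES
    )


if __name__ == "__main__":
    # unidata_version reports which Unicode version this Python build uses.
    print("Unicode version:", unicodedata.unidata_version)
    print("Graphic and format characters:", count_encoded_characters())
```

Changing the set of excluded categories (for example, counting controls as well) changes the total, which is the point the long answer makes.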

Q: Does Unicode encode scripts or languages?

The Unicode Standard encodes scripts used for writing languages, rather than languages. It encodes characters on a script basis. For example, there is only one set of Latin characters defined, despite the fact that the Latin script is used for the alphabets of many different languages.

The same principle applies for any other script (Cyrillic, Arabic, Ethiopic, Devanagari, ...), which is used for writing many different languages. However, the Unicode Standard does not encode scripts as such.

For a listing of scripts and their names, see Supported Scripts. For the ISO standard for script codes, see ISO/IEC 15924, Code for the Representation of Names of Scripts. For the ISO standard for language codes, see ISO 639, Code for the Representation of Names of Languages.

Q: How many languages are covered by Unicode?

There is no simple answer because many scripts (especially the Latin script) are used to write a large number of languages. Many languages are written in multiple scripts.

The simplest answer is that Unicode covers all of the languages that can be written in the following widely used scripts: Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac, Thaana, Devanagari, Bengali, Gurmukhi, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Tibetan, Myanmar, Georgian, Hangul, Ethiopic, Cherokee, Canadian Aboriginal Syllabics, Khmer, Mongolian, Han (Japanese, Chinese, Korean ideographs), Hiragana, Katakana, and Yi.

In addition to these, Unicode encodes scripts that may be used to write many less widely used languages, or that are used as an additional way to write a particular language.

Unicode also includes many historic scripts used to write long-dead languages. See Supported Scripts for the full list. See also the list of Languages and Scripts.

Q: Why does Unicode unify Chinese, Japanese, and Korean ideographs, but not unify the Latin, Greek, and Cyrillic alphabets?

The basic answer is that the ideographs are considered a common script, while the alphabets, despite a common historical origin, now function as separate scripts. For a detailed answer, see https://www.unicode.org/notes/tn26/.

Q: What's the connection between Unicode and the International Standard, ISO/IEC 10646?

Both 10646 and Unicode specify the same character encoding: they contain the same characters at the same locations. They remain fully synchronized even as they continue to be extended to cover additional characters. See the Unicode and ISO 10646 FAQ and Appendix C of the Unicode Standard for a more extensive explanation of their relationship.

Q: My company might want to get involved in Unicode. How can I present the case to management?

See Why Join and How to Join for a white paper outlining the overall value proposition of a Unicode membership to an organization and information on joining.

Q: Where can I purchase the Unicode software or the Unicode font?

The Unicode Standard is not a software program, nor is it a font. It is a character encoding system, like ASCII, designed to help developers who want to create software applications that work in any language in the world.

If all you need is to create a multilingual text or write a document or send e-mail in another language, then a Unicode-compliant text editor, mail program, or word processing package will do the job. Please see the following pages on our web site for further information about the standard and where to look for help:

If you are a developer starting to learn about using Unicode, you should read the latest version of the Unicode Standard to find out more about Unicode. In addition to the pages listed above, please see:

Q: My computer cannot display some of the latest Unicode symbols I need. How can I display and type the latest Unicode characters?

The reason you don't see the characters as expected is most likely because you need to install a font that covers the set of Unicode characters you are trying to see. Other possible reasons might be that:

If you need to install a font to resolve the problem, free fonts can be downloaded for many Unicode ranges. See Font Resources, or search in your browser for the name of the font you need. Fonts typically cover only one script, or sometimes a range of scripts. Often fonts haven't been updated to render the most recent additions to the Unicode character set. See also Display Problems?

Q: I tried downloading and extracting the latest Unicode data files from the Unicode web site, but it has no effect on the characters my computer can display or type.

The Unicode data files do not function like a software patch, and cannot automatically update existing fonts or applications, so downloading the files will not help in displaying and typing the Unicode characters needed.

Q: I can't find my character in Unicode. Where do I look?

Look at "Where is my Character?"

Q: Where do I find information on the use of characters for a given writing system or script?

The block introductions found in Chapters 7 through 22 of the Unicode Standard are a good place to start. Another place to look is the comments contained in the names list, which accompanies the code charts, although the comments are not intended to be encyclopedic. The data files in the Unicode Character Database provide information, often in machine-readable form, on character properties, line breaking, word breaking, and so on.
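To give a feel for what "machine-readable" means here, the following Python sketch parses one line in the semicolon-separated layout used by UnicodeData.txt, one of the data files in the Unicode Character Database. The record type and function name are invented for this example, and only a few of the 15 fields are extracted. Treat it as an illustration of the file layout, not a complete parser.

```python
from typing import NamedTuple


class UnicodeDataRecord(NamedTuple):
    """A few fields from one UnicodeData.txt line (hypothetical helper type)."""
    code_point: int
    name: str
    general_category: str
    simple_lowercase: str  # hex digits, or "" when there is no mapping


def parse_unicode_data_line(line: str) -> UnicodeDataRecord:
    """Split one UnicodeData.txt-style line into selected fields."""
    fields = line.rstrip("\n").split(";")
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")
    return UnicodeDataRecord(
        code_point=int(fields[0], 16),   # field 0: code point in hex
        name=fields[1],                  # field 1: character name
        general_category=fields[2],      # field 2: General_Category
        simple_lowercase=fields[13],     # field 13: simple lowercase mapping
    )


if __name__ == "__main__":
    sample = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
    print(parse_unicode_data_line(sample))
```

For everyday work, a library that already wraps this data (such as ICU, or Python's `unicodedata` module) is usually a better choice than parsing the files yourself.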

Q: Are script descriptions in the block introductions complete?

No. They cover the information necessary to define the encoded characters, but issues such as usage conventions, layout behavior and glyph design are usually covered only as far as needed to help establish the identity of an encoded character.

Q: Where do I go to find more information about characters for a given script?

Consult the bibliography in the References on the Unicode website. Also check the original proposals to encode the scripts. Those are the documents in which the characters were proposed for encoding. While the proposals are not authoritative and do not have any formal status, they were used in the process of committee deliberation. They often contain useful information, including examples or lists of references.

Q: Where do I find script proposals for a specific script? 

Most proposals are available in the UTC Document Registry. You can also search for specific topics on the Unicode website to find proposals. Many proposals are also available on the JTC 1/SC2/WG2 document register. Individually maintained websites may also include links to particular script proposals.

Q: Where can I find resources to help me with Unicode?

Here's a short table that suggests links to information that can answer typical questions.

Question:
  • What is in each particular version of Unicode?
  • What is in the latest version of Unicode?
Reference: Versions of the Unicode Standard; Enumerated Versions

Question:
  • What is the meaning of a special term?
Reference: Unicode Glossary or Terminology for translations of terms

Question:
  • Where can I find code libraries, commercial or open-source, for the following?
    • character conversion
    • collation
    • date, time, number, and message formatting
    • normalization
    • and the other features mentioned under "What level of support should I look for?"
Reference: See International Components for Unicode (ICU)

Question:
  • What should regular expressions do with Unicode?
  • Can I transmit Unicode text on EBCDIC systems?
  • How should a word-processor break lines in Unicode text?
  • Are there ways to normalize Unicode text?
  • For the Far East, how do I decide which characters should use wide glyphs and which ones narrow?
  • How should I sort Unicode text?
  • Is there an update to the BIDI algorithm?
  • How can I compress Unicode text?
Reference: Unicode Technical Reports; also Specifications FAQ

Question:
  • I want to get online data for implementing Unicode. Where can I find data for:
    • Character properties?
    • Upper/lower/titlecasing?
    • Decompositions?
    • Normalization?
    • Conversion to other character encodings?
    • Code for Kanji code conversion with compressed tables?
Reference: Online Data

Question:
  • Are there conferences or seminars where we can find out more about Unicode?
Reference: Unicode Conferences

Question:
  • Who are the current members of the Consortium?
  • I am interested in joining the Consortium. Where can I find out more?
Reference: Membership Information; Our Members

Q: What does Unicode conformance require?

Chapter 3, Conformance discusses this in detail. Here's a very informal version:

Q: Can applications simply use unassigned characters as they wish?

Conformant Unicode implementations must not assign meaning to any reserved, un-encoded values outside of the private use area (but they may store and transport them as part of the text stream).

Only the values in the private use areas (U+E000..U+F8FF, U+F0000..U+FFFFD, and U+100000..U+10FFFD) are legal for private assignment. With over 137,000 code points the set of private use characters should be more than ample for the vast majority of implementations.
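The three ranges above are easy to check in code. Here is a minimal Python sketch; the names `PRIVATE_USE_RANGES`, `is_private_use`, and `private_use_count` are made up for this example. It tests whether a code point falls in a private use area and adds up the sizes of the ranges to confirm the "over 137,000" figure.

```python
# Private use ranges as listed in the answer above (inclusive bounds).
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),      # Private Use Area in the BMP
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A (plane 15)
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B (plane 16)
)


def is_private_use(code_point: int) -> bool:
    """Return True if code_point lies in one of the private use ranges."""
    return any(low <= code_point <= high for low, high in PRIVATE_USE_RANGES)


def private_use_count() -> int:
    """Total number of private use code points across all three ranges."""
    return sum(high - low + 1 for low, high in PRIVATE_USE_RANGES)


if __name__ == "__main__":
    print(is_private_use(0xE000))   # True: first private use code point
    print(is_private_use(0x0378))   # False: outside every private use range
    print(private_use_count())      # 6400 + 65534 + 65534 = 137468
```

An application that needs its own characters would restrict private assignments to code points for which `is_private_use` returns True, and would pass through any other reserved value unchanged.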

Q: Are surrogate code points the same as supplementary characters?

The two sound similar and are easily confused; they are not the same even though the concepts are related.

Surrogate code points (in the range U+D800..U+DFFF) are reserved for use in pairs to represent supplementary code points in UTF-16. When not used in this way, and when not using UTF-16, surrogate code points should not be used. There are not and will never be surrogate characters (that is, encoded characters represented with a single surrogate code point).

Supplementary characters are those assigned code points in the range U+10000..U+10FFFF, which is different from the range for surrogate code points. Whether supplementary characters are represented by a single code unit (using UTF-32), a pair of surrogate code units (as in UTF-16), or four code units (bytes) (as in UTF-8), these code points are all available to be assigned (or have been assigned) to single characters in the ordinary way.

They are called supplementary characters for historical reasons.
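The distinction can be made concrete with a little arithmetic. The Python sketch below uses helper names invented for this example. It classifies a code point as surrogate or supplementary, computes the UTF-16 surrogate pair for a supplementary code point, and reports how many code units the same character needs in UTF-8, UTF-16, and UTF-32.

```python
def is_surrogate(code_point: int) -> bool:
    """Surrogate code points occupy U+D800..U+DFFF."""
    return 0xD800 <= code_point <= 0xDFFF


def is_supplementary(code_point: int) -> bool:
    """Supplementary code points occupy U+10000..U+10FFFF."""
    return 0x10000 <= code_point <= 0x10FFFF


def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Return the (high, low) UTF-16 code units for a supplementary code point."""
    if not is_supplementary(code_point):
        raise ValueError("only supplementary code points use surrogate pairs")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def code_unit_counts(code_point: int) -> dict[str, int]:
    """Number of code units needed in each encoding form."""
    ch = chr(code_point)
    return {
        "UTF-8": len(ch.encode("utf-8")),            # 8-bit code units
        "UTF-16": len(ch.encode("utf-16-le")) // 2,  # 16-bit code units
        "UTF-32": len(ch.encode("utf-32-le")) // 4,  # 32-bit code units
    }


if __name__ == "__main__":
    cp = 0x1F600  # a supplementary character
    high, low = to_surrogate_pair(cp)
    print(f"U+{cp:X} -> {high:04X} {low:04X}")
    print(code_unit_counts(cp))
```

Note that `is_surrogate` and `is_supplementary` never return True for the same value: the surrogate range is used only as UTF-16 code units, while the supplementary range holds the characters themselves.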

Q: What can I do if I think there is an error in the Unicode Standard or other specification?

Request a correction, clarification or change to the relevant specification by submitting feedback, a formal proposal, or a bug report to the corresponding technical committee (UTC or CLDR-TC or ICU-TC). See Public Review Issues for an explanation of how to do this. (The methods are different for the three committees and the type of change requested.)

Q: Can I visit Unicode?

The Unicode Consortium is a global virtual organization, with people contributing from around the world. So there is no central physical location that would be appropriate for an in-person visit.