Blocks and Ranges
Q: What are Unicode blocks?
Blocks in the Unicode Standard are named ranges of code points. They are used to help organize the standard into groupings of related kinds of characters, for convenience in reference. And they are used by a charting program to define the ranges of characters printed out together for the code charts seen in the book or posted online.
Q: Is there a reason that so many blocks look like other 7-bit or 8-bit code pages?
The character blocks in the Unicode Standard are merely a convenience for exposition of the standard. While it might seem that way at first glance, the Unicode Standard is not an assemblage of 8-bit code pages for different languages, but a single, universal character encoding. All characters are equally accessible, and the blocks have no implementation expression in most Unicode software. The fact that some character blocks, and in particular, the Indic script character blocks, bear a superficial resemblance, in ordering and size, to other standards such as ISCII or the ISO/IEC 8859 series, is primarily to assist people in interpreting the repertoire visually in comparison to legacy encodings, and to make it simpler to develop conversion tables for older character encodings.
Q: Where can I find the definitive list of Unicode blocks?
The Unicode blocks and their names are a normative part of the Unicode Standard. The exact list is always maintained in one of the files of the Unicode Character Database, Blocks.txt.
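As an illustration of the file format, here is a minimal sketch of reading Blocks.txt-style data. It assumes each data line has the form `start..end; name` and that `#` begins a comment; the inline excerpt and the function name `parse_blocks` are hypothetical and stand in for the real file.

```python
from typing import List, Tuple

# A short excerpt in the Blocks.txt line format; the real file is much longer.
SAMPLE = """\
# Excerpt for illustration only
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
AC00..D7AF; Hangul Syllables
"""

def parse_blocks(text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, name) triples from Blocks.txt-style text."""
    blocks = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments, then whitespace
        if not line:
            continue  # skip blank and comment-only lines
        range_part, name = line.split(";", 1)
        start_hex, end_hex = range_part.strip().split("..")
        blocks.append((int(start_hex, 16), int(end_hex, 16), name.strip()))
    return blocks

blocks = parse_blocks(SAMPLE)
```

A real implementation would read the Blocks.txt file that matches the Unicode version it targets rather than an inline excerpt.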
Q: Is casing significant for Unicode block names?
No. Block names are commonly represented in Titlecase, but can also appear in all UPPERCASE. Other casing combinations can occur, and case should be ignored when comparing block names.
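A case-insensitive comparison can be sketched as follows. The helper name `same_block_name` is hypothetical, and the sketch handles only the case differences described above.

```python
def same_block_name(a: str, b: str) -> bool:
    """Compare two block names, ignoring case."""
    return a.casefold() == b.casefold()
```

With this helper, "Basic Latin", "BASIC LATIN", and "basic latin" all compare as the same block name.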
Q: What is the difference between Unicode ranges and Unicode blocks?
A range simply refers to any sequence of Unicode code points with a starting point and an ending point. It doesn't have to be the same as the specific ranges for the Unicode blocks. A range can overlap block boundaries, and a range in general doesn't have any name.
Q: How are Unicode ranges expressed?
By using the U+ form for the starting and ending code points, connected with dots. So, for example: U+0100..U+03FF. Sometimes a dash or a long dash is substituted for the two dots, and the "U+" can be omitted if it is clear you are talking about Unicode code points specifically.
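The notation is simple enough to read and write mechanically. The sketch below is one possible approach, not a standard API: `parse_range` and `format_range` are hypothetical names, and accepting a hyphen, en dash, or em dash is an assumption about what "a dash or a long dash" covers.

```python
import re
from typing import Tuple

# Accepts "U+0100..U+03FF", "0100..03FF", and a hyphen, en dash, or em dash
# in place of the two dots.
_RANGE = re.compile(
    r"^(?:U\+)?([0-9A-Fa-f]{4,6})\s*(?:\.\.|[-\u2013\u2014])\s*"
    r"(?:U\+)?([0-9A-Fa-f]{4,6})$"
)

def parse_range(text: str) -> Tuple[int, int]:
    """Parse a range expression into (start, end) code points."""
    match = _RANGE.match(text.strip())
    if match is None:
        raise ValueError(f"not a code point range: {text!r}")
    start, end = (int(group, 16) for group in match.groups())
    if start > end:
        raise ValueError(f"range start is after range end: {text!r}")
    return start, end

def format_range(start: int, end: int) -> str:
    """Format a range in the U+XXXX..U+XXXX form, with at least four digits."""
    return f"U+{start:04X}..U+{end:04X}"
```

For example, `parse_range("U+0100..U+03FF")` and `parse_range("0100-03FF")` both return `(0x100, 0x3FF)`.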
Q: Are there any restrictions on the ranges used for Unicode blocks?
Every Unicode block starts with a code point of the form nnn0 or nnnn0 and ends with a code point of the form nnnF or nnnnF. That is another way of saying that every block consists of some number of complete columns of characters, when seen printed out in charts. And the number of code points in every block is divisible by 16. Also, the ranges for the Unicode blocks do not extend across planes in the standard. The reasons for these restrictions have mostly to do with convenience for printing out the charts, but they also provide some minor benefits for implementations when constructing tables.
Q: Can blocks overlap?
No. Every Unicode block is discrete, and cannot overlap with any other block. Also, every assigned character in the Unicode Standard has to be in a block (and only one block, of course). This ensures that when code charts are printed, no characters are omitted simply because they aren't in a block.
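Because blocks never overlap, a sorted table of block ranges supports a simple binary search from code point to block. The sketch below uses a four-entry excerpt and a hypothetical function name `block_of`; `No_Block` is the value the Unicode Character Database uses for code points that are in no block.

```python
from bisect import bisect_right

# Excerpt only, sorted by start; a real table would hold every block.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0E00, 0x0E7F, "Thai"),
    (0xAC00, 0xD7AF, "Hangul Syllables"),
]
_STARTS = [start for start, _, _ in BLOCKS]

def block_of(code_point: int) -> str:
    """Return the name of the block containing code_point, or "No_Block"."""
    i = bisect_right(_STARTS, code_point) - 1  # last block starting at or before
    if i >= 0:
        _, end, name = BLOCKS[i]
        if code_point <= end:
            return name
    return "No_Block"
```

With this excerpt, `block_of(0x0E3F)` returns `"Thai"`. Code points such as U+0100 return `"No_Block"` here only because the table is truncated.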
Q: Can block boundaries change?
Yes. However, blocks of encoded characters can only grow, not shrink. The Roadmap Committee can adjust the boundaries for blocks of characters not yet encoded, as needed. When a block gets full, an "Extended-n" block is often created, except in rare cases where it is possible to extend the boundaries of the original block.
Q: Are there any restrictions on what characters can be encoded in a Unicode block?
There are no absolute rules involved, but in general the encoding committees are careful to try to encode related characters together when they can, given the constraints on what has already been encoded. So any additional Devanagari letters would be encoded in the existing Devanagari block, if possible, and additional punctuation in one of the existing punctuation blocks, and so on.
Q: Do Unicode blocks have defined character properties?
No. The character properties are associated with encoded characters themselves, rather than the blocks they are encoded in.
Q: Do Unicode ranges ever have defined character properties?
Yes, there are a few special cases where specific ranges of code points are defined to have default property values. The most important of these cases is for the Bidi_Class property, where certain ranges of code points, including unassigned code points, are specified to be right-to-left. This is done to enable stability for implementations of the Bidirectional Algorithm, as characters are added over time to the standard. There are other instances of special ranges with predefined character properties. For details, see the documentation for the Unicode Character Database.
Q: Do blocks ever contain characters of different script properties?
Yes. For example, the Thai block contains Thai characters that have the Thai script property, but it also contains the character for the baht currency sign, which is used in Thai text, of course, but which is defined to have the Common script property. To find the script property value for any character you need to rely on the Unicode Character Database data file, Scripts.txt, rather than the block value alone. For another example, the Greek and Coptic block contains mostly characters of the Greek script, but also a few historic characters of the Coptic script.
Q: Are all characters of the same script kept together in a single block?
Many scripts have a main block and one or more extensions in different blocks. In some cases, such as Latin, the encoded characters are spread across as many as a dozen different Unicode blocks. The Han ideographs are also spread across several blocks. Such cases are simply the result of the history of the standard. In other instances, a single block may contain characters of more than one script.
Q: Are Unicode blocks predefined, even before characters are encoded for them?
Formally, no. However, the Unicode Consortium and SC2/WG2 jointly maintain a Roadmap that contains both existing blocks and tentative allocations of blocks for future encoding. The tentative allocations help in the planning for encoding and provide a convenient place for linking to proposal documents. However, they are not part of the standard itself, and such tentative block allocations can be and frequently are moved around during the process of proposal review and approval. For details, see the Roadmaps.
Q: Are Unicode blocks important for implementations of Unicode?
Usually they are not. What matters for implementations of Unicode are the properties for characters. Those are obtained from other data files in the Unicode Character Database, and don't depend on blocks, per se. In particular, since block identity is not exactly correlated with script identity, it is much better to rely on Scripts.txt when implementing an operation that depends on script identity for a character.
Blocks are sometimes convenient for display of characters, as for a character picker application. But even when expressing such things as the supported repertoire for an application, it is generally better to express that in terms of explicit ranges of assigned characters, rather than just in terms of blocks.
Q: Can Unicode blocks be used in defining sets for regular expressions?
Yes, but only with some care, as they may lead to surprises—particularly in not matching characters that users may expect them to. For further discussion, see cautions about use of blocks in regular expressions.
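One such surprise can be sketched with Python's standard `re` module. That module has no syntax for naming a block, so the Basic Latin range is spelled out by hand; the function name `in_basic_latin` is hypothetical.

```python
import re

# Character class covering the Basic Latin block, U+0000..U+007F.
_BASIC_LATIN = re.compile(r"[\u0000-\u007F]+")

def in_basic_latin(text: str) -> bool:
    """True if every character of text falls in the Basic Latin block."""
    return _BASIC_LATIN.fullmatch(text) is not None

print(in_basic_latin("cafe"))       # True
print(in_basic_latin("caf\u00E9"))  # False: U+00E9 is in Latin-1 Supplement
```

A user who thinks of the set as "Latin letters" may expect the second word to match, but U+00E9 lies outside the block even though it is a character of the Latin script.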
Q: Why do block ranges and names appear differently in different places in the Unicode Standard? Are these errors I should report?
There are several reasons for such discrepancies, and in most instances they are intentional distinctions. They are not errors to report. The names and ranges of blocks are occasionally modified editorially in the text of the Unicode Standard. Block names are sometimes shortened a little in book headers, so they fit on a line and don't cause problems in the table of contents or index. Sometimes when discussing characters in a single script where two adjacent blocks contain those characters, a header may be listed coalescing the range under discussion, or a header may list one name and two discrete ranges. Such changes are simply to help in the presentation of material about the standard, and in no way are intended to modify the normative block definitions. In all cases the normative block ranges and names are those specified in Blocks.txt.
Q: Why do code chart headers in the Unicode names list sometimes differ from block headers?
The Unicode names list file, which can be found in the Unicode Character Database, is the data file which is used to drive the charting program for the Unicode code charts. It uses some special markup conventions explained in the documentation of the names list file. In particular, the header entries in the names list file occasionally depart from normative block ranges because of constraints on how the charting program works and also to prevent the printing of unnecessary blank columns or pages in the charts. The label used in a header entry may also differ from a block name, adding annotations that are helpful for reading the charts. For example, here is the normative block definition for the Latin-1 Supplement:
- 0080..00FF; Latin-1 Supplement
But the names list file uses a header entry:
- @@ 0080 C1 Controls and Latin-1 Supplement (Latin-1 Supplement) 00FF
The range used is the same, but the header entry adds "C1 Controls and" for clarity when printing the Unicode code charts. The parenthetical string is used instead by the charting program when printing code charts for ISO/IEC 10646.
In another example, the normative block definition for CJK Unified Ideographs is:
- 4E00..9FFF; CJK Unified Ideographs
But the names list file uses a header entry:
- @@ 4E00 CJK Unified Ideographs 9FD5
The charting program uses the "9FD5" value to know where the last assigned character to print is, since CJK Unified Ideographs are not explicitly listed in UnicodeData.txt. And the charting program uses this information to optimize page breaks and prevent printing of empty columns.
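The difference between a header entry and the normative block range can be computed mechanically. The sketch below is an assumption-laden illustration, not the charting program's logic: it assumes a header line consists of `@@`, a start code point, a label, and an end code point separated by whitespace, and `parse_header` is a hypothetical name.

```python
import re
from typing import Optional, Tuple

# "@@", start, label, end; fields are assumed to be separated by whitespace.
_HEADER = re.compile(r"^@@\s+([0-9A-F]{4,6})\s+(.+?)\s+([0-9A-F]{4,6})$")

def parse_header(line: str) -> Optional[Tuple[int, str, int]]:
    """Return (start, label, end) for a header entry, or None otherwise."""
    match = _HEADER.match(line.strip())
    if match is None:
        return None
    start, label, end = match.groups()
    return int(start, 16), label, int(end, 16)

header = parse_header("@@\t4E00\tCJK Unified Ideographs\t9FD5")
block_end = 0x9FFF  # normative end of the block, from Blocks.txt

# Code points in the block that lie past the header's end value.
beyond_header = block_end - header[2]
```

For the header entry quoted above, `beyond_header` is 42, the number of code points from U+9FD6 through U+9FFF.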
Finally, the published code charts depart from both Blocks.txt and NamesList.txt in some instances. For example, there are two normative high surrogate blocks:
- D800..DB7F; High Surrogates
- DB80..DBFF; High Private Use Surrogates
For these blocks, the code charts show only a header page explaining the High Surrogate Area, Range: D800-DBFF, and Low Surrogate Area, Range: DC00-DFFF, respectively. Because characters cannot be assigned to code points in these ranges, no code chart is shown.
Q: Do Unicode blocks exactly match the blocks defined in ISO/IEC 10646?
For the most part they do, but there are several principled exceptions.
First, the Unicode blocks for Basic Latin and the Latin-1 Supplement are extended to incorporate the control characters, since the Unicode Standard prints out all the code points for the control characters, as well as the graphic characters.
- Unicode: 0000..007F; Basic Latin
- 10646: 0020-007E BASIC LATIN
- Unicode: 0080..00FF; Latin-1 Supplement
- 10646: 00A0-00FF LATIN-1 SUPPLEMENT
There is a similar distinction for the special cases of the Byte Order Mark at U+FEFF and the two noncharacters at the very end of the BMP.
- Unicode: FE70..FEFF; Arabic Presentation Forms-B
- 10646: FE70-FEFE ARABIC PRESENTATION FORMS-B
- Unicode: FFF0..FFFF; Specials
- 10646: FFF0-FFFD SPECIALS
Second, for Hangul syllables, 10646 defines a block that ends at the last encoded Hangul syllable, but the Unicode rules for block definitions require ending a block at an even 16-character boundary:
- Unicode: AC00..D7AF; Hangul Syllables
- 10646: AC00-D7A3 HANGUL SYLLABLES
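The Unicode end point follows from the 10646 end point by rounding up to the end of its 16-code-point column. As a small sketch (the function name `column_end` is hypothetical):

```python
def column_end(last_assigned: int) -> int:
    """Round a code point up to the last code point of its 16-wide column."""
    return last_assigned | 0xF

print(f"{column_end(0xD7A3):04X}")  # D7AF
```

This gives the smallest end point that satisfies the boundary rule; a block may of course extend further than its last assigned character requires.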
Third, the Unicode Standard defines blocks for sub-ranges of surrogate code points; those have no blocks defined in 10646. Also, the Unicode Standard defines blocks for the supplementary private use areas on planes 15 and 16, while no blocks are defined for those in 10646.