Latin and Cyrillic Scripts
Q: Where can I find the Latin and Cyrillic characters in the Unicode Standard?
The layout of the Latin and Cyrillic scripts in the Unicode Standard is an artifact of the history of Unicode and of ISO/IEC 10646. The Unicode Standard started out with the Basic Latin alphabet and Latin-1 Supplement laid out according to the ISO/IEC 8859-1 standard, and with the Cyrillic alphabet likewise laid out to match legacy standards.
As part of the standards compromise which resulted in the synchronization of the Unicode Standard with the drafts of ISO/IEC 10646, the Unicode Standard acquired a collection of pre-composed Latin characters. Those had to be placed somewhere, and a block was created at U+1E00..U+1EFF to accommodate them.
Since then, many additional blocks have been allocated for both Latin and Cyrillic scripts, some of which are in the first supplementary plane, meaning that each character requires 2 code units in UTF-16. With improved support, modern software should handle such supplementary characters just fine, but legacy tools may have more limited support.
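The effect on UTF-16 can be checked directly. The following is a minimal Python sketch; the helper name `utf16_code_units` is hypothetical, and the two sample code points (U+0416 from the Cyrillic block and U+1E030 from the Cyrillic Extended-D block) are chosen only for illustration.

```python
def utf16_code_units(text):
    """Return the number of 16-bit code units needed to encode text in UTF-16."""
    # Encode without a byte order mark; each code unit occupies two bytes.
    return len(text.encode("utf-16-le")) // 2

bmp_letter = "\u0416"         # U+0416, in the Basic Multilingual Plane
supplementary = "\U0001E030"  # U+1E030, in the first supplementary plane

print(utf16_code_units(bmp_letter))     # 1
print(utf16_code_units(supplementary))  # 2
```

Code that indexes UTF-16 strings by code unit needs to treat the two units of such a character as a pair.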
Extensions for additional languages and systems for phonetic transcriptions were added in additional blocks or by filling up space in existing blocks. As a result, neither the Latin nor the Cyrillic characters are in any consistent alphabetical order. To locate all Latin or Cyrillic character blocks, use the index for the Unicode Character Code Charts. [AF]
Q: When do I use combining marks?
Both Latin and Cyrillic characters can be used with characters from the Combining Diacritical Marks block. For many, if not most, combining sequences widely used for a given language, there exist precomposed characters. If data is expected to be in Normalization Form NFD, you would always use the sequences except for cases where only the precomposed character is defined. For example, Latin “Ø” O WITH STROKE does not have a decomposition to O plus “ ̸” COMBINING LONG SOLIDUS OVERLAY. The same is true for many other letters with bar or stroke overlays.
If instead data is expected to be in NFC, you would always use the precomposed forms where available and the combining sequences only where there is no equivalent precomposed character. The need for such sequences is relatively rare. Because it matches legacy practice, NFC tends to be used in a wider range of contexts than NFD. [AF]
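The difference between the two forms can be observed with Python's standard `unicodedata` module. This is a small sketch rather than a recommendation of any particular workflow; the helper `code_points` is a hypothetical name, and the sample letters (“é”, Cyrillic “й”, and “Ø”) are chosen for illustration.

```python
import unicodedata

def code_points(text):
    """Format each character of text as a U+XXXX code point label."""
    return [f"U+{ord(ch):04X}" for ch in text]

samples = {
    "e with acute": "\u00E9",      # decomposes to e + COMBINING ACUTE ACCENT
    "Cyrillic short i": "\u0439",  # decomposes to U+0438 + COMBINING BREVE
    "O with stroke": "\u00D8",     # has no canonical decomposition
}

for label, ch in samples.items():
    nfd = unicodedata.normalize("NFD", ch)
    nfc = unicodedata.normalize("NFC", nfd)
    print(label, "NFD:", code_points(nfd), "NFC:", code_points(nfc))
```

The first two letters expand to two code points under NFD and recompose under NFC, while “Ø” stays a single code point in both forms.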
Q: Why have the Latin and Cyrillic scripts not been unified?
The Latin and Cyrillic scripts share a common ancestor (Greek) and are closely related. A number of characters have closely similar appearance, and in many if not most fonts they are identical. In fact, it is possible to “spell” many English words entirely with Cyrillic letters with the reader none the wiser about the substitution. That raises the question of why these aren’t treated (together with Greek) as a single script, where each of the many languages uses whatever subset it requires.
Even though some letters like Latin “B” and Cyrillic “В” (and Greek “Β”) may look the same, their lower case equivalents are “b” and “в” (or “β”) and thus do not look the same. It is a longstanding principle of the Unicode Standard to disunify characters that do not share the same case mappings so that case mappings are unique as much as possible.
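The case-mapping argument can be illustrated with a short Python sketch. The dictionary names are hypothetical; the three letters are the ones named in the paragraph above.

```python
# Three capital letters that look alike in most fonts.
look_alikes = {
    "Latin": "B",          # U+0042
    "Cyrillic": "\u0412",  # U+0412 CYRILLIC CAPITAL LETTER VE
    "Greek": "\u0392",     # U+0392 GREEK CAPITAL LETTER BETA
}

# Each letter maps to a different lowercase letter, which is only
# possible because the capitals are encoded as separate characters.
lowercase = {script: letter.lower() for script, letter in look_alikes.items()}

for script, letter in lowercase.items():
    print(script, f"U+{ord(letter):04X}", letter)
```

If the three capitals shared one code point, a simple context-free lowercase mapping could produce only one of “b”, “в”, or “β”.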
Another processing task that depends on a separate script identity is collation. Even when sorting according to English rules, terms in Cyrillic are all sorted together, and not adjacent to whatever English term they might look like. [AF]
Q: What should I do about look-alike Latin and Cyrillic letters?
In regular text there shouldn’t be an issue when you mix scripts, whether because of an inserted quote or accidental substitution of a look-alike character from the other script. The reason is that keyboards tend to be specific to the language and limited to the script that is used, and readers only care that the shape displayed makes sense in the context.
The situation is very different in security-relevant contexts, such as network identifiers. In those cases, if the code point is different from what the user assumes from the appearance, a substitution could facilitate spoofing. Mitigation approaches range from disallowing mixed-script identifiers, to flagging unusual script use, or treating any Latin label that looks like a given Cyrillic label as equivalent to the latter. Such equivalence can be used to prevent duplicate registration of look-alike identifiers or to make sure that the same appearance always resolves to the same network resource.
See UTS#39 Unicode Security Mechanisms for data on confusables and strategies to mitigate the issue. [AF]
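As a rough illustration of flagging mixed-script identifiers, the sketch below guesses each letter's script from the first word of its character name. This is an assumption made for brevity: it is not the Script property, it does not use the UTS #39 confusables data, and the function names `rough_scripts` and `is_mixed_script` are hypothetical. A production check should rely on the data and procedures described in UTS #39.

```python
import unicodedata

# Heuristic only: infer the script from the character name prefix.
KNOWN_SCRIPTS = ("LATIN", "CYRILLIC", "GREEK")

def rough_scripts(label):
    """Return the set of scripts (by name prefix) used in label."""
    found = set()
    for ch in label:
        name = unicodedata.name(ch, "")
        for script in KNOWN_SCRIPTS:
            if name.startswith(script + " "):
                found.add(script)
    return found

def is_mixed_script(label):
    """Flag labels whose letters come from more than one known script."""
    return len(rough_scripts(label)) > 1

genuine = "paypal"        # all Latin letters
spoofed = "p\u0430ypal"   # second letter is U+0430 CYRILLIC SMALL LETTER A

print(is_mixed_script(genuine))  # False
print(is_mixed_script(spoofed))  # True
```

Digits, hyphens, and combining marks carry no script prefix in their names and are ignored by this heuristic, so it only catches the simplest mixed-script cases. A label written entirely in look-alike Cyrillic letters would pass unflagged, which is why the equivalence-based approaches mentioned above exist.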
Q: How should I handle text in which the orthography contains both Latin and Cyrillic letters together? Should I propose the Latin (or Cyrillic) letters that are not in Unicode?
Writing systems can borrow characters from different scripts (compare Japanese). In general, proposing a new Cyrillic character that exactly resembles a Latin character (or adding a Latin letter which is identical to an existing Cyrillic letter) is not advised, unless the letter meets one of the following criteria:
- the letter is in common use today, with new content regularly being generated;
- the letter has adopted unique characteristics that are not inherent in the original script (e.g., the letter has distinct casing patterns not found in the original script);
- demonstrable implementation problems are shown to occur when mixing scripts in words.
This is of particular relevance for notational systems that are not in use as an orthography.
Q: How should historical text from the Russian Empire and the former Soviet Union be represented when its alphabet contains both Latin and Cyrillic letters?
During the nineteenth and early twentieth centuries, several experimental orthographies were devised to represent different languages and dialects of the Russian Empire and former Soviet Union. In many cases, printers mixed and matched movable type sorts from both Cyrillic and Latin when publishing text. Many of the orthographies found in these printed works did not survive into modern times. To represent historical texts that contain such transitional orthographies, specialized fonts should be used, employing existing characters rather than proposing new character additions.
Q: How many languages are written with the Latin or Cyrillic script?
Both the Latin and Cyrillic scripts are used to write an extensive range of languages, and the Latin script is used for by far the largest number of languages of any script. Many of these languages may have very small populations, or may no longer be written with the script in question (or are predominantly written in some other script). Even so, a considerable number of these languages are in widespread modern use. For example, of the several thousand languages written or transcribed in the Latin script, about 200 are written and used extensively enough to suggest the need to support internet identifiers; for the Cyrillic script, that figure is 30 languages out of a total of around 160.
Q: How do I represent text that is typeset in Fraktur?
For general text, Unicode considers Fraktur (black letter) a font style of Latin, which is therefore encoded with regular Latin script characters. The rendering as Fraktur requires selecting an appropriate font. [AF]
Q: Why does Unicode contain a Fraktur alphabet?
In contrast to ordinary text, the use of Fraktur for variables in mathematical expressions is not considered styling, but carries a semantic distinction (for example, marking a variable as a vector). For that purpose, a separate mathematical alphabet has been encoded that contains the basic set of Fraktur letter shapes. (Similar mathematical alphabets exist, for example, for double-struck letters.) Use of these character codes for ordinary text in Fraktur style is discouraged. [AF]
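The separate identity of the mathematical Fraktur letters can be seen with Python's `unicodedata` module. This is a minimal sketch; the variable names are hypothetical, and U+1D504 is used as the sample character.

```python
import unicodedata

math_fraktur_a = "\U0001D504"  # a letter from the mathematical Fraktur alphabet
plain_a = "A"                  # the ordinary Latin letter

# The mathematical letter is a distinct character with its own name.
print(unicodedata.name(math_fraktur_a))  # MATHEMATICAL FRAKTUR CAPITAL A
print(math_fraktur_a == plain_a)         # False

# Compatibility normalization (NFKC) folds it to the ordinary letter,
# discarding the distinction that mathematical notation relies on.
folded = unicodedata.normalize("NFKC", math_fraktur_a)
print(folded == plain_a)                 # True
```

Because NFKC removes the distinction, applying it to mathematical content is lossy, while ordinary text set in a Fraktur font is unaffected since it already uses the regular Latin letters.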
Q: How do I represent text typeset in Gaelic or Insular style?
Gaelic text (Insular Script) would be encoded using the standard Latin letters, with a suitable font selected. There are some exceptions, such as INSULAR G, for certain letters of very specific appearance. [AF]
Q: How do I represent text typeset in the Old/Early Cyrillic alphabet?
Old or Early Cyrillic, also sometimes called “Old Slavonic Cyrillic”, is a style of Cyrillic used before 1708. It should not be confused with the Glagolitic alphabet, which is separately encoded in Unicode.
Unicode considers Old Cyrillic to be a font style of Cyrillic, encoded using regular Cyrillic script characters with a suitable font selected. Unicode has separately encoded a small number of Old Cyrillic letters whose appearance is very different from that of their modern Cyrillic equivalents, such as U+A657 ꙗ CYRILLIC SMALL LETTER IOTIFIED A and U+A64B ꙋ CYRILLIC SMALL LETTER MONOGRAPH UK. [BY]