Compression
Q: I need to compress Unicode data. Is there anything special to consider?
Unicode text is usually stored and exchanged in UTF-8, which is quite compact for ASCII-heavy markup (HTML, XML, JSON, etc.). General-purpose compression algorithms typically compress text to well under one byte per character. Whether to apply a compression algorithm at all depends on the proportion of text relative to non-text data, such as images (which may already be compressed or be compressible), and on the cost of storing and transmitting the data. [MS]
Q: Why not use UTF-8 as a compressed format?
UTF-8 is the default encoding for text on the internet. It is reasonably compact, simple, and universally supported; protocols like HTTP also offer additional compression (e.g., gzip). However, all non-ASCII characters are represented in UTF-8 using more than one byte per character, and all CJK characters require at least three. If an application would benefit from compact or compressed text, then UTF-8 is not optimal. [MS]
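As a quick illustration (the class name and sample characters here are arbitrary, and the source file must be saved and compiled as UTF-8), the following Java sketch prints how many UTF-8 bytes a few characters need: one for ASCII, two for accented Latin letters, three for kana and CJK ideographs.

    import java.nio.charset.StandardCharsets;

    public class Utf8Sizes {
        public static void main(String[] args) {
            // Each sample is a single character; print how many bytes it needs in UTF-8.
            String[] samples = { "A", "é", "あ", "中" };
            for (String s : samples) {
                int bytes = s.getBytes(StandardCharsets.UTF_8).length;
                System.out.println(s + " -> " + bytes + " UTF-8 byte(s)");
            }
        }
    }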
Q: Does Unicode require a specialized compression format?
General-purpose compression algorithms will readily compress Unicode-encoded text. If they are supported as part of a protocol or operating system, for example to create compressed folders, there's usually no need to consider anything else, as these methods will yield quite reasonable levels of compression of Unicode data. One of their main benefits is that they are general-purpose and thus work for both binary and text data.
For large numbers of very short runs of pure Unicode-encoded text, there is a specialized algorithm called SCSU (see the following questions) that may be a better choice, particularly where memory or storage space is very constrained. [AF]
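The trade-off is easy to measure. The following Java sketch (illustrative only; it needs Java 11+ for String.repeat) deflates a longer text and a very short string: the long text shrinks dramatically, while the short string typically ends up larger than its plain UTF-8 form because of the compressor's fixed per-stream overhead, which is the situation SCSU was designed for.

    import java.nio.charset.StandardCharsets;
    import java.util.zip.Deflater;

    public class DeflateOverheadDemo {
        // Deflate the UTF-8 form of a string and return the compressed size in bytes.
        static int deflatedSize(String text) {
            byte[] input = text.getBytes(StandardCharsets.UTF_8);
            Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
            deflater.setInput(input);
            deflater.finish();
            byte[] buffer = new byte[input.length + 64]; // ample for these inputs
            int size = deflater.deflate(buffer);
            deflater.end();
            return size;
        }

        public static void main(String[] args) {
            String longText = "The quick brown fox jumps over the lazy dog. ".repeat(200);
            String shortText = "héllo";
            System.out.println("long text:    " + longText.getBytes(StandardCharsets.UTF_8).length
                    + " UTF-8 bytes -> " + deflatedSize(longText) + " deflated");
            System.out.println("short string: " + shortText.getBytes(StandardCharsets.UTF_8).length
                    + " UTF-8 bytes -> " + deflatedSize(shortText) + " deflated");
        }
    }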
Q: What is SCSU?
The Unicode Consortium has defined a Standard Compression Scheme for Unicode (SCSU), published as Unicode Technical Standard #6. It is a compact encoding that stores most text with one byte per character, and two bytes per character for CJK.
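ICU includes an SCSU implementation. Assuming ICU4J is on the classpath, its com.ibm.icu.text.UnicodeCompressor and UnicodeDecompressor classes can be used to round-trip a string and inspect the compressed size; a minimal sketch:

    import com.ibm.icu.text.UnicodeCompressor;
    import com.ibm.icu.text.UnicodeDecompressor;
    import java.nio.charset.StandardCharsets;

    public class ScsuRoundTrip {
        public static void main(String[] args) {
            String text = "Москва";                                   // six Cyrillic characters
            byte[] scsu = UnicodeCompressor.compress(text);           // encode as SCSU
            String restored = UnicodeDecompressor.decompress(scsu);   // decode back

            System.out.println("UTF-8 bytes: " + text.getBytes(StandardCharsets.UTF_8).length); // 12
            // SCSU typically needs one window-selection byte plus one byte per character here.
            System.out.println("SCSU bytes:  " + scsu.length);
            System.out.println("round trip:  " + text.equals(restored));
        }
    }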
Q: What are the design points for SCSU?
One of the key design points of SCSU was that it should work well with small strings. Starting a general-purpose compressor anew for each string is typically wasteful: the compressor's per-stream overhead can easily exceed the length of a short string. SCSU usually needs no more than one or two bytes of overhead to start up, and often none at all.
Furthermore, the SCSU designers were aiming not so much for the smallest possible output as for making most types of Unicode-encoded data as compact as the equivalent legacy encoding. For example, most simple scripts require a single byte per character in SCSU, and CJK text requires two bytes per character.
Whether these capabilities are important to your overall design is a different matter, but where they are, SCSU is superior to generic algorithms. [AF]
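The per-script behavior described above can be checked with a small sketch (again assuming ICU4J; the sample strings are arbitrary): short strings in simple scripts come out at roughly one SCSU byte per character plus at most a couple of bytes of overhead, and CJK at about two bytes per character.

    import com.ibm.icu.text.UnicodeCompressor;
    import java.nio.charset.StandardCharsets;

    public class ScsuPerScript {
        public static void main(String[] args) {
            // Short strings in Latin, Greek, Hiragana, and Han scripts.
            String[] samples = { "hello", "καλημέρα", "こんにちは", "中文文本" };
            for (String s : samples) {
                System.out.printf("%-12s chars=%d  UTF-8=%d bytes  SCSU=%d bytes%n",
                        s,
                        s.codePointCount(0, s.length()),
                        s.getBytes(StandardCharsets.UTF_8).length,
                        UnicodeCompressor.compress(s).length);
            }
        }
    }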
Q: What about compressing longer texts?
The best way to compress long strings of Unicode-encoded text is via general-purpose compression, which is an option in HTTP and other protocols. Some compression algorithms are sensitive to the input encoding, and using SCSU first may help to minimize the resulting size; other algorithms give nearly identical results no matter which encoding form is used. For details see Unicode Technical Note #14 “A Survey of Unicode Compression” and “Unicode Compression: Does Size Really Matter?”.
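The effect of an SCSU pre-pass is easy to measure for your own data. The sketch below (assuming ICU4J for SCSU and java.util.zip for DEFLATE; the sample text is just a placeholder) deflates the same text once from its UTF-8 form and once from its SCSU form and prints both sizes.

    import com.ibm.icu.text.UnicodeCompressor;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.Deflater;

    public class ScsuThenDeflate {
        // Deflate a byte array and return the compressed size in bytes.
        static int deflatedSize(byte[] input) {
            Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
            deflater.setInput(input);
            deflater.finish();
            byte[] buffer = new byte[input.length + 64]; // ample for compressible input
            int size = deflater.deflate(buffer);
            deflater.end();
            return size;
        }

        public static void main(String[] args) {
            // Placeholder text; substitute your own data to get meaningful numbers.
            String text = "Это пример довольно длинного русского текста. ".repeat(100);
            System.out.println("UTF-8 then DEFLATE: "
                    + deflatedSize(text.getBytes(StandardCharsets.UTF_8)) + " bytes");
            System.out.println("SCSU then DEFLATE:  "
                    + deflatedSize(UnicodeCompressor.compress(text)) + " bytes");
        }
    }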
Q: Are there disadvantages to using SCSU?
Unlike some other schemes, strings compressed with SCSU cannot be compared byte-for-byte to determine whether their contents are equal. That is because the encoder has a choice of compression strategies, and different encoders may make different choices for the same string. While you can compare strings for equality if they were compressed by the same encoder, the binary sort order of the compressed strings will not, in the general case, match the binary sort order of the original strings. [AF]
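In practice this means equality checks should be done on the decompressed text rather than on the SCSU bytes. A minimal sketch, again assuming ICU4J:

    import com.ibm.icu.text.UnicodeCompressor;
    import com.ibm.icu.text.UnicodeDecompressor;
    import java.util.Arrays;

    public class ScsuEquality {
        // Unreliable in general: different encoders may produce different bytes
        // for the same text, so unequal bytes do not imply unequal contents.
        static boolean sameBytes(byte[] a, byte[] b) {
            return Arrays.equals(a, b);
        }

        // Reliable: decompress both sides and compare the resulting strings.
        static boolean sameContents(byte[] a, byte[] b) {
            return UnicodeDecompressor.decompress(a)
                    .equals(UnicodeDecompressor.decompress(b));
        }

        public static void main(String[] args) {
            byte[] a = UnicodeCompressor.compress("Київ");
            byte[] b = UnicodeCompressor.compress("Київ");
            System.out.println("bytes equal: " + sameBytes(a, b)
                    + ", contents equal: " + sameContents(a, b));
        }
    }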
Q: Are there security concerns with SCSU?
Because identical strings can have different compressed representations, filtering compressed strings for unsafe content can fail. [AF]
On the web, where encoding declarations are often incorrect, the text encoding is often detected heuristically. Encodings such as SCSU and the obsolete UTF-7, which use bytes 0x20..0x7E to encode non-ASCII characters, have been used to inject malicious code. Therefore, these encodings must not be used in web documents (W3C Choosing & applying a character encoding, HTML5 document character encoding). [MS]