Surrogates and noncharacters
haberg-1 at telia.com
Tue May 12 08:56:04 CDT 2015
> On 12 May 2015, at 15:45, Philippe Verdy <verdy_p at wanadoo.fr> wrote:
> 2015-05-11 23:53 GMT+02:00 Hans Aberg <haberg-1 at telia.com>:
>> It is perfectly fine to consider the Unicode code points as abstract integers, with the UTF-32 and UTF-8 encodings translating them into byte sequences in a computer. The code points that conflict with UTF-16 might merely have been declared not in use until UTF-16 had fallen out of use, replaced by UTF-8 and UTF-32.
> The deprecation of UTF-16 and UTF-32 as encoding *schemes* ("charsets" in MIME) is already very advanced.
UTF-32 is usable for internal use in programs.
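
To illustrate why a fixed-width form is convenient inside a program, here is a minimal Python sketch. The helper name utf32_code_point_at is made up for this example, and it assumes a little-endian UTF-32 buffer without a BOM. Every code point occupies exactly four bytes, so the n-th one is found by plain arithmetic, whereas the UTF-8 form of the same text has to be scanned.

```python
def utf32_code_point_at(buf: bytes, index: int) -> int:
    """Return the code point at position `index` in a UTF-32LE buffer.

    Hypothetical helper for illustration; assumes no BOM and a
    buffer length that is a multiple of 4.
    """
    if index < 0:
        raise IndexError("index out of range")
    start = 4 * index
    unit = buf[start:start + 4]
    if len(unit) != 4:
        raise IndexError("index out of range")
    return int.from_bytes(unit, "little")


# Four code points whose UTF-8 forms take 1, 2, 3 and 4 bytes.
text = "a\u00e9\u20ac\U0001F600"
utf32 = text.encode("utf-32-le")  # 4 bytes per code point
utf8 = text.encode("utf-8")       # variable width

print(len(utf32), len(utf8))
print(hex(utf32_code_point_at(utf32, 3)))
```

The price is size: the UTF-32 buffer is always four bytes per code point, which is one reason it tends to stay internal rather than being used for interchange.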
That is legacy, which may remain for a long time. For example, C/C++ trigraphs are only being removed now, having long been just a bother for compiler implementers. Java is very old, designed around 32-bit programming with limits on function code size, a limitation of pre-PPC CPUs that went out of use in the early 1990s.
> UTF-8 will also remain for a long time as the preferred internal encoding for Python and PHP (even if Python also introduced a 16-bit native datatype).
> In all cases, programming languages are not based on any Unicode encoding form but on more or less opaque streams of code units, using datatypes that are not constrained by Unicode. Their "character" or "byte" datatype is also used for binary I/O and for converting various binary structures, including executable code. Moreover, this datatype is not necessarily 8-bit: it may be larger, and not even a multiple of 8 bits.
Indeed, that is why UTF-8 was invented for use in Unix-like environments.
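
A small Python sketch of the points discussed above, not a definitive treatment: a code point is just an integer, and UTF-8 and UTF-32 are two ways of turning it into bytes. The function names encode_code_point and is_encodable are invented for this example. It relies on the fact that Python 3's strict codecs refuse to encode surrogate code points (U+D800..U+DFFF), while noncharacters such as U+FFFE are encoded like any other code point.

```python
def encode_code_point(cp):
    """Return (UTF-8 bytes, UTF-32BE bytes) for one code point.

    Hypothetical helper for illustration only.
    """
    ch = chr(cp)
    return ch.encode("utf-8"), ch.encode("utf-32-be")


def is_encodable(cp):
    """True if Python's strict UTF-8 codec accepts the code point."""
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


utf8_bytes, utf32_bytes = encode_code_point(0x20AC)  # EURO SIGN
print(utf8_bytes.hex(), utf32_bytes.hex())

print(is_encodable(0xD800))  # surrogate
print(is_encodable(0xFFFE))  # noncharacter

# A byte string is an opaque stream of code units: it can hold
# sequences that are not valid UTF-8 at all.
raw = b"\xff\xfe\x00"
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8, but still a perfectly good byte string")
```

The last part is the Unix point: file names and pipes carry bytes, and UTF-8 is a convention layered on top of them rather than something the byte datatype enforces.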