Unicode education in Schools

Richard Wordingham via Unicode unicode at unicode.org
Sat Aug 26 12:52:03 CDT 2017

On Sat, 26 Aug 2017 18:55:25 +0300
Eli Zaretskii via Unicode <unicode at unicode.org> wrote:

> > Date: Sat, 26 Aug 2017 16:09:33 +0100
> > From: Richard Wordingham via Unicode <unicode at unicode.org>

> > It shouldn't.  UTF-16 works just like UTF-8, except that the code
> > units are bigger.  

> Not really, since UTF-8 doesn't have surrogates.

It has 115 surrogates, thoroughly oppressed by the UTC - there are 64
trailing surrogates 0x80 to 0xBF, 51 leading surrogates 0xC2 to 0xF4,
and 0xC0, 0xC1 and 0xF5 to 0xFF suffer the indignity of being the 13
uncodepoints - not even allowed in Unicode 8-bit strings. Emacs is one
of the few systems that comes close to allowing them the dignity of
integer values of their own - 3FFF80₁₆ to 3FFFFF₁₆ for the code units
0x80 to 0xFF.
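
The byte counts above can be checked mechanically. The following is a
minimal sketch in Python, not anything from the Unicode Standard or from
Emacs itself; the function names classify_utf8_byte and
emacs_raw_byte_char are hypothetical, and the Emacs mapping simply
restates the 3FFF80 to 3FFFFF range quoted above.

```python
def classify_utf8_byte(b: int) -> str:
    """Classify a byte value by the role it can play in well-formed UTF-8."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte value")
    if b <= 0x7F:
        return "ascii"      # a complete one-byte sequence
    if b <= 0xBF:
        return "trailing"   # 0x80..0xBF, continuation byte
    if b in (0xC0, 0xC1) or b >= 0xF5:
        return "invalid"    # never occurs in well-formed UTF-8
    return "leading"        # 0xC2..0xF4, starts a multi-byte sequence


# Tally all 256 byte values by class.
counts = {}
for b in range(256):
    kind = classify_utf8_byte(b)
    counts[kind] = counts.get(kind, 0) + 1


def emacs_raw_byte_char(b: int) -> int:
    """Integer that the range quoted above assigns to a raw byte (assumed mapping)."""
    if not 0x80 <= b <= 0xFF:
        raise ValueError("only bytes 0x80..0xFF have a raw-byte value")
    return 0x3FFF00 + b


print(counts)
print(hex(emacs_raw_byte_char(0x80)), hex(emacs_raw_byte_char(0xFF)))
```

The tally gives 64 trailing, 51 leading and 13 never-valid byte values,
i.e. the 115 "surrogates" plus the 13 outcasts.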

I well remember when Unicode regular expressions were required to
allow one to search for lone surrogates, but there was no such concept
of looking for isolated ill-associated bytes in Unicode 8-bit strings.
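
For illustration only, here is one way such a search for isolated bytes
could look today. This is a sketch that leans on Python's
"surrogateescape" error handler, which maps each undecodable byte to a
lone surrogate in the range U+DC80 to U+DCFF; it is not a mechanism
defined by Unicode regular expressions, and the helper name stray_bytes
is hypothetical.

```python
import re


def stray_bytes(data: bytes) -> list[int]:
    """Return the byte values in data that are not part of well-formed UTF-8."""
    # Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
    text = data.decode("utf-8", errors="surrogateescape")
    # Search for those lone surrogates and map them back to byte values.
    return [ord(ch) - 0xDC00 for ch in re.findall("[\udc80-\udcff]", text)]


print([hex(b) for b in stray_bytes(b"ab\xffcd")])
```

The isolated byte is found by looking for a lone surrogate, which is
rather the point being made about the two encodings.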

The point is that if one understands how UTF-8 works, UTF-16 is a
system that works using a subset of the same principles, and one should
therefore understand how UTF-16 works, until one comes to the weird and
dubious concept of surrogate points having properties.  I believe the
latter concept is of value only in code that lacks the concept of
gibberish.  In UTF-8, the distinction between code unit value and
Unicode scalar value is very clear; in UTF-16, it is muddied by the
concept of 'codepoint'.
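
To make the parallel concrete, the sketch below decodes the same
supplementary character from its UTF-8 code units and from its UTF-16
code units. It is a minimal illustration with no validation of
overlong or out-of-range forms; the helper names are hypothetical, and
U+1F600 is just an arbitrary example character.

```python
def utf16_units(s: str) -> list[int]:
    """Return the 16-bit code units of s in UTF-16."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def decode_utf16_pair(lead: int, trail: int) -> int:
    """Combine a leading and a trailing 16-bit code unit into a scalar value."""
    if not (0xD800 <= lead <= 0xDBFF and 0xDC00 <= trail <= 0xDFFF):
        raise ValueError("not a leading/trailing pair")
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)


def decode_utf8_4(b0: int, b1: int, b2: int, b3: int) -> int:
    """Combine one leading and three trailing 8-bit code units (no validation)."""
    return ((b0 & 0x07) << 18) | ((b1 & 0x3F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)


ch = "\U0001F600"
u8 = list(ch.encode("utf-8"))   # four 8-bit code units
u16 = utf16_units(ch)           # two 16-bit code units

print([hex(u) for u in u8], hex(decode_utf8_4(*u8)))
print([hex(u) for u in u16], hex(decode_utf16_pair(*u16)))
```

In both cases a leading unit announces the sequence and the trailing
units supply payload bits; only the unit width and the bit layout differ.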

