Unicode education in Schools

Eli Zaretskii via Unicode unicode at unicode.org
Sat Aug 26 13:20:45 CDT 2017


> Date: Sat, 26 Aug 2017 18:52:03 +0100
> From: Richard Wordingham via Unicode <unicode at unicode.org>
> 
> > > It shouldn't.  UTF-16 works just like UTF-8, except that the code
> > > units are bigger.  
> 
> > Not really, since UTF-8 doesn't have surrogates.
> 
> It has 115 surrogates, thoroughly oppressed by the UTC - there are 64
> trailing surrogates 0x80 to 0xBF, 51 leading surrogates 0xC2 to 0xF4,
> and 0xC0, 0xC1 and 0xF5 to 0xFF suffer the indignity of being the 13
> uncodepoints - not even allowed in Unicode 8-bit strings. Emacs is one
> of the few systems that comes close to allowing them the dignity of
> integer values of their own - 3FFF80₁₆ to 3FFFFF₁₆ for the code units
> 0x80 to 0xFF.
> 
> I well remember when Unicode regular expressions were required to
> allow one to search for lone surrogates, but there was no such concept
> of looking for isolated ill-associated bytes in Unicode 8-bit strings.
> 
> The point is that if one understands how UTF-8 works, UTF-16 is a
> system that works using a subset of the same principles, and one should
> therefore understand how UTF-16 works, until one comes to the weird and
> dubious concept of surrogate points having properties.  I believe the
> latter concept is of value only in code that lacks the concept of
> gibberish.  In UTF-8, the distinction between code unit value and
> Unicode scalar value is very clear; in UTF-16, it is muddied by the
> concept of 'codepoint'.
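
The byte counts quoted above can be checked mechanically.  What follows
is only a minimal sketch in Python: the function names are made up for
illustration, and the raw-byte mapping simply follows the Emacs numbers
given in the quoted message, not any actual Emacs API.

```python
from collections import Counter


def classify_utf8_byte(b):
    """Classify a byte value 0..255 by the role it can play in UTF-8."""
    if b <= 0x7F:
        return "ascii"       # single-byte sequences
    if b <= 0xBF:
        return "trailing"    # continuation bytes 0x80..0xBF
    if b in (0xC0, 0xC1) or b >= 0xF5:
        return "invalid"     # never appear in well-formed UTF-8
    return "leading"         # lead bytes 0xC2..0xF4


def emacs_raw_byte_value(b):
    """Hypothetical helper: map a raw byte 0x80..0xFF to 3FFF80..3FFFFF,
    following the range given in the quoted message."""
    if not 0x80 <= b <= 0xFF:
        raise ValueError("only bytes 0x80..0xFF have a raw-byte value")
    return 0x3FFF00 + b


counts = Counter(classify_utf8_byte(b) for b in range(256))
print(counts["trailing"], counts["leading"], counts["invalid"])
```

The three classes add up to the 128 non-ASCII byte values, which is
where the figures 64, 51 and 13 come from.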

We are miscommunicating.  My point was that programming for MS-Windows
needs a good understanding of what the UTF-16 surrogates are, and in
what MS-Windows APIs/library functions they can and cannot be used.
Without this understanding, one cannot figure out why the likes of
iswspace and iswupper only support the BMP, and what APIs to use to
lift this limitation.  The same applies to the APIs used to display
Unicode text.
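
To illustrate why a function that classifies one 16-bit code unit at a
time cannot see past the BMP, here is a minimal sketch.  It is written
in Python rather than against the MS-Windows APIs, and the helper names
are invented for the example.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def is_surrogate(unit):
    """True if a 16-bit code unit is a leading or trailing surrogate."""
    return 0xD800 <= unit <= 0xDFFF


def combine_surrogates(lead, trail):
    """Combine a leading/trailing surrogate pair into one scalar value."""
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)


# U+0041 is in the BMP; U+10400 (DESERET CAPITAL LETTER LONG I) is not.
for ch in ("A", "\U00010400"):
    units = utf16_code_units(ch)
    print([hex(u) for u in units], [is_surrogate(u) for u in units])
```

The BMP character is a single code unit, so a per-unit classifier can
handle it.  The supplementary character arrives as two surrogates, and
neither of them alone says anything about case or whitespace; only the
combined scalar value does.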

If UTF-16 is taught without these details, programmers will feel lost
when they run into these complications.

