Counting Codepoints

Richard Wordingham richard.wordingham at ntlworld.com
Tue Oct 13 14:04:49 CDT 2015


On Tue, 13 Oct 2015 12:17:43 +0200
Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> 2015-10-13 8:36 GMT+02:00 Richard Wordingham <
> richard.wordingham at ntlworld.com>:
 
> > For
> > example, a MSKLC keyboard will deliver a supplementary character in
> > two WM_CHAR messages, one for the high surrogate and one for the low
> > surrogate.

> I have not tested the actual behavior in 64-bit versions of Windows:
> does the 64-bit version of the API still return two WM_CHAR messages
> rather than a single one, if the message field has been extended to
> 64 bits?

In Unicode applications, WM_CHAR still delivers one UTF-16 code
unit.  I suspect it delivers just one byte in multibyte 'ANSI'
encodings.  There is a WM_UNICHAR message that delivers whole Unicode
characters, but reportedly Microsoft does not use it.
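
To illustrate what an application receiving one 16-bit unit at a time
has to do, here is a minimal sketch in Python of reassembling code
points from a stream of UTF-16 code units.  The function names
combine_surrogates and code_points are hypothetical, the input list
merely stands in for the values carried by successive WM_CHAR
messages, and passing unpaired surrogates through unchanged is only
one possible policy.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 surrogate pair into a single code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def code_points(units):
    """Yield code points from an iterable of 16-bit code units.

    Assumption for this sketch: an unpaired surrogate is passed
    through unchanged rather than replaced or rejected.
    """
    pending = None  # high surrogate waiting for its low surrogate
    for unit in units:
        if pending is not None:
            if 0xDC00 <= unit <= 0xDFFF:
                yield combine_surrogates(pending, unit)
                pending = None
                continue
            yield pending  # the high surrogate was unpaired
            pending = None
        if 0xD800 <= unit <= 0xDBFF:
            pending = unit  # hold it until the next unit arrives
        else:
            yield unit
    if pending is not None:
        yield pending  # input ended on an unpaired high surrogate


# 'A', U+1F600 as the pair D83D DE00, then 'B': four units, three code points.
units = [0x41, 0xD83D, 0xDE00, 0x42]
print([hex(cp) for cp in code_points(units)])
```

The code unit count and the code point count differ by one for each
well-formed surrogate pair, which is why the two WM_CHAR messages for
a supplementary character have to be counted as a single character.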

> The actual behavior is also tricky as the basic layouts built with
> MSKLC will have its character data translated "transparently" to
> other "OEM" encodings according to the current input code page of the
> console (using one of the codepage mapping tables installed
> separately): the transcoder will also need to translate the 16-bit
> Unicode input from WM_CHAR messages into the 8-bit input stream used
> by the console, and this translation will need to read both
> surrogates at once before sending any output.

This only applies to 'ANSI' applications.  I am not aware of any ANSI
codepages that contain supplementary characters.  For a Unicode
application, no translation from Unicode occurs.

Richard.

