Implementing SMP on a UTF-16 OS

Marcel Schneider charupdate at
Thu Aug 13 10:07:50 CDT 2015

[Given the bad news we got, I kept this in my drafts folder from 12 Aug 11:12 on.]


On 11 Aug 2015, at 23:18, Richard Wordingham wrote [I've replaced < > with ‹ ›, as I've already had some text disappear and am not sure whether, once <> has been converted, a second conversion won't happen]:

> On Tue, 11 Aug 2015 21:27:27 +0200 (CEST)
> Marcel Schneider  wrote:
> > Iʼve tried to just remove the parentheses and let the string. This
> > was compiled, but the keyboard test showed that in the keyboard
> > driver DLL, UTF-16 strings with SMP characters arenʼt handled as
> > such. Each surrogate code unit is considered as a single character
> > even when itʼs followed by a trailing one. Only the code unit
> > corresponding to the shift state (modification number) is taken, no
> > matter if itʼs only a surrogate and the other half comes next.
> This is exactly what one should expect. The data is an array of
> UTF-16 code units rather than a UTF-16 string. Moreover, it was
> probably written as UCS-2. I believe it is the application that has
> the job of stitching the surrogate pairs together.
> > Is this the reason why a Unicode character cannot be represented
> > alternatively as a 32 bit integer on Windows?
> They are, from time to time. There's a Windows message that delivers a
> supplementary character rather than a UTF-16 code unit, and fairly obviously
> they have to be handled as such when performing font lookups. I've a
> suspicion that this message hit an interoperability problem. A program
> that can handle pairs of surrogates but predates the message will not
> work with the more recent message. Therefore using the message type is
> deferred until applications can handle it. Therefore applications don't
> need to handle it, and don't. Therefore the message type doesn't get
> used.
> > Being UTF-16, the OS
> > could handle a complete surrogates pair in one single 32 bit integer.
> > Couldn't this be performed on driver level by modifying a program and
> > updating this when the driver is installed?
> You're really talking about a parallel set of routines. I suspect the
> answer is that Microsoft don't want to work on extending a primitive
> keyboarding system when TSF is available.
> You want to use dead keys. Why? Is it not that they are the only
> mechanism you have experience of.

Yes, along with the allocation table and the ligatures (and modifier key mapping). Dead keys are the only way I found in the driver source to easily input precomposed characters and to work out a Compose functionality. Marc Durdin told us that for most languages, dead keys are not the best way for input. However, we're accustomed to them. About Compose, I found out that preceding diacritics are the only way to efficiently input precomposed letters with multiple diacritics. When we use combining diacritics, the problem is where to place all the diacritics on a backwards compatible layout. The Compose key idea is to use punctuation keys to input diacritics. Basically we need to hit Compose only once, while generating combining marks out of punctuation needs at least one differentiating keystroke for each one. Given the limited number of keys, we can scarcely have more than one special dead key like Compose in the Base shift state. And as diacritical marks are so numerous that all keyboard punctuation together is not sufficient, we need sequences of punctuation for a number of less common diacritics. This creates the need for a triggering keystroke at the end. Most characters are therefore best input when diacritics come before the triggering letter. But that's my experience only; I wonder how it works on TSF.
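
To make the keystroke order concrete, here is a minimal sketch in Python. It only models the sequence "Compose, one or more punctuation keys, then the triggering letter"; it is not how a keyboard driver DLL stores dead keys. The names COMPOSE, MARKS and compose_input, and the choice of punctuation for each diacritic, are assumptions made up for this illustration.

```python
import unicodedata

# Hypothetical key names and punctuation-to-diacritic choices (assumptions).
COMPOSE = "<Compose>"
MARKS = {
    "^": "\u0302",  # combining circumflex accent
    "'": "\u0301",  # combining acute accent
    ".": "\u0323",  # combining dot below
}

def compose_input(keys):
    """Turn a list of keystrokes into text.

    After COMPOSE, punctuation keys found in MARKS are collected as
    diacritics; the first other key is the triggering letter, which is
    emitted together with the collected marks, precomposed where possible.
    """
    out = []
    pending = None  # None: normal typing; list: marks collected so far
    for key in keys:
        if key == COMPOSE:
            pending = []
        elif pending is not None and key in MARKS:
            pending.append(MARKS[key])
        elif pending is not None:
            # Triggering letter: attach the marks and let NFC precompose.
            out.append(unicodedata.normalize("NFC", key + "".join(pending)))
            pending = None
        else:
            out.append(key)
    return "".join(out)

print(compose_input([COMPOSE, "^", ".", "e"]))
```

Compose is hit once, each diacritic costs one punctuation keystroke, and the letter at the end closes the sequence.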

> Better systems can be built, in which one sees what one is doing.

I read that on Mac OS X, dead key input, and the Compose functionality built on it, are accompanied by visual feedback, which shows what characters have already been typed.

> Is it not much better to type 'e' and then a circumflex, and see the 'e'
> and then the 'e' with a circumflex? 

Yes, in fact the precomposed characters have been legacy characters since the beginning of Unicode. The most up-to-date way to input diacriticized characters is to use combining diacritical marks. This directly produces the string that the canonical decomposition algorithms generate. However, on the internet, AFAIK, precomposed characters must be used for a web page to pass W3C validation.
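
A small check of this, as a sketch using Python's standard unicodedata module (the variable names are mine, only for illustration). One caveat it shows: the typed sequence equals the canonical decomposition only when multiple marks are typed in canonical order; otherwise normalization reorders them.

```python
import unicodedata

precomposed = "\u00ea"   # LATIN SMALL LETTER E WITH CIRCUMFLEX
typed = "e\u0302"        # 'e' followed by COMBINING CIRCUMFLEX ACCENT

# Typing the base letter and then the mark gives the NFD form directly.
print(unicodedata.normalize("NFD", precomposed) == typed)
# NFC goes the other way and yields the precomposed character.
print(unicodedata.normalize("NFC", typed) == precomposed)

# With two marks, typing order matters: circumflex then dot below is
# reordered by normalization, because dot below sorts first.
typed_two = "e\u0302\u0323"
nfd_two = unicodedata.normalize("NFD", "\u1ec7")  # e, dot below, circumflex
print(typed_two == nfd_two)
print(unicodedata.normalize("NFD", typed_two) == nfd_two)
```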

> Dead keys are an imitation of a limitation of typewriter technology. 
> If I was typing cuneiform, I'd much rather type 'bi‹COMMIT›' and see 
> the growing sequence 'b', 'bi', '‹CUNEIFORM SIGN BI›' as I typed.
> (What you have for a ‹COMMIT› key is your choice.) TSF lets one do this.
> A simple extension of the keyboard definition DLLs generated by MSKLC
> does not. What you should be pressing for is a usable tutorial on how
> to do this in TSF.

Agreed. I'll look for one. Marc does everything in TSF. But recently he shared how hard it was at the beginning, and over 15 years. Now he has it running, and when we need TSF, let's consider using his software.

> > If yes, we must modify the interface so that keyboard driver DLLs are
> > really read in UTF-16. And/or we must find another compiler. 
> > 
> > Must the Windows driver be compiled by a Microsoft compiler?
> The compiler is not the issue. The point is that the 16-bit code
> exists, and programs that use the 16-bit API exist. Language upgrades
> may make supplementary characters easier to use in programs, but that
> is all. They don't change existing binary interfaces.

Indeed. And if it made sense to use compilers other than those shipping with the WDK, Max would have told us in this thread. So best practice is to stick with the original development environment. Or to use TSF.
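
For the record, here is what "stitching the surrogate pairs together" amounts to on the application side, as a minimal sketch in Python rather than driver code. The function names to_surrogates and stitch are made up for this illustration; the arithmetic is the standard UTF-16 one.

```python
def to_surrogates(cp):
    """Split a supplementary code point (U+10000..U+10FFFF) into two code units."""
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

def stitch(units):
    """Combine an array of 16-bit code units into a list of code points.

    A lead surrogate followed by a trail surrogate becomes one code point;
    anything else, including a lone surrogate, is passed through unchanged.
    """
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(u)
            i += 1
    return out

# U+12000 CUNEIFORM SIGN A arrives as two separate 16-bit units.
lead, trail = to_surrogates(0x12000)
print(hex(lead), hex(trail))
print([hex(cp) for cp in stitch([0x0062, lead, trail])])
```

A table that holds one 16-bit unit per shift state has room for only one of the two halves, which matches what the keyboard test showed.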

