Implementing SMP on a UTF-16 OS
Marcel Schneider
charupdate at orange.fr
Tue Aug 11 14:27:27 CDT 2015
On 10 Aug 2015, at 21:45, Max Truxa wrote:
> from what I can see in the short piece of code you posted, it looks
> like you are trying to somehow "group" the surrogate pairs (which does
> not make any sense to me).
> Correct syntax would be: [...] 0xD835, 0xDCEA, 0xD835, 0xDCD0, [...]
On 10 Aug 2015, at 23:06, Richard Wordingham wrote:
> On Mon, 10 Aug 2015 22:53:11 +0200 (CEST)
> Marcel Schneider wrote:
>
> > On Mon, 10 Aug 2015, at 22:33, Richard Wordingham wrote:
>
> > > Non-BMP characters must be entered as 'ligatures'.
>
> > This is clearly a Unicode implementation problem. C and C++ should be
> > standardized for handling of UTF-16. IMO we cannot consider that
> > Windows supports UTF-16 for internal use, if it does not support
> > surrogates pairs except with workarounds using ligatures.
>
> Perhaps this is why Windows offers a new method of keyboard
> mapping, via the Text Services Framework (TSF).
>
> > I may be wrong, but that's how I see the problem now.
>
> I think you're not looking hard enough.
Iʼve tried simply removing the parentheses and leaving the string as it was. This compiled, but the keyboard test showed that the keyboard driver DLL does not handle UTF-16 strings containing SMP characters as such. Each surrogate code unit is treated as a single character, even when it is followed by a trailing one. Only the code unit at the position corresponding to the shift state (modification number) is taken, even if it is only one surrogate and the other half comes next.
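To make this concrete, here is a minimal sketch in C of how a code point beyond the BMP splits into two UTF-16 code units. The function name encode_utf16 and the fixed-width types are assumptions made for illustration; they are not part of the WDK keyboard headers.

```c
#include <stdint.h>

/* Hypothetical helper (not a WDK function): encode one Unicode code point
 * as UTF-16. Writes 1 or 2 code units to out and returns how many were
 * written, or 0 if cp is not a valid scalar value. */
static int encode_utf16(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFFu || (cp >= 0xD800u && cp <= 0xDFFFu))
        return 0;                      /* out of range or a lone surrogate */
    if (cp < 0x10000u) {
        out[0] = (uint16_t)cp;         /* BMP: one code unit is enough */
        return 1;
    }
    cp -= 0x10000u;                    /* 20 significant bits remain */
    out[0] = (uint16_t)(0xD800u + (cp >> 10));    /* leading surrogate  */
    out[1] = (uint16_t)(0xDC00u + (cp & 0x3FFu)); /* trailing surrogate */
    return 2;
}
```

For U+1D4EA this yields 0xD835 followed by 0xDCEA, the same two values as in Max's corrected syntax. A table slot that holds a single 16-bit unit can carry only one of them.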
Windows can handle 32-bit code units; I found evidence of this in C:\WinDDK\7600.16385.1\inc\api\functiondiscoverykeys.h. So I tried this in the driver source:
{'A' /*T10 D01*/ ,0x01 ,'a' ,'A' ,NONE ,0xd835dcea ,0xd835dcd0 ,0x00e6 ,0x00c6 ,NONE ,NONE }, // ,0x0061 ,0x0041
But the compiler returned:
warning C4305: 'initializing' : truncation from 'unsigned int' to 'WCHAR'
and:
error C2220: warning treated as error - no 'object' file generated
I understand that the compiler correctly read the first of the 32-bit integers, but since it expected a WCHAR here, it would have had to drop 16 bits, and with warnings treated as errors it wouldnʼt go any further.
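What the C4305 warning guards against can be reproduced outside the driver. The sketch below assumes a 16-bit unsigned type as a stand-in for WCHAR (the names wchar16 and truncate_to_wchar16 are hypothetical); converting a 32-bit value to it keeps only the low 16 bits.

```c
#include <stdint.h>

/* Stand-in for the 16-bit WCHAR of the Windows headers (assumption). */
typedef uint16_t wchar16;

/* Conversion to an unsigned 16-bit type is defined as reduction modulo
 * 65536, so only the low half of the packed value survives. */
static wchar16 truncate_to_wchar16(uint32_t packed)
{
    return (wchar16)packed;
}
```

With 0xD835DCEA the result would be 0xDCEA, that is, the trailing surrogate alone; the leading surrogate 0xD835 is lost.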
On 11 Aug 2015, at 8:27, Max Truxa wrote [typo corrected following your next e-mail]:
> On Aug 10, 2015 10:53 PM, "Marcel Schneider" wrote:
> >
> > This is clearly a Unicode implementation problem. C and C++ should be standardized for handling of UTF-16. IMO we cannot consider that Windows supports UTF-16 for internal use, if it does not support surrogates pairs except with workarounds using ligatures.
> C and C++ *are* "standardized for handling of UTF-16"... and UTF-8... and UTF-32.
> If you are interested in this topic just search for "C++ Unicode string literals" and "C++ Unicode character literals" which are standardized since C11/C++11 (with the exception of UTF-8 character literals which will follow in C++17; don't know about C though).
> The reason you won't be able to easily use these features is because the compiler shipping with the WDK is still only supporting C89/C90. And sadly for us driver developers Microsoft will not change this.
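As a minimal sketch of the literals Max mentions, and assuming a C11 compiler whose u"..." literals are encoded as UTF-16 (the standard signals this with the __STDC_UTF_16__ macro), a non-BMP character can be written with a \U escape and the compiler emits the surrogate pair itself. The array name is made up for the example, and this is not expected to build with the C89/C90 compiler shipped in the WDK.

```c
#include <stddef.h>
#include <uchar.h>   /* char16_t (C11) */

/* U+1D4EA MATHEMATICAL BOLD SCRIPT SMALL A as a UTF-16 string literal. */
static const char16_t bold_script_small_a[] = u"\U0001D4EA";

/* Code units in the literal, not counting the terminating zero. */
static const size_t bold_script_small_a_len =
    sizeof bold_script_small_a / sizeof bold_script_small_a[0] - 1;
```

Under that assumption the array holds 0xD835, 0xDCEA and a terminating zero, so the single character occupies two code units.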
Is this the reason why a Unicode character cannot alternatively be represented as a 32-bit integer on Windows? Since the OS uses UTF-16, it could handle a complete surrogate pair in one single 32-bit integer. Couldn't this be done at driver level, by modifying a program and updating it when the driver is installed?
If so, the interface must be modified so that keyboard driver DLLs are really read as UTF-16. And/or we must find another compiler.
Must the Windows driver be compiled by a Microsoft compiler?
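The idea of carrying a whole surrogate pair in one 32-bit integer, as in the 0xd835dcea attempt, can be sketched like this. The function names are hypothetical, and nothing here claims that the Windows keyboard layout interface accepts such values; the sketch only shows that the packed form converts losslessly to the two code units and to the code point.

```c
#include <stdint.h>

/* Split a packed pair such as 0xD835DCEA into its two code units.
 * Returns 1 on success, 0 if the halves are not a leading surrogate
 * followed by a trailing surrogate. */
static int unpack_pair(uint32_t packed, uint16_t *lead, uint16_t *trail)
{
    uint16_t hi = (uint16_t)(packed >> 16);
    uint16_t lo = (uint16_t)(packed & 0xFFFFu);
    if (hi < 0xD800u || hi > 0xDBFFu || lo < 0xDC00u || lo > 0xDFFFu)
        return 0;
    *lead = hi;
    *trail = lo;
    return 1;
}

/* Combine a valid surrogate pair into the code point it encodes. */
static uint32_t pair_to_code_point(uint16_t lead, uint16_t trail)
{
    return 0x10000u + (((uint32_t)lead - 0xD800u) << 10)
                    + ((uint32_t)trail - 0xDC00u);
}
```

Unpacking 0xD835DCEA gives 0xD835 and 0xDCEA, which combine to U+1D4EA. A packed value whose halves are ordinary BMP characters, such as 0x00E600C6, is rejected.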
Meanwhile, the only workaround I see for getting SMP characters into the deadtrans list is to program each of them as two entries, so that a user must type, for example, Compose, &, &, &, A, 1, and then Compose, &, &, &, A, 2, to get one bold script letter (normal script takes two ampersands, and ‘with curl’ one ampersand). (Instead of 1 and 2 we could also choose l for leading and t for trailing.) Normally a user should be able to get this letter with five keystrokes, not ten.
In Word we already have an autocorrect for script letters, so we should add another series for bold script (which is bolder than ‘bold’ ‘script’). But since that works only in Office, not in Notepad or elsewhere, a keyboard driver or TSF-based solution is preferable, also because typing \ s c r i p t a Space Backspace is already ten keystrokes, too! (A trailing backslash would save one.)
Best regards,
Marcel