From charupdate at orange.fr Sun Aug 2 07:26:45 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Sun, 2 Aug 2015 14:26:45 +0200 (CEST)
Subject: Windows 10 release (was: Re: WORD JOINER vs ZWNBSP)
Message-ID: <1584793787.9398.1438518405314.JavaMail.www@wwinf1m23>

On 30 Jul 2015 at 20:56, Doug Ewell wrote:
> Marcel Schneider wrote:
>>> Unfortunately that doesn't work on at least one recent version of
>>> Windows. An unambiguous bug was due to the presence of 0x2060 in the
>>> Ligatures table. This cost me a whole workday to track down, fix,
>>> and verify.

The bug on Windows I encountered at the end of July has now been definitely identified and reproduced. After the ninety-five drivers I have compiled since the bug appeared, I can say this much: the problem is related to the length of the so-called ligatures. When the MSKLC was built, they were limited to four characters on Windows (see the glossary in the MSKLC Help). On my machine the maximal length is 16 characters. The problem is that this limit is not the same on all shift states, and perhaps not on all keys. Roughly, I can put five characters on modification number three, which is normally AltGr, but not on #4 (Shift+AltGr). Attributing the problem to the presence of 0x2060 was a misinterpretation.

[About why five characters: the ellipsis made of three times PERIOD often looks better, or seems to, *and* is part of all fonts, *and* doesn't break when a server enforces Latin-1 even on the UTF-8 pages it serves itself (see last month's thread «UTF-8 display»). The complete sequence is a bracketed ellipsis, for more usefulness in a context of quotation. I wanted the bracketed U+2026 on Ctrl+Alt+Period, and the bracketed three periods on Shift+Ctrl+Alt+Period. Now it's the other way round.]
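The length rule Marcel describes can be sketched abstractly. This is a toy model in Python, not the real kbd.h structures from the WDK; the names NONE, MAX_LIG, pad and lig_len are illustrative stand-ins:

```python
# A minimal model of a WDK-style "ligature" row: one key in one shift state
# expands to up to 16 UTF-16 code units, with unused slots filled by a NONE
# sentinel, as in the driver source quoted in this thread.
NONE = object()   # stand-in for the driver's NONE filler value
MAX_LIG = 16      # the per-row maximum Marcel measured on his machine

def lig_len(row):
    """Number of code units a row actually emits (slots before the first NONE)."""
    n = 0
    for unit in row:
        if unit is NONE:
            break
        n += 1
    return n

def pad(units):
    """Pad an output sequence to a full row with NONE sentinels."""
    return list(units) + [NONE] * (MAX_LIG - len(units))

# The "works" variant: bracketed three periods on modification #3 (5 units),
# bracketed U+2026 HORIZONTAL ELLIPSIS on modification #4 (3 units).
mod3 = pad(['[', '.', '.', '.', ']'])
mod4 = pad(['[', '\u2026', ']'])

print(lig_len(mod3), lig_len(mod4))  # 5 3
```

The observed bug then amounts to: `lig_len` values that are accepted on modification #3 are rejected on #4, even though both are well under the 16-slot row size.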
The following source lines show the sole difference between a driver that bugs and a driver that works fine:

Bugs:

{VK_OEM_PERIOD /*T33 B08*/ ,3 ,'[' ,0x2026 ,']' ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE }, //
{VK_OEM_PERIOD /*T33 B08*/ ,4 ,'[' ,'.' ,'.' ,'.' ,']' ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE }, //

Works:

{VK_OEM_PERIOD /*T33 B08*/ ,3 ,'[' ,'.' ,'.' ,'.' ,']' ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE }, //
{VK_OEM_PERIOD /*T33 B08*/ ,4 ,'[' ,0x2026 ,']' ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE }, //

I was about to catalogue the cases depending on shift states (and possibly key scan codes and so on), but I encountered so many keyboard bugs (a Windows key press registered while the key was not pressed; arrow keys disabled; Backspace disabled; and so on) that I decided not to waste more than the past week on that problem. BTW, I have become well aware that the so-called Windows 7 Starter is not Windows Seven but a sort of restyled Windows Vista. That's why its version number is 6.1. I think debugging Windows Vista isn't worthwhile, as today we have Windows 10, which as the ultimate Windows version must have fixed all those bugs. I'm hopeful, and I expect people will ♥ it.

>>> The effect of the bug was that Word, Excel, Firefox and Zotero were
>>> unstartable.

Only when the faulty driver is that of the default keyboard at Windows startup. Otherwise the apps aren't blocked, but the buggy keyboard layout is disabled; in Word, that is, not in the built-in Notepad.

>>> As a result, the WORD JOINER cannot be implemented on a driver based
>>> keyboard layout for general use on Windows. By contrast, the ZWNBSP
>>> can.

That's complete nonsense, sorry. Both can be implemented in the driver.

> and:
>> The so-called ligatures, by contrast, must not be constructed with
>> 0x2060.
This, however, was the case for three items:
>>
>> - A justifying no-break space emulation 0x2060 0x0020 0x2060, for use
>> in word processors where the NBSP is not justifying, unlike in
>> desktop publishing and the high-end editing software Philippe Verdy
>> referred to, where U+00A0 is justifying. Its not being justifying in word
>> processing is consistent with the need to use U+00A0 along with
>> punctuation in French, and with the lack of U+202F in many fonts.
>>
>> - A colon with such a justifying no-break space, for use in documents
>> that imitate the usage of at least a part, if not the mainstream, of old-
>> fashioned typography: 0x2060 0x0020 0x2060 0x003a.
>>
>> - A punctuation apostrophe emulation 0x2060 0x0027 0x2060, mapped to
>> Kana + I.

There is a mistake in my e-mail: the curly punctuation apostrophe is emulated using the letter apostrophe. The sequence runs: 0x2060 0x02bc 0x2060. I'm not sure, however, that this is useful, as such sequences are obtainable by autocorrect, where the word joiners are really useful, while in English the letter apostrophe is preferable (whereas other languages can use U+2019 unambiguously).

>>
>> I'm about to test on another Windows Edition. I wonder if there is a
>> real issue or not, as you are suggesting. Nevertheless I believe that
>> no such bugs should occur in any version and edition of Windows.

That remains true, as the versions we're talking about are known to be unstable. But nobody's perfect, and everybody's invited to improve, notably on keyboard layouts, which traditionally are neglected in favour of upper-level tools and high-end programs.

> I created, installed, and activated an MSKLC keyboard with the three WJ
> sequences described above, mapped for convenience to AltGr+Z, AltGr+X,
> and AltGr+C respectively

Thank you again.
Curiously, I hadn't even had the idea; perhaps the missing dead key chaining and some other limitations led me to rely rather on the WDK once I became aware of its existence (despite its mention in the MSKLC glossary) through an explanatory web page.

> (not the Kana key, which I don't have),

I'm using the standard keyboard on my netbook and wouldn't have any Kana key either, but the Windows Driver Kit allows adding it as a modifier and as a toggle. Using Kana as the main 3rd level helps limit the mixing-up of Ctrl+Alt with AltGr. I unmapped the latter and am using Ctrl+Alt in a few cases, like this one.

> and had no trouble opening or using any applications on Windows 7, including
> the four mentioned above (except Zotero, which I don't use). KLC source
> available on request.

Thank you for the proposal. Your test has even brought me to the idea of making a patch of the layout I'm working on, so I took a subset and made it from scratch in MSKLC. That's much safer and easier to install. KanaLock is emulated using SGCaps. Compose could be emulated using other apps. But for a number of non-English languages, which SGCaps is for, CapsLock and the easy input of multiply diacriticized letters are missing.

> I wouldn't have wasted the 15 minutes but for the continuing, tiresome
> rhetoric about Windows bugs.

I'm sorry. As stated above, Windows made me waste not only fifteen minutes, but about fifty hours. And I'm not even talking about all the other cases and my far more than one thousand noted desiderata.

Best regards,

Marcel Schneider
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From richard.wordingham at ntlworld.com Sun Aug 2 16:23:19 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 2 Aug 2015 22:23:19 +0100 Subject: Mongolian Joining Type Message-ID: <20150802222319.25a2dd7e@JRWUBU2> I've been trying to understand the joining type logic that categorises a Mongolian letter as isolated, initial, medial or final, and the consequent effect of free variation selectors. As far as I can tell, it is currently supposed to be controlled by the property joining_type. However, this property appears only to have been non-trivially assigned to the characters of the Mongolian script from Version 6.3.0. How was categorisation assigned before then? I am particularly interested in the intended effects of U+202F NARROW NO-BREAK SPACE and U+180E MONGOLIAN VOWEL SEPARATOR. They seem to presently have a Joining_Type value of Non_Joining, but some things would make more sense to me if they had the value Dual_Joining. I am wondering if their effective value has changed; e.g. previously the definitions for Mongolian characters worked as though they were dual joining, but when matters were formalised they accidentally became non-joining. Richard. 
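The derivation Richard asks about can be sketched in a few lines of Python. This is an assumption-laden sketch: the two data lines only imitate the `code; schematic name; joining type; joining group` layout of the UCD's ArabicShaping.txt rather than quoting any specific version, and the default rule (Transparent for Mn/Me/Cf characters, else Non_Joining) follows UAX #44:

```python
# Sketch of how Joining_Type is derived: characters listed in
# ArabicShaping.txt get R/L/D/C/T explicitly; unlisted characters default
# to T (Transparent) if their General_Category is Mn, Me or Cf, and to
# U (Non_Joining) otherwise.  The sample lines below are illustrative.
import unicodedata

SAMPLE_ARABIC_SHAPING = """
1820; MONGOLIAN A; D; No_Joining_Group
"""

def load_listed(text):
    """Parse 'code; name; type; group' lines into {codepoint: joining_type}."""
    listed = {}
    for line in text.splitlines():
        line = line.split('#', 1)[0].strip()
        if not line:
            continue
        code, _name, jtype, _group = [f.strip() for f in line.split(';')]
        listed[int(code, 16)] = jtype
    return listed

def joining_type(cp, listed):
    if cp in listed:
        return listed[cp]
    if unicodedata.category(chr(cp)) in ('Mn', 'Me', 'Cf'):
        return 'T'
    return 'U'

listed = load_listed(SAMPLE_ARABIC_SHAPING)
print(joining_type(0x1820, listed))  # D (dual joining, per the sample line)
print(joining_type(0x202F, listed))  # U (a space is not Mn/Me/Cf, so Non_Joining)
```

Under this default rule, U+202F NARROW NO-BREAK SPACE (General_Category Zs) comes out Non_Joining unless a UCD version lists it explicitly, which matches the behaviour Richard observes.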
From leob at mailcom.com Sun Aug 2 19:55:38 2015
From: leob at mailcom.com (Leo Broukhis)
Date: Sun, 2 Aug 2015 17:55:38 -0700
Subject: Emoji characters for food allergens
In-Reply-To: <1958555392.26939.1438376320812.JavaMail.www@wwinf1j14>
References: <20150728122416.665a7a7059d7ee80bb4d670165c8327d.e45e67032a.wbe@email03.secureserver.net> <27556497.34395.1438177379567.JavaMail.defaultUser@defaultHost> <55B9A182.2030504@kli.org> <32045577.9821.1438246295044.JavaMail.defaultUser@defaultHost> <1369192884.16904.1438276057084.JavaMail.www@wwinf1h34> <29774584.15201.1438335472835.JavaMail.defaultUser@defaultHost> <1958555392.26939.1438376320812.JavaMail.www@wwinf1j14>
Message-ID:

The discussion widens:
http://tech.slashdot.org/story/15/08/02/2248257/unicode-consortium-looks-at-symbols-for-allergies

From mark at macchiato.com Mon Aug 3 03:39:13 2015
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Mon, 3 Aug 2015 10:39:13 +0200
Subject: Emoji characters for food allergens
In-Reply-To:
References: <20150728122416.665a7a7059d7ee80bb4d670165c8327d.e45e67032a.wbe@email03.secureserver.net> <27556497.34395.1438177379567.JavaMail.defaultUser@defaultHost> <55B9A182.2030504@kli.org> <32045577.9821.1438246295044.JavaMail.defaultUser@defaultHost> <1369192884.16904.1438276057084.JavaMail.www@wwinf1h34> <29774584.15201.1438335472835.JavaMail.defaultUser@defaultHost> <1958555392.26939.1438376320812.JavaMail.www@wwinf1j14>
Message-ID:

BTW, the UTC declined to accept the allergen emoji set proposal. While some of the food items may be acceptable and the emoji subcommittee could re-propose them, there are principled problems with trying to deal with allergens as a set of emoji. So that is off the table.

Mark

*— Il meglio è
l'inimico del bene —*

On Mon, Aug 3, 2015 at 2:55 AM, Leo Broukhis wrote:
> The discussion widens:
>
> http://tech.slashdot.org/story/15/08/02/2248257/unicode-consortium-looks-at-symbols-for-allergies

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mailinglists at ngalt.com Mon Aug 3 03:49:40 2015
From: mailinglists at ngalt.com (Nathan Sharfi)
Date: Mon, 3 Aug 2015 01:49:40 -0700
Subject: Emoji characters for food allergens
In-Reply-To:
References: <20150728122416.665a7a7059d7ee80bb4d670165c8327d.e45e67032a.wbe@email03.secureserver.net> <27556497.34395.1438177379567.JavaMail.defaultUser@defaultHost>
Message-ID: <63A5860F-743C-4D50-8BD9-FF279842FF1E@ngalt.com>

> On Jul 29, 2015, at 7:27 AM, Andrew West wrote:
>
> On 29 July 2015 at 14:42, William_J_G Overington
> wrote:
>>
>> For example, one such character could be used to be placed before a list of
>> emoji characters for food allergens to indicate that a list of dietary
>> needs follows.
>>
>> For example,
>>
>> My dietary need is no gluten no dairy no egg
>>
>> There could be a way to indicate the following.
>>
>> My diet can include soya
>
> There already is, you can write "My diet can include soya".
>
> If you are likely to swell up and die if you eat a peanut (for
> example), you will not want to trust your life to an emoji picture of
> a peanut which could be mistaken for something else or rendered as a
> square box for the recipient. There may be a case to be made for
> encoding symbols for food allergens for labelling purposes, but there
> is no case for encoding such symbols as a form of symbolic language
> for communication of dietary requirements.
>
> Andrew

I've recently tried to closely follow the care tags on my clothes instead of dumping most of them in the cold/cold batch. When I look at the care tags, I squint at the hieroglyphs[1] for five seconds, give up, and then start looking for instructions written in English, that is, useful instructions.
I'd imagine a chef trying to 'read' dietary-needs symbols would find it similarly trying, only with dire consequences for getting it wrong.

I can see why someone might want to communicate their allergies in a language-agnostic manner while traveling abroad, but for that to work, everyone would need to memorize a bunch of pictographs on the off chance that a foreign traveller is incapable of conveying his or her allergies in a mutually understood spoken/written language. This seems like a worse strategy than carrying around a card that says "I can't have nuts or eggs".

[1] https://en.wikipedia.org/wiki/Laundry_symbol

From c933103 at gmail.com Mon Aug 3 06:38:27 2015
From: c933103 at gmail.com (gfb hjjhjh)
Date: Mon, 3 Aug 2015 19:38:27 +0800
Subject: Emoji characters for food allergens
In-Reply-To: <63A5860F-743C-4D50-8BD9-FF279842FF1E@ngalt.com>
References: <20150728122416.665a7a7059d7ee80bb4d670165c8327d.e45e67032a.wbe@email03.secureserver.net> <27556497.34395.1438177379567.JavaMail.defaultUser@defaultHost> <63A5860F-743C-4D50-8BD9-FF279842FF1E@ngalt.com>
Message-ID:

The general public only needs to understand the symbols that relate to themselves; people who prepare food can have a legend, written in their own language, for which icon means what. And I think it is actually better to establish a separate standard instead of basing it on Unicode, as Unicode cannot do the job of promoting people to use these symbols, unlike what a standards committee can do.

On 3 Aug 2015 at 16:53, "Nathan Sharfi" wrote:
>
> > On Jul 29, 2015, at 7:27 AM, Andrew West wrote:
> >
> > On 29 July 2015 at 14:42, William_J_G Overington
> > wrote:
> >>
> >> For example, one such character could be used to be placed before a
> list of
> >> emoji characters for food allergens to indicate that a list of
> dietary
> >> needs follows.
> >>
> >> For example,
> >>
> >> My dietary need is no gluten no dairy no egg
> >>
> >> There could be a way to indicate the following.
> >> > >> My diet can include soya > > > > There already is, you can write "My diet can include soya". > > > > If you are likely to swell up and die if you eat a peanut (for > > example), you will not want to trust your life to an emoji picture of > > a peanut which could be mistaken for something else or rendered as a > > square box for the recipient. There may be a case to be made for > > encoding symbols for food allergens for labelling purposes, but there > > is no case for encoding such symbols as a form of symbolic language > > for communication of dietary requirements. > > > > Andrew > > I've recently tried to closely follow the care tags on my clothes instead > of dumping most of them in the cold/cold batch. When I look at the care > tags, I squint at the hieroglyphs[1] for five seconds, give up, and then > start looking for instructions written in English ? that is, useful > instructions. > > I'd imagine a chef trying to 'read' dietary-needs symbols would be > similarly trying, only with dire consequences for getting it wrong. > > I can see why someone might want to communicate their allergies in a > language-agnostic manner while traveling abroad, but for that to work, > everyone would need to memorize a bunch of pictographs on the off chance > that a foreign traveller is incapable of conveying his or her allergies in > a mutually understood spoken/written language. This seems like a worse > strategy than carrying around a card that says "I can't have nuts or eggs". > > > [1] https://en.wikipedia.org/wiki/Laundry_symbol > -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Mon Aug 3 09:10:12 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 3 Aug 2015 16:10:12 +0200 (CEST) Subject: Emoji characters for food allergens Message-ID: <382276435.11376.1438611012864.JavaMail.www@wwinf1g28> On 03 Aug 2015, at 10:39:13, Mark Davis ?? 
wrote:
> BTW, the UTC declined to accept the allergen emoji set proposal. While some of the food items may be acceptable and the emoji subcommittee could re-propose them, there are principled problems with trying to deal with allergens as a set of emoji. So that is off the table.

Since food emoji encoding goes on nonetheless, there are a few points from last week to flash back on; sorry to be late.

On 26 Jul 2015, at 05:45, William_J_G Overington wrote:
> I suggest, in view of the importance of precision in conveying information about food allergens, that the emoji characters for food allergens should be separate characters from other emoji characters. That is, encoded in a separate, quite distinct block of code points, far away in the character map from other emoji characters, with no dual meanings for any of the characters: a character for a food allergen should be quite separate and distinct from a character for any other meaning.

I fear that discouraging the use of food pictographs, whether they represent allergen-containing food or not, in everyday messages and with any desirable meaning could prevent users from becoming familiar with them. Browsing the Charts, I cannot see another place for food-related symbols than the last such block, Supplemental Symbols and Pictographs at 1F900. This is already far away from the Emoticons block at U+1F600. The meaning, as often in Unicode, is context-determined. Finding an allergen pictograph on a food package with convenient markup would then have added an unambiguous sense. I'll try to explain a bit later why emojis may be useful.

> I opine that having two separate meanings for the same character, one meaning as an everyday jolly good fun meaning in a text message and one meaning as a specialist food allergen meaning, could be a source of confusion. Far better to encode a separate code block with separate characters right from the start than risk needless and perhaps medically dangerous confusion in the future.
Unicode would have encoded the new allergen pictographs under a Food Allergens subhead, thus ensuring the primary meaning. Furthermore, an everyday use will be less obvious in most cases, as food allergens are preferably depicted as ingredients, while typical everyday food emojis, like the fast food shown elsewhere in this thread, mostly show prepared food. For example, U+1F35A COOKED RICE, while gluten-free, will rather have a dish meaning, whereas U+1F33E EAR OF RICE may refer more precisely to the ingredient, and a future EAR OF WHEAT would symbolize gluten-containing cereals, as it already does in food labelling. BTW I find it urgent to encode all these ears (WHEAT, BUCKWHEAT, which is part of the proposal, and so on), because at present only *two* kinds of cereals have their ear in Unicode: U+1F33D EAR OF MAIZE and U+1F33E EAR OF RICE.

> I suggest that for each allergen there be two characters.
> The glyph for the first character of the pair goes from baseline to ascender.
> The glyph for the second character of the pair is a copy of the glyph for the first character of the pair, augmented with a thick red line from lower left descender to upper right a little above the baseline, the thick red line perhaps being at about thirty degrees from the horizontal. Thus the thick red line would go over the allergen part of the glyph yet just clip it a bit so that clarity is maintained.
> The glyphs are thus for the presence of the allergen and the absence of the allergen respectively.

Sorry, I don't believe that this would have been agreed, because package design is done in high-end software such as QuarkXPress, InDesign, PagePro, so it would be easy to add some expressive and unambiguous markup to a unique symbol. IMHO it might be nice to have something surrounding, like a circle for the presence and a (barred) square for the absence, or conversely.
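The character names cited in this thread are easy to verify programmatically; a quick sketch (the results reflect whatever UCD version your Python build ships, so a very old interpreter may lack these code points):

```python
# Look up the formal Unicode names of the rice emoji discussed above.
# Note U+1F35A: its Unicode name is COOKED RICE, although chart glyphs
# usually draw a bowl of rice.
import unicodedata

for cp in (0x1F35A, 0x1F33D, 0x1F33E):
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}")
```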
About colors: the absence of an allergen from a given food being good news for patients, we could opt for some green tone, while by contrast red conveys rather a warning and might thus be suitable for its presence. By analogy with road signs, a red circle could perhaps best express this case, as allergic consumers must avoid the product. I thought about a triangle, but this has an inner field too small for the symbol while taking up too much room (a triangle being bulkier than a square or circle). If the industry agrees, a triangle for presence may be adopted.

> It is typical in the United Kingdom to label food packets not only with an ingredients list but also with a list of allergens in the food and also with a list of allergens not in the food.
> For example, a particular food may contain soya yet not gluten.
> Thus I opine that two characters are needed for each allergen.

Correspondingly, French legislation requires that the allergens be marked up in bold font style in the ingredients list, and that this be followed by a list of allergens risking contamination due to their use in the workshop. The meaning of the bold markup must be explained (like «In bold: information intended for allergic persons»). The United Kingdom solution is more explicit. The problem is how to transpose this into a CJK context, and that's where the proposed pictographs would become useful.

> I have deliberately avoided a total strike-through at forty-five degrees, as I opine that that could lead to problems distinguishing clearly the glyph for the absence of one allergen from the glyph for the absence of another allergen.

About how to place the slash or backslash, I agree with William that it must not hide the symbol. To achieve this, the allergen pictograph might also be raised to the foreground, being thus fully viewable, while in this case the slash must be very thick and can be continued in outline where it crosses the pictograph.
Its orientation (upper left to lower right, vs lower left to upper right) may be a matter of personal preference, but judging from heraldry, from road signs and from Unicode (U+20E0), the backslash could be slightly more current.

[I have already answered some other points and will mention others in later replies.]

Thank you for having made the Mailing List aware of this proposal and for supporting it. I'm sad that it is now essentially off the table.

All the best,

Marcel Schneider

From charupdate at orange.fr Mon Aug 3 09:18:45 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Mon, 3 Aug 2015 16:18:45 +0200 (CEST)
Subject: Emoji characters for food allergens
Message-ID: <419788703.11590.1438611525033.JavaMail.www@wwinf1g28>

On 28 Jul 2015, at 15:00, Michael Everson wrote [I placed the quotation first]:
> On 26 Jul 2015, at 06:05, Garth Wallace wrote:
>
>> On Sat, Jul 25, 2015 at 9:43 AM, William_J_G Overington
>> wrote:
>>> Emoji characters for food allergens
>>>
>>> An interesting document entitled
>>>
>>> Preliminary proposal to add emoji characters for food allergens
>>>
>>> by Hiroyuki Komatsu
>>>
>>> was added into the UTC (Unicode Technical Committee) Document Register
>>> yesterday.
>>>
>>> http://www.unicode.org/L2/L2015/15197-emoji-food-allergens.pdf
>>>
>>> This is a welcome development.
>>
>> I'm skeptical. I understand the rationale, but several of the proposed
>> characters are essentially SMALL PILE OF BROWN DOTS and would be
>> difficult to distinguish at typical sizes.

[I've already answered on this point.]

> I do NOT understand the rationale.
>
> Emojis are not for labelling things. They're for the playful expression of emotions.
>
> Standardized symbols for allergens might be useful, if there were a textual use for them.
On 28 Jul 2015, at 20:26, Garth Wallace replied:
> Well, there are several emoji for various items encountered in daily
> life, and I think the reasoning is that allergens are important things
> to refer to because of their health effects. It's a bit of a leap to
> say that means there's a need for dedicated pictograms though.
> I agree, it does seem to be putting the cart before the horse.

I believe the issue should be placed back in its original context. All over the world, pictographs convey some vital information to tourists, but more especially in CJK countries they also avoid encumbering packages with lots of Latin, Cyrillic, and possibly Greek and other scripts. Well, personally I would suggest citing the allergens by their Latin scientific name (such as TRITICUM for wheat and, by extension, gluten), but before deprecating the proposal we should ask ourselves and the countries concerned whether such Latin labelling is acceptable. The fact that Mr Komatsu took the pains to work out this proposal tends to prove that it is *not*.

Best regards,

Marcel Schneider

From charupdate at orange.fr Mon Aug 3 09:21:52 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Mon, 3 Aug 2015 16:21:52 +0200 (CEST)
Subject: Emoji characters for food allergens
Message-ID: <1352082955.11634.1438611712981.JavaMail.www@wwinf1g28>

On 29 Jul 2015, at 10:21, William_J_G Overington wrote:
>> Alternately, scanning the EAN barcode on the package could give access to a database intended for food information. This requires the use of a smartphone or other compatible device.
> That is a good idea.
> In which case the emoji would not need to be encoded on the package, yet would be sent by the database facility.

Using the EAN barcode to query a database, with the results sent to the end user, would need a two-way communication link, and that could possibly mean queueing problems, as the database facility would possibly be answering requests from many people.
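As an aside on the EAN mechanism discussed above: the barcode itself carries only an article number plus a check digit, so any allergen data would indeed have to come from the database side. A small sketch of the standard EAN-13 check-digit rule (alternating weights 1 and 3 from the left):

```python
# EAN-13 check digit: weight the first 12 digits 1,3,1,3,... from the left,
# sum them, and take the amount needed to reach the next multiple of 10.
def ean13_check_digit(first12: str) -> int:
    if len(first12) != 12 or not first12.isdigit():
        raise ValueError("expected 12 digits")
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
    return (10 - total % 10) % 10

# 4006381333931 is a commonly cited valid EAN-13 example number.
print(ean13_check_digit("400638133393"))  # 1
```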
> Another possibility would be to encode the Unicode characters for the allergens contained in the food within a QR code (Quick Response Code) on the package.
> Decoding could then be local, in the device being used to scan the QR code.
> [...]

Somehow this device-reliant information system wouldn't make me really happy. IMHO the most straightforward communication relies on the packaging, and for this a standardized set of emojis would have been useful. For more clarity, a textual list may complete the labelling, probably using the Latin scientific names. Every allergic person should then be given by his or her practitioner or other health care provider a personal list of allergens, a kind of allergen profile, both in the local language and in Latin, plus the pictographs. We should perhaps take into consideration that allergen lists may be very long, and translating them into emojis will make them somewhat bulky, particularly on small packages. So the emojis would be used only if desired or required.

Best regards,

Marcel Schneider

From charupdate at orange.fr Mon Aug 3 09:30:11 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Mon, 3 Aug 2015 16:30:11 +0200 (CEST)
Subject: Emoji characters for food allergens
Message-ID: <1310487596.11828.1438612211991.JavaMail.www@wwinf1g28>

On 29 Jul 2015, at 15:42, William_J_G Overington wrote [On 28 Jul 2015, at 22:26, gfb hjjhjh wrote]:
>> As according to http://unicode.org/faq/emoji_dingbats.html , emoji characters do not have single semantics. Which I think is not what the original proposer wants? Or am I misunderstanding that?
> Garth Wallace has already indicated in his reply to your post that it was me, not the original proposer, who wanted single semantics.
> [...]
> The easiest thing appears to be to not call the items emoji.
> I opine that a new word is needed to mean the following.
> A character that looks like an emoji character yet has precise semantics.
> There is an issue here that is, in my opinion, quite fundamental to the future of encoding items that are currently all regarded as emoji: an issue that goes far beyond the matter of encoding emoji characters for food allergens.
> Communication through the language barrier is of huge importance and may become more so in the future.

IMHO we've already overcome the language barrier, as we all communicate in English, after the model of medieval Latin communication across Europe, communication across the ancient Roman Empire, and Koine Greek from Alexander's conquests on.

> Emoji seemed like a wonderful way to achieve communication through the language barrier.

We remember that Esperanto was also a hopeful way to unify language, raising much enthusiasm among its followers. IMHO a pictograph-based script can hardly perform well enough, unless it ends up becoming a kind of new Esperanto, except that it doesn't include speech.

> Yet if semantics are not defined, then there is a problem.

Not only emoji semantics: even natural-language semantics are often not precisely defined, but that doesn't hinder us from defining the semantics of a particular message by adding more words. Equally, an allergen emoji might be preceded or followed by a poison emoji (U+2620) to make the health threat unambiguous.

> Please consider the matter of text to speech in the draft Unicode Technical Report 51.
> I remember years ago I was asked in this mailing list what chat means.
> I think that discussing the meaning of chat is some classic Unicode cultural matter.
> In English it is an informal talk between two or more people, in French it is a cat.

As far as I can see, in today's French, «chat» has the meaning of its English homograph, except when the context makes the (original French) zoological meaning unambiguous. Having said that, I hasten to add that the English word «chat» has been Frenchified to «tchatche», but not very successfully.
> So the sequence of Unicode characters only has meaning in the context in which they are being used.

And Unicode even provides language tags to disambiguate.

> Now the big opportunity with emoji could be to assist communication through the language barrier.

That's right: emojis can assist communication, but they cannot entirely replace classical character-based communication.

> From reading about semantics in the linked document it appears that that opportunity might be disappearing or may have gone already.
> This, in my opinion, is unfortunate.
> The food allergen characters could, by being precisely defined with one and only one meaning, be either an exception to the general situation or the start of a trend.

We cannot define precisely and irrevocably the meaning of any grapheme, except in mathematics. We can only describe its use at a given time in history. I don't believe that Unicode has the power to forbid any semantics of any emoji, nor did it ever aim to. See the English apostrophe: Unicode's primary advice has been overrun by mainstream usage.

> A name other than emoji is needed for such characters that have one and only one meaning, that meaning precisely defined.

Creating a new script is not in Unicode's purpose, which is (please check if I'm right) to encode all *existing* scripts. I underscore *existing* with respect to the present context, but originally the stress is on *all*. Encoding *all* existing scripts used in present or past times is a great purpose, and Unicode is about to reach that goal. Subsequently, *if* a user community creates and uses a *new* script made of pictographs or of other signs, Unicode can be pleased to encode it. Sure.

> [...]
> For example, one such character could be used to be placed before a list of emoji characters for food allergens to indicate that a list of dietary needs follows.
> For example,
> My dietary need is no gluten no dairy no egg
> There could be a way to indicate the following.
> My diet can include soya

My nourishment too includes soya, in the form of much tonyu (whether fermented or not), and it excludes dairy, egg, meat, poultry, fish, honey; things that were very much included in the past. The problem as I see it is whether people are at ease with expressing this or not. Personally I don't hesitate to use much natural language to explain the facts, nor do other people I know of. The difference might be that in these cases the nourishment preferences and aversions result uniquely from awareness of the crimes committed against the animals, whereas dietary requirements basically result from recommendations made by practitioners or other health care providers. The two motivations may overlap.

As communicating dietary requirements results in constraints for other people, especially cooks, servers, attendants, hosts, friends, managers, housekeepers, this communication may often be very sensitive and may induce either self-humiliation or offence, partly also because natural language is never neutral and moreover leaves a margin for interpretation. The task may even turn out to be impossible when foreign languages are involved. Using standardized emojis could greatly ease matters. Had food allergen emojis become available, I would have suggested preparing two bullet lists, stacked or side by side. In the first list, every food emoji is preceded by U+2620 ☠ SKULL AND CROSSBONES. In the second list, every food emoji is preceded by U+2665 ♥ BLACK HEART SUIT. I say «bullet lists», but the array may also be referred to as lists of two-emoji sequences. I can imagine that this would be received with a smile and gladly followed.

> There is a situation that affects further discussion of some aspects of this matter, though not all aspects, as a totally symbolic representation could still be discussed.
> http://www.unicode.org/mail-arch/unicode-ml/y2015-m06/0208.html
> However, there is also the following.
> http://www.oxforddictionaries.com/definition/english/moratorium
> Please note the use of the word temporary in the definition.
> So maybe all is not lost and discussion of all aspects will become possible at some future time.

Alas, in this particular context, «moratorium» is a euphemism with the meaning of a prohibition. Please note that I use angle quotes to avoid giving the impression that I am scare-quoting the word. That's a good example of how useful it is to disambiguate quotation quotes and scare quotes. Well, I could use some supplemental words to express that, like: Alas, in this particular context, the word moratorium, as it is used, is a euphemism with the meaning of a prohibition. It's always the same issue: multiple semantics vs precise definition.

I hope that helps.

All the best,

Marcel Schneider

From charupdate at orange.fr Mon Aug 3 09:36:09 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Mon, 3 Aug 2015 16:36:09 +0200 (CEST)
Subject: Refining communication about poisons (was: Re: Emoji characters for food allergens)
Message-ID: <1993129798.11946.1438612569896.JavaMail.www@wwinf1g28>

On 29 Jul 2015, at 18:39, Doug Ewell wrote:

> Andrew West wrote:
>> There may be a case to be made for encoding symbols for food allergens
>> for labelling purposes, but there is no case for encoding such symbols
>> as a form of symbolic language for communication of dietary
>> requirements.
> For what little it is worth, I agree with Andrew on this.

Sorry, I disagree, as I explained in my previous e-mail:
http://www.unicode.org/mail-arch/unicode-ml/y2015-m08/0009.html

> Earlier I mentioned U+2620 SKULL AND CROSSBONES and U+2623 BIOHAZARD SIGN, two symbols which have been in Unicode since the dawn of time. [...]
> While communication about food allergens is undoubtedly important, it's hard to imagine that communication about poisons and biohazards is any less important.

Agreed.
It's even more important, as food allergies are triggered by slow poisoning through residual pesticides and food additives, and through the consumption of bad-quality cereals grown with an overuse of mineral fertilizers. This is why U+2620 SKULL AND CROSSBONES should be used in food labelling whenever the ingredients are *not* organically grown, or certain food additives are used. To complete, I therefore suggest encoding a panel of food hazard symbols, among which:

+ PESTICIDE RESIDUES SYMBOL
+ MINERAL FERTILIZER OVERUSE SYMBOL
+ ARTIFICIALLY COLOURED SYMBOL (use: certain synthetic food colours cause health issues)
+ STABILIZER SYMBOL
+ SALTY FOOD SYMBOL
+ VETERINARY DRUG RESIDUES SYMBOL (That's so big an issue that the FDA is validating a *new* drug residues analysis selection model for interstate milk shipping.)

and so on.

Equally, artificially impoverished food ingredients like white sugar and white flour act poison-like on the metabolic level (more explanations would be off-topic) and must thus be declared whenever they are not restored with bran, germ, and molasses. To achieve this, the following pictographs will be useful:

+ IMPOVERISHED FOOD WARNING SYMBOL
+ MISSING BRAN AND GERM SYMBOL
+ MISSING MOLASSES SYMBOL

Declaring the slightest and most probably nonexistent traces of food allergens, while concealing from consumers all these health-threatening poisons that are likewise purposely added to everyday food, or that basic carbs are transformed into, is a particularly insidious form of hypocrisy. This criticism must be taken as a motivation to encode these new pictographs. It does not target in any way the proposer of the allergen emojis, nor any other person here around. It refers to the economic background of food allergen labelling, and thus has its place in this thread.
Best regards,

Marcel Schneider

From charupdate at orange.fr Mon Aug 3 10:08:14 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Mon, 3 Aug 2015 17:08:14 +0200 (CEST)
Subject: Emoji characters for food allergens
In-Reply-To: <63A5860F-743C-4D50-8BD9-FF279842FF1E@ngalt.com>
References: <20150728122416.665a7a7059d7ee80bb4d670165c8327d.e45e67032a.wbe@email03.secureserver.net> <27556497.34395.1438177379567.JavaMail.defaultUser@defaultHost> <63A5860F-743C-4D50-8BD9-FF279842FF1E@ngalt.com>
Message-ID: <371935304.12799.1438614495033.JavaMail.www@wwinf1g28>

On 03 Aug 2015, at 10:57, Nathan Sharfi wrote:

> I've recently tried to closely follow the care tags on my clothes instead of dumping most of them in the cold/cold batch. When I look at the care tags, I squint at the hieroglyphs[https://en.wikipedia.org/wiki/Laundry_symbol] for five seconds, give up, and then start looking for instructions written in English - that is, useful instructions.

I'm sorry to really disagree with this scarcely understandable criticism of laundry symbols. The most frequently encountered care tags are self-explaining, such as the washing and ironing temperature limits or the symbols discouraging these treatments. The other symbols mainly concern dry cleaning and laundry professionals. Many clothes are shipped across the world, so English would not always be suitable as a language. Further, symbols stay readable longer, while text is often washed out.

> I'd imagine a chef trying to 'read' dietary-needs symbols would be similarly trying, only with dire consequences for getting it wrong.

That's another case. All chefs understand English, so presenting allergen lists in English is a working strategy. The concern added by William is about how to present such a list, as he wishes a symbol for "I'm allergic to" and a symbol for "My diet can include". For this I suggest the poison and heart symbols. To follow your advice, these may be used as bullets for lists written in English.
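As a rough illustration of the bullet idea above, here is a minimal C sketch that formats such entries as UTF-8 strings, with U+2620 marking foods to avoid and U+2665 marking tolerated foods. The helper function and the item names are invented for this example; nothing here comes from any standard or proposal.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* UTF-8 byte sequences for the two suggested bullet symbols. */
#define AVOID "\xE2\x98\xA0"   /* U+2620 SKULL AND CROSSBONES */
#define OKSYM "\xE2\x99\xA5"   /* U+2665 BLACK HEART SUIT */

/* Format one list entry as "<bullet> <item>" into buf. */
static void entry(char *buf, size_t n, const char *bullet, const char *item) {
    snprintf(buf, n, "%s %s", bullet, item);
}
```

A caller would loop over two arrays of item names and print one `entry` per line, producing the two stacked bullet lists described above.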
> > I can see why someone might want to communicate their allergies in a language-agnostic manner while traveling abroad, but for that to work, everyone would need to memorize a bunch of pictographs on the off chance that a foreign traveller is incapable of conveying his or her allergies in a mutually understood spoken/written language. This seems like a worse strategy than carrying around a card that says "I can't have nuts or eggs". I understand the issue. Best regards, Marcel Schneider From doug at ewellic.org Mon Aug 3 13:02:38 2015 From: doug at ewellic.org (Doug Ewell) Date: Mon, 03 Aug 2015 11:02:38 -0700 Subject: Windows keyboard restrictions (was: Re: Windows 10 release) Message-ID: <20150803110238.665a7a7059d7ee80bb4d670165c8327d.301dafd5ac.wbe@email03.secureserver.net> Marcel Schneider wrote: > The bug on Windows I encountered at the end of July has been > definitely identified and reconstructed. After ninety-five drivers > compiled since the bug appeared, I can tell so much as that the > problem is related to the length of the so-called ligatures. When the > MSKLC was built, they were limited to four characters on Windows (see > glossary in the MSKLC Help). On my machine the maximal length is 16 > characters. The problem is that this is not equal on all shift states > and perhaps keys. Roughly, I can put five characters on modification > number three, that is normally AltGr, but not on #4 (Shift+AltGr). 
As far as I can tell, the limit for a ligature on a Windows keyboard layout is four UTF-16 code points:

MSKLC help, under "Validation Reference":
"Ligatures cannot contain more than four UTF-16 code points"

Presentation from IUC 23 by Michael Kaplan (author of MSKLC) and Cathy Wissink:
http://tinyurl.com/o49r4bz

KbdEdit:
http://www.kbdedit.com/

MUFI:
http://folk.uib.no/hnooh/mufi/keyboards/WinXPkeyboard.html

I understand that there are some tools (such as Keyboard Layout Manager) that claim a higher limit, and it may even be possible in some cases to assign more than four, but the DOCUMENTED limit appears to be four. (If you claim that it is not, please provide a link to the relevant official documentation, and note that C++ code showing 16 fields is not documentation.)

It is not a bug for software to fail to perform BEYOND its documented limits.

Since you are so very eager to declare this a bug, or a collection of bugs, rather than a design limitation, I strongly recommend you get in touch with Microsoft Technical Support and express your concerns to them. Make sure to let them know just how certain you are that these are bugs. See if they'll send you a T-shirt.

--
Doug Ewell | http://ewellic.org | Thornton, CO

From asmus-inc at ix.netcom.com Mon Aug 3 14:01:09 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Mon, 3 Aug 2015 12:01:09 -0700
Subject: Emoji characters for food allergens
In-Reply-To: <371935304.12799.1438614495033.JavaMail.www@wwinf1g28>
References: <20150728122416.665a7a7059d7ee80bb4d670165c8327d.e45e67032a.wbe@email03.secureserver.net> <27556497.34395.1438177379567.JavaMail.defaultUser@defaultHost> <63A5860F-743C-4D50-8BD9-FF279842FF1E@ngalt.com> <371935304.12799.1438614495033.JavaMail.www@wwinf1g28>
Message-ID: <55BFBA75.8090500@ix.netcom.com>

An HTML attachment was scrubbed...
URL: From asmus-inc at ix.netcom.com Mon Aug 3 14:38:06 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Mon, 3 Aug 2015 12:38:06 -0700 Subject: Emoji characters for food allergens In-Reply-To: <371935304.12799.1438614495033.JavaMail.www@wwinf1g28> References: <20150728122416.665a7a7059d7ee80bb4d670165c8327d.e45e67032a.wbe@email03.secureserver.net> <27556497.34395.1438177379567.JavaMail.defaultUser@defaultHost> <63A5860F-743C-4D50-8BD9-FF279842FF1E@ngalt.com> <371935304.12799.1438614495033.JavaMail.www@wwinf1g28> Message-ID: <55BFC31E.2060607@ix.netcom.com> An HTML attachment was scrubbed... URL: From gwalla at gmail.com Mon Aug 3 15:22:01 2015 From: gwalla at gmail.com (Garth Wallace) Date: Mon, 3 Aug 2015 13:22:01 -0700 Subject: Olympic sports emoji In-Reply-To: References: <20150727151200.665a7a7059d7ee80bb4d670165c8327d.8d64c4d981.wbe@email03.secureserver.net> Message-ID: On Mon, Jul 27, 2015 at 10:55 PM, Garth Wallace wrote: > > On Mon, Jul 27, 2015 at 3:12 PM, Doug Ewell wrote: >> Leo Broukhis wrote: >> >>> Fonts vary and can be copyrighted, no doubt, but Unicode is not about >>> fonts. >> >> I was going to bust out the Apple logo as an analogy to the Olympic >> symbols, but apparently the Apple logo is trademarked and not merely >> copyrighted, so never mind. >> >> In any case, if this is just a character/glyph thing, then there >> shouldn't be a problem using either the existing emoji or the ones >> proposed in L2/15-196R for Olympic sports, since the glyphs can simply >> be styled as needed. > > Would this be considered within the normal range of glyphic variation? > Would an icon of two pugilists fighting be an acceptable rendering of > a BOXING GLOVE emoji? > > BTW, speaking as a martial artist myself, I have to say an empty dogi > is an odd representation for martial arts, even specifically Japanese > ones. 
> The proposal says that it could be used for judo, karate, and tae kwon do; it at least matches the first two (they are distinct, but not in a way that would , and practice uniforms for TKD are similar, but competitive TKD under WTF rules (including Olympic competition) uses several pieces of protective equipment (helmet, gloves, chest guard) with colored padding over the dobok.

Also, has anyone else noticed that the proposed WRESTLING emoji doesn't depict competitive wrestling? It's a pair of shirtless men in baggy pants standing straight up, with one apparently grabbing the other by the ponytail and hitting his face.

From petercon at microsoft.com Mon Aug 3 17:24:25 2015
From: petercon at microsoft.com (Peter Constable)
Date: Mon, 3 Aug 2015 22:24:25 +0000
Subject: Emoji characters for food allergens
In-Reply-To: <55BFBA75.8090500@ix.netcom.com>
References: <20150728122416.665a7a7059d7ee80bb4d670165c8327d.e45e67032a.wbe@email03.secureserver.net> <27556497.34395.1438177379567.JavaMail.defaultUser@defaultHost> <63A5860F-743C-4D50-8BD9-FF279842FF1E@ngalt.com> <371935304.12799.1438614495033.JavaMail.www@wwinf1g28> <55BFBA75.8090500@ix.netcom.com>
Message-ID: 

Once back when I was living in Thailand, I was riding in a taxi to the Bangkok airport on a recently-opened highway. There were road signs posted at intervals that had a two-digit number ("60" or something like that) enclosed in a circle. Having had enough experience with road signage in my home country and also other countries, I recognized this to be a speed limit.

But knowing common practices for how many Thais at the time would obtain their driver's license, and the education level of many Thais coming from rural areas to work as taxi drivers in Bangkok, I was curious enough to ask the driver what the sign meant. (He being monolingual, this was all in Thai.) He thought for a moment and then responded that it was the distance to the airport.
Anecdote aside, the assumption of these discussions is that symbols are iconic, which means that the symbol communicates a conventional semantic. And the point of this being _conventional_ is that the semantic is not self-evident from the appearance of the image, but rather is based on a shared agreement. For example, a photograph of a chair is not iconic since it is an ostensive rendition of an actual chair. But a symbol of an iron with a dot inside it intended to mean "can be ironed with low heat" is iconic because its meaning is conventional, and like any convention, must be learned.

Some conventions may be universally learned, but very few are. Most are limited to particular cultures, and even if used in many cultures, may be learned by only small portions of the given culture. Even something like a speed limit sign that a driver within a given culture sees every day and is expected to understand is not necessarily something that the driver has learned. Much less something like icons for handling of laundry, which have been used in several countries for a few decades now but that nobody has ever been required to learn, and that few people actually do learn to any great extent.

Peter

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Asmus Freytag (t)
Sent: Monday, August 3, 2015 12:01 PM
To: unicode at unicode.org
Subject: Re: Emoji characters for food allergens

I'm sorry to really disagree with this scarcely understandable criticism of laundry symbols. The most frequently encountered care tags are self-explaining, such as the washing and ironing temperature limits or the symbols discouraging these treatments. The other symbols mainly concern dry cleaning and laundry professionals.

The laundry symbols are like traffic signs. The ones you see daily aren't difficult to remember, but there are always some rare ones that are a bit baffling. What you apparently do not realize is that in significant parts of the world, these symbols are not common (or occur only as adjunct to text).
There's therefore no daily reinforcement at all. Where you live, the situation is reversed; no wonder you are baffled.

All chefs understand English,

I would regard that statement to have a very high probability of being wrong. Which would make any conclusions based on it invalid.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From asmus-inc at ix.netcom.com Mon Aug 3 18:14:57 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Mon, 3 Aug 2015 16:14:57 -0700
Subject: Emoji characters for food allergens
In-Reply-To: 
References: <20150728122416.665a7a7059d7ee80bb4d670165c8327d.e45e67032a.wbe@email03.secureserver.net> <27556497.34395.1438177379567.JavaMail.defaultUser@defaultHost> <63A5860F-743C-4D50-8BD9-FF279842FF1E@ngalt.com> <371935304.12799.1438614495033.JavaMail.www@wwinf1g28> <55BFBA75.8090500@ix.netcom.com>
Message-ID: <55BFF5F1.6000204@ix.netcom.com>

Nice anecdote.

I share the concerns you raise in your reflection on the limits of shared conventions. Unicode cannot be so constrained that it encodes only universally accepted icons, but it should be constrained to not encode characters on foot of possible conventions that are not actually demonstrated anywhere. There's currently no convention for denoting allergens by emoji (pictorial renditions), so that usage is something that is speculative at the moment.

Not as speculative is the suggestion that certain food items should be added - it seems to be an acceptable principle to encode "iconic" foods. That would argue for emoji for milk and bread (w/o cross aliasing it as gluten), but not for soy beans, for example.

A./

On 8/3/2015 3:24 PM, Peter Constable wrote:
>
> Once back when I was living in Thailand, I was riding in a taxi to the Bangkok airport on a recently-opened highway. There were road signs posted at intervals that had a two-digit number ("60" or something like that) enclosed in a circle.
> Having had enough experience with road signage in my home country and also other countries, I recognized this to be a speed limit.
>
> But knowing common practices for how many Thais at the time would obtain their driver's license, and the education level of many Thais coming from rural areas to work as taxi drivers in Bangkok, I was curious enough to ask the driver what the sign meant. (He being monolingual, this was all in Thai.) He thought for a moment and then responded that it was the distance to the airport.
>
> Anecdote aside, the assumption of these discussions is that symbols are iconic, which means that the symbol communicates a conventional semantic. And the point of this being _conventional_ is that the semantic is not self-evident from the appearance of the image, but rather is based on a shared agreement. For example, a photograph of a chair is not iconic since it is an ostensive rendition of an actual chair. But a symbol of an iron with a dot inside it intended to mean "can be ironed with low heat" is iconic because its meaning is conventional, and like any convention, must be learned.
>
> Some conventions may be universally learned, but very few are. Most are limited to particular cultures, and even if used in many cultures, may be learned by only small portions of the given culture. Even something like a speed limit sign that a driver within a given culture sees every day and is expected to understand is not necessarily something that the driver has learned. Much less something like icons for handling of laundry, which have been used in several countries for a few decades now but that nobody has ever been required to learn, and that few people actually do learn to any great extent.
> Peter
>
> From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Asmus Freytag (t)
> Sent: Monday, August 3, 2015 12:01 PM
> To: unicode at unicode.org
> Subject: Re: Emoji characters for food allergens
>
> I'm sorry to really disagree with this scarcely understandable criticism of laundry symbols. The most frequently encountered care tags are self-explaining, such as the washing and ironing temperature limits or the symbols discouraging these treatments. The other symbols mainly concern dry cleaning and laundry professionals.
>
> The laundry symbols are like traffic signs. The ones you see daily aren't difficult to remember, but there are always some rare ones that are a bit baffling. What you apparently do not realize is that in significant parts of the world, these symbols are not common (or occur only as adjunct to text). There's therefore no daily reinforcement at all.
>
> Where you live, the situation is reversed; no wonder you are baffled.
>
> All chefs understand English,
>
> I would regard that statement to have a very high probability of being wrong. Which would make any conclusions based on it invalid.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From mark at kli.org Mon Aug 3 18:24:07 2015
From: mark at kli.org (Mark E. Shoulson)
Date: Mon, 03 Aug 2015 19:24:07 -0400
Subject: Emoji characters for food allergens
In-Reply-To: <1310487596.11828.1438612211991.JavaMail.www@wwinf1g28>
References: <1310487596.11828.1438612211991.JavaMail.www@wwinf1g28>
Message-ID: <55BFF817.2070401@kli.org>

On 08/03/2015 10:30 AM, Marcel Schneider wrote:
> On 29 Jul 2015, at 15:42, William_J_G Overington wrote:
>
>> Emoji seemed like a wonderful way to achieve communication through the language barrier.
> We remember that Esperanto was also a hopeful way to unify the language, raising much enthusiasm among its followers.
> IMHO a pictograph-based script can hardly be performant enough, unless it ends up becoming a kind of new Esperanto, except that it doesn't include speech.

It's already noted that this is totally out of scope for Unicode, but if you're interested in this kind of pictographic pidgin, take a look at https://www.kwikpoint.com/ Someone already did some of it.

~mark

From richard.wordingham at ntlworld.com Wed Aug 5 14:32:02 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 5 Aug 2015 20:32:02 +0100
Subject: Mongolian Joining Type
In-Reply-To: <20150802222319.25a2dd7e@JRWUBU2>
References: <20150802222319.25a2dd7e@JRWUBU2>
Message-ID: <20150805203202.0d148761@JRWUBU2>

On Sun, 2 Aug 2015 22:23:19 +0100 Richard Wordingham wrote:

> As far as I can tell, it is currently supposed to be controlled by the property joining_type. However, this property appears only to have been non-trivially assigned to the characters of the Mongolian script from Version 6.3.0. How was categorisation assigned before then?

Is there anyone alive and here who remembers? Or knows where to find the information? (As opposed to merely knowing where the information ought to have been recorded.)

Richard.

From roozbeh at unicode.org Wed Aug 5 14:48:59 2015
From: roozbeh at unicode.org (Roozbeh Pournader)
Date: Wed, 5 Aug 2015 12:48:59 -0700
Subject: Mongolian Joining Type
In-Reply-To: <20150805203202.0d148761@JRWUBU2>
References: <20150802222319.25a2dd7e@JRWUBU2> <20150805203202.0d148761@JRWUBU2>
Message-ID: 

These were the original proposals:
http://www.unicode.org/L2/L2012/12202-shaping.txt
http://www.unicode.org/L2/L2012/12360-mong-shaping.txt
(with considerable UTC discussions).

A good trick is going through the posted UTC minutes and searching for the topic you are interested in. Or just do Google searches, restricting your search to site:unicode.org and adding "L2" to the search string.
On Wed, Aug 5, 2015 at 12:32 PM, Richard Wordingham <richard.wordingham at ntlworld.com> wrote:

> On Sun, 2 Aug 2015 22:23:19 +0100 Richard Wordingham wrote:
>
> > As far as I can tell, it is currently supposed to be controlled by the property joining_type. However, this property appears only to have been non-trivially assigned to the characters of the Mongolian script from Version 6.3.0. How was categorisation assigned before then?
>
> Is there anyone alive and here who remembers? Or knows where to find the information? (As opposed to merely knowing where the information ought to have been recorded.)
>
> Richard.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From richard.wordingham at ntlworld.com Wed Aug 5 17:49:52 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 5 Aug 2015 23:49:52 +0100
Subject: Mongolian Joining Type
In-Reply-To: 
References: <20150802222319.25a2dd7e@JRWUBU2> <20150805203202.0d148761@JRWUBU2>
Message-ID: <20150805234952.52a8fcb2@JRWUBU2>

On Wed, 5 Aug 2015 12:48:59 -0700 Roozbeh Pournader wrote:

> These were the original proposals:
> http://www.unicode.org/L2/L2012/12202-shaping.txt
> http://www.unicode.org/L2/L2012/12360-mong-shaping.txt
> (with considerable UTC discussions).
>
> A good trick is going through the posted UTC minutes and searching for the topic you are interested in. Or just do Google searches, restricting your search to site:unicode.org and adding "L2" to the search string.

So how did you obtain the joining types for U+180E MONGOLIAN VOWEL SEPARATOR (MVS) (specified as U) and U+202F NARROW NO-BREAK SPACE (NNBSP) (defaulting to U)? Did you study the Mongolian variation sequences? Did someone tell you how they behaved?

One problem with your source is that UTC discussions are rarely minuted. The decisions are recorded, but not the reasoning. The other is that I am interested in what the state of affairs was before the change.
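The contextual-shaping question being debated here can be made concrete with a small model. The sketch below is a simplified rendering of the Unicode cursive-joining rules (joining-type names follow ArabicShaping.txt; the function names and the reduction of the rules to two booleans are illustrative, not taken from any shipping engine): a letter's form depends on whether its effective neighbours, after skipping transparent (T) characters, are of a joining-causing type.

```c
#include <assert.h>

/* Joining types as in ArabicShaping.txt: U=non-joining, L=left-joining,
 * R=right-joining, D=dual-joining, C=join-causing, T=transparent. */
typedef enum { JT_U, JT_L, JT_R, JT_D, JT_C, JT_T } JoiningType;
typedef enum { ISOLATED, INITIAL, MEDIAL, FINAL } Form;

/* Effective neighbour of position i in direction dir (+1 or -1),
 * skipping transparent characters; a missing neighbour acts like U. */
static JoiningType neighbour(const JoiningType *s, int len, int i, int dir) {
    for (int j = i + dir; j >= 0 && j < len; j += dir)
        if (s[j] != JT_T)
            return s[j];
    return JT_U;
}

/* Contextual form of a dual-joining letter at position i. */
static Form form_at(const JoiningType *s, int len, int i) {
    JoiningType p = neighbour(s, len, i, -1);
    JoiningType n = neighbour(s, len, i, +1);
    int joins_prev = (p == JT_D || p == JT_L || p == JT_C);
    int joins_next = (n == JT_D || n == JT_R || n == JT_C);
    if (joins_prev && joins_next) return MEDIAL;
    if (joins_prev) return FINAL;
    if (joins_next) return INITIAL;
    return ISOLATED;
}
```

In this model, a vowel after MVS can only take a final form if MVS is skipped (T) or itself joins forward (D or L), and a consonant before MVS can only stay medial if MVS is T, D or R; the intersection, T or D, is exactly the conclusion Richard reaches later in the thread.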
I have a suspicion that no-one had defined the meaning of the joining forms because 'it was obvious'.

There are arguments going round that NNBSP acts as though joining to the following character (the one to the right in horizontal text), which would make it joining type L.

MVS seems a bit of an oddity. The standardized variants make most sense if it is of joining type T ('transparent') or D ('dual_joining'), but a further contextual substitution is still required if there is no variation selector.

Richard.

From richard.wordingham at ntlworld.com Wed Aug 5 21:00:14 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 6 Aug 2015 03:00:14 +0100
Subject: Mongolian Joining Type
In-Reply-To: 
References: <20150802222319.25a2dd7e@JRWUBU2> <20150805203202.0d148761@JRWUBU2> <20150805234952.52a8fcb2@JRWUBU2>
Message-ID: <20150806030014.7aa9906d@JRWUBU2>

On Wed, 5 Aug 2015 17:26:57 -0700 Roozbeh Pournader wrote:

> On Wed, Aug 5, 2015 at 3:49 PM, Richard Wordingham <richard.wordingham at ntlworld.com> wrote:
> > MVS seems a bit of an oddity. The standardized variants make most sense if it is of joining type T ('transparent') or D ('dual_joining'), but a further contextual substitution is still required if there is no variation selector.
> That seems contradictory with what the Core Specification says ...

Please tell me where. I couldn't find anything helpful when I looked.

> ... and how I understand the MVS. Would you please provide examples for it behaving like a T or D?

In isolation, the form of the vowel after MVS is produced by the following combinations:

1820 180B; second form; final # MONGOLIAN LETTER A
1821 180B; second form; final # MONGOLIAN LETTER E

(They're the same glyph.) For these to be a final, as opposed to an isolated form, MVS must be T, D or L.
In isolation, the form of the consonant before MVS is produced by the following combinations:

1828 180C; third form; medial # MONGOLIAN LETTER NA
182C 180D; fourth form; medial # MONGOLIAN LETTER QA
182D 180C; third form; medial # MONGOLIAN LETTER GA
1836 180C; third form; medial # MONGOLIAN LETTER YA

For these to be medial, MVS must be T, D or R. Consequently, MVS must be T or D!

If there are no variation selectors, it doesn't really matter what MVS is, provided the contextual changes triggering on MVS change all four forms (isolated, initial, medial and final).

The Mongolian Baiti font is in the process of abandoning support for the above variations in accordance with a deeply buried proposal to tinker with the encoding of Mongolian. (Unicode string encodings aren't stable until there's a large volume of use or a change would be too embarrassing.)

Richard.

From charupdate at orange.fr Thu Aug 6 02:43:54 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Thu, 6 Aug 2015 09:43:54 +0200 (CEST)
Subject: Windows keyboard restrictions
In-Reply-To: <20150803110238.665a7a7059d7ee80bb4d670165c8327d.301dafd5ac.wbe@email03.secureserver.net>
References: <20150803110238.665a7a7059d7ee80bb4d670165c8327d.301dafd5ac.wbe@email03.secureserver.net>
Message-ID: <32441471.4754.1438847034766.JavaMail.www@wwinf1e15>

A part of the documentation you request is available:

"Download Windows Driver Kit Version 7.1.0 from Official Microsoft Download Center." N. p., 1 Dec. 2014. Web. 1 Dec. 2014.
https://www.microsoft.com/en-us/download/details.aspx?id=11800

C:\WinDDK\7600.16385.1\inc\api\kbd.h
Line 469, and preceding.

---

Note: This file can be accessed on-line on third-party websites. You may wish to try a Bing* search for "kbd.h".

* My preferred internet search engine is Bing ☐ [This ballot box U+2610 is on my keyboard at Shift+Kana+L.]
---

To circumvent the issues arising from the word "bug", we may simply ban that word and focus on a few comments:

 * Ligature is an internal name for Wchar sequences that are
 * generated when a specified key is pressed.
 *
 * The ligatures are all in *one* table. [That is, unlike the key
 * mapping allocation tables, which *can* be more than one, and
 * often are so.]
 * The number of characters of the longest ligature
 * determines the number replacing the ‹n› in
 * static ALLOC_SECTION_LDATA LIGATURE‹n›
 * and in
 * ‹n›,
 * sizeof(aLigature[0]),
 * (PLIGATURE1)aLigature
 * below in the static ALLOC_SECTION_LDATA KBDTABLES.
 *
 * The maximum length of ligatures is 16 characters.
 * Characters from the 17th on are discarded.
 *
 * The ligatures table must be defined for ‹n› characters,
 * whether in kbd.h, or kbdfrfre.h, or here before,
 * using the following define:
 * TYPEDEF_LIGATURE(‹n›)
 * For clarification, a trailing comment is added:
 * // LIGATURE‹n›, *PLIGATURE‹n›
 * Tables for up to 5 characters length are already defined in
 * C:\WinDDK\7600.16385.1\inc\api\kbd.h.
 *
 * The remaining Wchar fields of each ligature that is shorter than
 * the maximum length defined may be filled up with 0xF000, or with
 * WCH_NONE as defined in kbd.h, or NONE if defined in the custom header.
 * These entries may be shortened, especially when the ligatures table
 * is not edited in a single spreadsheet table.

What's new for me is that "sometimes" [scare quotes] the ligature length must not exceed four characters. I already knew what's written in the MSKLC Help about this topic, and I explained in my previous e-mail that, when the MSKLC was built, Windows did not support more than four characters per ligature. (That's the only straightforward explanation of this point of the MSKLC.) As this proved to be insufficient, Microsoft must have decided to raise the limit to sixteen.
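To make the comment block above concrete, here is a user-mode sketch of a five-cell ligature table. The typedef, constants, and helper below are mock stand-ins for the WDK's kbd.h definitions (a real driver takes them from kbd.h and is not plain portable C); the two entries mirror the working AltGr/Shift+AltGr assignments discussed earlier in the thread, padded with WCH_NONE.

```c
#include <assert.h>
#include <stddef.h>

/* Mock stand-ins for kbd.h definitions (illustrative, not the WDK's). */
typedef unsigned short WCHAR16;      /* UTF-16 code unit, like WCHAR */
typedef unsigned char  VKEY;
#define WCH_NONE 0xF000u             /* filler for unused trailing cells */
#define NONE     WCH_NONE
#define VK_OEM_PERIOD 0xBEu

/* Roughly what TYPEDEF_LIGATURE(5) produces: key, shift state, 5 cells. */
typedef struct {
    VKEY    VirtualKey;
    VKEY    ModificationNumber;
    WCHAR16 wch[5];
} LIGATURE5;

/* Braced three periods on AltGr (#3), braced ellipsis on Shift+AltGr (#4),
 * i.e. the combination reported to work. */
static const LIGATURE5 aLigature[] = {
    { VK_OEM_PERIOD, 3, { '[', '.', '.', '.', ']' } },
    { VK_OEM_PERIOD, 4, { '[', 0x2026, ']', NONE, NONE } },
};

/* Count the cells actually used by one ligature entry. */
static size_t lig_len(const LIGATURE5 *l) {
    size_t n = 0;
    while (n < 5 && l->wch[n] != WCH_NONE)
        n++;
    return n;
}
```

Per the comment block, an entry shorter than the table width is padded with WCH_NONE; the thread's observation is that whether a five-cell entry is accepted appears to depend on the shift state, which is exactly what the Bugs/Works pair quoted earlier shows.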
Following some advice, I programmed a ligature of 35 characters on Sat Feb 28, 2015; thus I had the opportunity to see that 16 characters were inserted, while the overflow was discarded. Windows worked normally. The mapped virtual key was VK_OEM_COMMA, and the modification number was 8, that is, Shift+AltGr+Kana (which it still is today).

After the recent event, we may add the following:

 * CAUTION: THE INITIAL MAXIMUM LENGTH OF LIGATURES WAS 4 CHARACTERS.
 * DEPENDING ON THE SHIFT STATE, IT MAY HAPPEN THAT LONGER LIGATURES
 * CAUSE THE KEYBOARD DRIVER TO FAIL.
 * IF THE FAILING DRIVER IS THE ONE OF THE DEFAULT KEYBOARD LAYOUT,
 * THE SYSTEM MAY NOT WORK AS EXPECTED.

I'm hopeful that you will agree upon this formulation, and I hope that helps.

Best regards,

Marcel Schneider

On 03 Aug 2015, at 20:02, Doug Ewell wrote:

> Marcel Schneider wrote:
>
> > The bug on Windows I encountered at the end of July has been definitely identified and reconstructed. After ninety-five drivers compiled since the bug appeared, I can tell so much as that the problem is related to the length of the so-called ligatures. When the MSKLC was built, they were limited to four characters on Windows (see glossary in the MSKLC Help). On my machine the maximal length is 16 characters. The problem is that this is not equal on all shift states and perhaps keys. Roughly, I can put five characters on modification number three, that is normally AltGr, but not on #4 (Shift+AltGr).
>
> As far as I can tell, the limit for a ligature on a Windows keyboard layout is four UTF-16 code points:
>
> MSKLC help, under "Validation Reference":
> "Ligatures cannot contain more than four UTF-16 code points"
>
> Presentation from IUC 23 by Michael Kaplan (author of MSKLC) and Cathy Wissink:
> http://tinyurl.com/o49r4bz
>
> KbdEdit:
> http://www.kbdedit.com/
>
> MUFI:
> http://folk.uib.no/hnooh/mufi/keyboards/WinXPkeyboard.html
>
> I understand that there are some tools (such as Keyboard Layout Manager) that claim a higher limit, and it may even be possible in some cases to assign more than four, but the DOCUMENTED limit appears to be four. (If you claim that it is not, please provide a link to the relevant official documentation, and note that C++ code showing 16 fields is not documentation.)
>
> It is not a bug for software to fail to perform BEYOND its documented limits.
>
> Since you are so very eager to declare this a bug, or a collection of bugs, rather than a design limitation, I strongly recommend you get in touch with Microsoft Technical Support and express your concerns to them. Make sure to let them know just how certain you are that these are bugs. See if they'll send you a T-shirt.
>
> --
> Doug Ewell | http://ewellic.org | Thornton, CO

From charupdate at orange.fr Thu Aug 6 08:29:14 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Thu, 6 Aug 2015 15:29:14 +0200 (CEST)
Subject: Windows keyboard restrictions
Message-ID: <1232752979.9255.1438867754224.JavaMail.www@wwinf1h09>

I've got a bug in my mailbox. While taking care to send my e-mail in plain text, I got it converted somehow to HTML, with all "tags" disappearing. So I ended up replacing all < and > by single angle quotation marks. That seems safer than converting them to HTML codes. Perhaps I shouldn't call it a bug, but just say that I don't know how to use a mailbox. Sorry to send it twice. N.B.
I'll send two others in reply to the emoji thread, I just can't do it all at once. ___________________________________________________________________ A part of the documentation you request is available: "Download Windows Driver Kit Version 7.1.0 from Official Microsoft Download Center." N. p., 1 Dec. 2014. Web. 1 Dec. 2014. https://www.microsoft.com/en-us/download/details.aspx?id=11800 C:\WinDDK\7600.16385.1\inc\api\kbd.h Line 469, and preceding. --- Note: This file can be accessed on-line on third party websites. You may wish to try a Bing* search for "kbd.h". * My preferred internet search engine is Bing ☐ [This ballot box U+2610 is on my keyboard at Shift+Kana+L.] --- To circumvent the issues arising from the word "bug", we may simply ban that and focus on a few comments:
* Ligature is an internal name for Wchar sequences that are
* generated when a specified key is pressed.
*
* The ligatures are all in *one* table.
* The number of characters of the longest ligature
* determines the number replacing the "n" in
* static ALLOC_SECTION_LDATA LIGATURE"n"
* and in
* "n",
* sizeof(aLigature[0]),
* (PLIGATURE1)aLigature
* below in the static ALLOC_SECTION_LDATA KBDTABLES.
*
* The maximum length of ligatures is 16 characters.
* Characters from 17th on are discarded.
*
* The ligatures table must be defined for "n" characters,
* whether in kbd.h, or kbdfrfre.h, or here before,
* using the following define:
* TYPEDEF_LIGATURE("n")
* For clarification, a trailing comment is added:
* // LIGATURE"n", *PLIGATURE"n"
* Tables for up to 5 characters length are already defined in
* C:\WinDDK\7600.16385.1\inc\api\kbd.h.
*
* The remaining Wchar fields of each ligature that is shorter than
* the maximum length defined may be filled up with 0xF000, or with
* WCH_NONE as defined in kbd.h, or NONE if defined in the custom header.
* These entries may be shortened, especially when the ligatures table
* is not edited in a single spreadsheet table.
What's new for me is that "sometimes"
[scare quotes], the ligature length must not exceed four characters. I already knew what's written in the MSKLC Help about this topic, and I explained in my previous e-mail that, when the MSKLC was built, Windows did not support more than four characters per ligature. (That's the only straightforward explanation of this point of the MSKLC.) As this proved to be insufficient, Microsoft must have decided to raise the limit to sixteen. Following some advice I programmed a ligature of 35 characters on Sat Feb 28, 2015; thus I've got the opportunity of seeing that 16 characters were inserted, while the overflow was discarded. Windows worked normally. The mapped virtual key was VK_OEM_COMMA, and the modification number was 8, that was Shift+AltGr+Kana (which it is still today). After the recent event, we may add the following:
* CAUTION: THE INITIAL MAXIMUM LENGTH OF LIGATURES WAS 4 CHARACTERS.
* DEPENDING ON THE SHIFT STATE, IT MAY HAPPEN THAT LONGER LIGATURES
* CAUSE THE KEYBOARD DRIVER TO FAIL.
* IF THE FAILING DRIVER IS THE ONE OF THE DEFAULT KEYBOARD LAYOUT,
* THE SYSTEM MAY NOT WORK AS EXPECTED.
I'm hopeful that you will agree upon this formulation, and I hope that helps. Best regards, Marcel Schneider On 03 Aug 2015, at 20:02, Doug Ewell wrote: > Marcel Schneider wrote: > > > The bug on Windows I encountered at the end of July has been > > definitely identified and reconstructed. After ninety-five drivers > > compiled since the bug appeared, I can tell so much as that the > > problem is related to the length of the so-called ligatures. When the > > MSKLC was built, they were limited to four characters on Windows (see > > glossary in the MSKLC Help). On my machine the maximal length is 16 > > characters. The problem is that this is not equal on all shift states > > and perhaps keys. Roughly, I can put five characters on modification > > number three, that is normally AltGr, but not on #4 (Shift+AltGr).
> > As far as I can tell, the limit for a ligature on a Windows keyboard > layout is four UTF-16 code points: > > MSKLC help, under "Validation Reference": > "Ligatures cannot contain more than four UTF-16 code points" > > Presentation from IUC 23 by Michael Kaplan (author of MSKLC) and Cathy > Wissink: > http://tinyurl.com/o49r4bz > > KbdEdit: > http://www.kbdedit.com/ > > MUFI: > http://folk.uib.no/hnooh/mufi/keyboards/WinXPkeyboard.html > > I understand that there are some tools (such as Keyboard Layout Manager) > that claim a higher limit, and it may even be possible in some cases to > assign more than four, but the DOCUMENTED limit appears to be four. (If > you claim that it is not, please provide a link to the relevant official > documentation, and note that C++ code showing 16 fields is not > documentation.) > > It is not a bug for software to fail to perform BEYOND its documented > limits. > > Since you are so very eager to declare this a bug, or a collection of > bugs, rather than a design limitation, I strongly recommend you get in > touch with Microsoft Technical Support and express your concerns to > them. Make sure to let them know just how certain you are that these are > bugs. See if they'll send you a T-shirt. > > -- > Doug Ewell | http://ewellic.org | Thornton, CO > > > From doug at ewellic.org Thu Aug 6 11:00:21 2015 From: doug at ewellic.org (Doug Ewell) Date: Thu, 06 Aug 2015 09:00:21 -0700 Subject: Windows keyboard restrictions Message-ID: <20150806090021.665a7a7059d7ee80bb4d670165c8327d.dee6bc8c58.wbe@email03.secureserver.net> Marcel Schneider wrote: > A part of the documentation you request is available: > > "Download Windows Driver Kit Version 7.1.0 from Official Microsoft > Download Center." N. p., 1 Dec. 2014. Web. 1 Dec. 2014. > > https://www.microsoft.com/en-us/download/details.aspx?id=11800 > > C:\WinDDK\7600.16385.1\inc\api\kbd.h > > Line 469, and preceding.
This snippet of code -- again, code is not documentation -- is the only place in the entire DDK that gives any indication that anyone thought a ligature of more than 4 code points could be valid:
> #define TYPEDEF_LIGATURE(n) typedef struct _LIGATURE##n { \
> BYTE VirtualKey; \
> WORD ModificationNumber; \
> WCHAR wch[n]; \
> } LIGATURE##n, *KBD_LONG_POINTER PLIGATURE##n;
> [...]
> TYPEDEF_LIGATURE(5) // LIGATURE5, *PLIGATURE5;
No code within the DDK, including the samples, appears to use TYPEDEF_LIGATURE(5) or any larger value. So I don't see any evidence in the code that the DDK actually supports ligatures longer than 4 code points. > To circumvent the issues arising from the word "bug", we may simply > ban that and focus on a few comments:
> * Ligature is an internal name for Wchar sequences that are
> * generated when a specified key is pressed.
> [...]
> * The maximum length of ligatures is 16 characters.
> * Characters from 17th on are discarded.
I can't find this text anywhere within the DDK (not even the substrings "Wchar sequences" or "length of ligatures"), unless for some reason it's in UTF-16 encoded text. So I also don't see any documentation that the DDK supports ligatures longer than 4 code points. > What's new for me, is that "sometimes" [scare quotes], the ligature > length must not exceed four characters. I already knew what's > written in the MSKLC Help about this topic, and I explained in my > previous e-mail that, when the MSKLC was built, Windows did not > support more than four characters per ligature. (That's the only > straightforward explanation of this point of the MSKLC.) As this > proved to be insufficient, Microsoft must have decided to raise the > limit to sixteen. Speculation is also not documentation. Seriously, please take this to Microsoft or to one of the forums where the Driver Development Kit is discussed. This has nothing to do with Unicode. -- Doug Ewell | http://ewellic.org | Thornton, CO
From richard.wordingham at ntlworld.com Thu Aug 6 12:32:32 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 6 Aug 2015 18:32:32 +0100 Subject: Windows keyboard restrictions In-Reply-To: <20150806090021.665a7a7059d7ee80bb4d670165c8327d.dee6bc8c58.wbe@email03.secureserver.net> References: <20150806090021.665a7a7059d7ee80bb4d670165c8327d.dee6bc8c58.wbe@email03.secureserver.net> Message-ID: <20150806183232.6f2e4b5e@JRWUBU2> On Thu, 06 Aug 2015 09:00:21 -0700 "Doug Ewell" wrote: > Seriously, please take this to Microsoft or to one of the forums where > the Driver Development Kit is discussed. This has nothing to do with > Unicode. That depends on the availability of Tavultesoft Keyman. The UK has been discussing whether a certain user-perceived character should be encoded as a single character in a new script. Users ought to have this character on their keyboards, but there is a worry about technical problems if it is encoded as a sequence of three characters, i.e. six UTF-16 code units. If Windows easily supports a ligature of six UTF-16 code units, then one argument for encoding it is eliminated. Richard. From doug at ewellic.org Thu Aug 6 12:56:51 2015 From: doug at ewellic.org (Doug Ewell) Date: Thu, 06 Aug 2015 10:56:51 -0700 Subject: Windows keyboard restrictions Message-ID: <20150806105651.665a7a7059d7ee80bb4d670165c8327d.8ed83d4adc.wbe@email03.secureserver.net> Richard Wordingham wrote: > The UK has been discussing whether a certain user-perceived character > should be encoded as a single character in a new script. Users ought > to have this character on their keyboards, but there is a worry about > technical problems if it is encoded as a sequence of three characters, > i.e. six UTF-16 code units. What is this character? Is it currently encoded as three SMP characters? What are they? -- Doug Ewell | http://ewellic.org | Thornton, CO ???? 
From jcb+unicode at inf.ed.ac.uk Thu Aug 6 13:08:14 2015 From: jcb+unicode at inf.ed.ac.uk (Julian Bradfield) Date: Thu, 6 Aug 2015 19:08:14 +0100 (BST) Subject: Windows keyboard restrictions References: <20150806090021.665a7a7059d7ee80bb4d670165c8327d.dee6bc8c58.wbe@email03.secureserver.net> <20150806183232.6f2e4b5e@JRWUBU2> Message-ID: On 2015-08-06, Richard Wordingham wrote: > That depends on the availability of Tavultesoft Keyman. The UK has been > discussing whether a certain user-perceived character should be encoded > as a single character in a new script. Users ought to have this > character on their keyboards, but there is a worry about technical > problems if it is encoded as a sequence of three characters, i.e. six > UTF-16 code units. If Windows easily supports a ligature of six UTF-16 > code units, then one argument for encoding it is eliminated. Unicode is supposed to be for the (sadly probably rather short) life of human civilization, until we have no more need for text. Using an ephemeral property of an ephemeral operating system for ephemeral computers in an encoding argument makes no sense. -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From charupdate at orange.fr Thu Aug 6 15:09:32 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Thu, 6 Aug 2015 22:09:32 +0200 (CEST) Subject: Emoji characters for food allergens In-Reply-To: <55BFF817.2070401@kli.org> References: <1310487596.11828.1438612211991.JavaMail.www@wwinf1g28> <55BFF817.2070401@kli.org> Message-ID: <148825130.21047.1438891772329.JavaMail.www@wwinf1j18> On 04 Aug 2015, at 01:24, Mark E. Shoulson wrote: > if you're interested in this kind of pictographic pidgin, take a look at https://www.kwikpoint.com/ Someone already did some of it. Personally I can't do, nor support thoroughly, anything about pictographic language. I just answer some e-mails.
I'm too busy with implementing a relatively modest subset of Unicode at keyboard driver level. But it's indeed very interesting to learn about this already thriving publishing, and I'm glad that even human lives have been saved thanks to this new way to overcome the language barrier. So as this post refers to one of my replies, I assume the task of thanking Mr Shoulson for this information. I note that Kwikpoint Instructional Systems are used notably when there is no means of calling a translator, as a pragmatic approach like in the example on the home page, where body language is used to complete the item on which it is understandable. By a lucky coincidence, this example has been performed in the country of ancient Babel. Applying the point-to-picture method to food allergens, one could wish to point to skull-and-crossbones, then to an ear of wheat or a loaf of bread, then to an egg, then to a cheese wedge or a glass of milk, then to a lupin flower, then to some kinds of nuts, finishing with skull-and-crossbones again. Because as Mr Freytag points out, the allergen meaning of a food symbol cannot be induced safely enough. And as he explains, it's desirable that the needed symbols be at least highly iconic, and ideally regulated by other standards bodies than Unicode. I do wish that Kwikpoint be so successful that the symbols it creates for missing items become widely popular. Best regards, Marcel On 04 Aug 2015, at 01:24, Mark E. Shoulson wrote: > On 08/03/2015 10:30 AM, Marcel Schneider wrote: > > On 29 Jul 2015, at 15:42, William_J_G Overington wrote: > > > >> Emoji seemed like a wonderful way to achieve communication through the language barrier. > > We remember that Esperanto was also a hopeful way to unify the language, raising much enthusiasm among its followers. IMHO a pictograph based script can hardly be enough performing, unless it ends up to become a kind of new Esperanto except that it doesn't include speech.
> > It's already noted that this is totally out of scope for Unicode, but if > you're interested in this kind of pictographic pidgin, take a look at > https://www.kwikpoint.com/ Someone already did some of it. > > ~mark > > From charupdate at orange.fr Thu Aug 6 15:59:20 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Thu, 6 Aug 2015 22:59:20 +0200 (CEST) Subject: Emoji characters for food allergens In-Reply-To: <27139592.47254.1438882560402.JavaMail.defaultUser@defaultHost> References: <1310487596.11828.1438612211991.JavaMail.www@wwinf1g28> <55BFF817.2070401@kli.org> <27139592.47254.1438882560402.JavaMail.defaultUser@defaultHost> Message-ID: <640641349.21541.1438894760208.JavaMail.www@wwinf1j28> I believe that standardizing A numbers for allergens, as Mr Overington suggests commenting the interesting blog post he's sharing, is an excellent idea. I think so because, should these numbers be added to the names, this would be even better than the E numbers, behind which often the additives are hidden, some of which have long-term harmful effects, and the continuous consumption of most of which causes overall health issues. Consequently, there are good solutions on-going, so that the support Unicode cannot safely provide will be replaced. Best regards, Marcel > Message of 06/08/15 19:36 > From: "William_J_G Overington" > To: "Marcel Schneider" , mark at kli.org, komatsu at google.com, unicode at unicode.org, gwalla at gmail.com > Cc: > Subject: Re: Emoji characters for food allergens > > Please may I draw to your attention the following blog post. > > http://www.michellesblog.co.uk/emoji-ing-food-allergens/ > > The blog is by the same lady that runs the following website, a specialist website about food allergens and freefrom food. > > http://www.foodsmatter.com/ > > William Overington > > 6 August 2015 > > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From richard.wordingham at ntlworld.com Thu Aug 6 16:56:46 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 6 Aug 2015 22:56:46 +0100 Subject: Windows keyboard restrictions In-Reply-To: <20150806105651.665a7a7059d7ee80bb4d670165c8327d.8ed83d4adc.wbe@email03.secureserver.net> References: <20150806105651.665a7a7059d7ee80bb4d670165c8327d.8ed83d4adc.wbe@email03.secureserver.net> Message-ID: <20150806225646.46f94c6d@JRWUBU2> On Thu, 06 Aug 2015 10:56:51 -0700 "Doug Ewell" wrote: > Richard Wordingham wrote: > > > The UK has been discussing whether a certain user-perceived > > character should be encoded as a single character in a new script. > > Users ought to have this character on their keyboards, but there is > > a worry about technical problems if it is encoded as a sequence of > > three characters, i.e. six UTF-16 code units. > > What is this character? Is it currently encoded as three SMP > characters? What are they? It's part of an unencoded, living script. There is no suitable contiguous place for the script in the BMP. There is a set of characters within the script that appear to be sequences of three characters, and encoding these characters as single elements almost makes about as much sense as encoding English on the basis that it represents the sound [hw], not the sound [wh]. Several of the sequences of three characters occur in the region's language of high culture and religion, which apparently is also written in the script. The 'UK has been discussing' means there has been discussion of what position the UK should take over this set of characters in the ISO 10646 amendment process. Richard. From doug at ewellic.org Thu Aug 6 17:31:57 2015 From: doug at ewellic.org (Doug Ewell) Date: Thu, 06 Aug 2015 15:31:57 -0700 Subject: Windows keyboard restrictions Message-ID: <20150806153157.665a7a7059d7ee80bb4d670165c8327d.3746d61c4e.wbe@email03.secureserver.net> Richard Wordingham wrote: > It's part of an unencoded, living script. 
There is no suitable > contiguous place for the script in the BMP. There is a set of > characters within the script that appear to be sequences of three > characters, and encoding these characters as single elements almost > makes about as much sense as encoding English on the basis that > it represents the sound [hw], not the sound [wh]. Several of the > sequences of three characters occur in the region's language of high > culture and religion, which apparently is also written in the script. If this is about murmured consonants in Newa, the arguments presented in L2/14-281, both for and against, seem more relevant than whether a cluster of three SMP characters can fit on a single key in a Windows keyboard layout. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From wjgo_10009 at btinternet.com Thu Aug 6 12:36:00 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Thu, 6 Aug 2015 18:36:00 +0100 (BST) Subject: Emoji characters for food allergens In-Reply-To: <55BFF817.2070401@kli.org> References: <1310487596.11828.1438612211991.JavaMail.www@wwinf1g28> <55BFF817.2070401@kli.org> Message-ID: <27139592.47254.1438882560402.JavaMail.defaultUser@defaultHost> Please may I draw to your attention the following blog post. http://www.michellesblog.co.uk/emoji-ing-food-allergens/ The blog is by the same lady that runs the following website, a specialist website about food allergens and freefrom food.. http://www.foodsmatter.com/ William Overington 6 August 2015 From philip_chastney at yahoo.com Fri Aug 7 05:01:55 2015 From: philip_chastney at yahoo.com (philip chastney) Date: Fri, 7 Aug 2015 03:01:55 -0700 Subject: Windows keyboard restrictions Message-ID: <1438941715.78626.YahooMailBasic@web181504.mail.ne1.yahoo.com> -------------------------------------------- On Thu, 6/8/15, Julian Bradfield wrote: .> On 2015-08-06, Richard Wordingham > > wrote: > > That depends on the availability of Tavultesoft Keyman.? 
The UK has been > > discussing whether a certain user-perceived character should be encoded > > as a single character in a new script. Users ought to have this > > character on their keyboards, but there is a worry about technical > > problems if it is encoded as a sequence of three characters, i.e. six > > UTF-16 code units. If Windows easily supports a ligature of six UTF-16 > > code units, then one argument for encoding it is eliminated. > Unicode is supposed to be for the (sadly probably rather short) life > of human civilization, until we have no more need for text. Using an > ephemeral property of an ephemeral operating system for ephemeral > computers in an encoding argument makes no sense. requirements, too, can be ephemeral the Oxford English Dictionary aims to include every word in "general use" since Chaucer, where "general use" means it was continuously used in that sense for a minimum of 10 years (or something along those lines) when "ghettoblaster" was included, the story made it into the newspapers -- when did you last even see a ghettoblaster? but still, a definition may be useful for somebody in fifty years' time writing a survey of English novels from the 1980s, so the word's inclusion is justified I also remember last Christmas being surprised to see a dingbat in use -- will all those dingbats in Unicode be of use in a few years' time? will emoji?
/phil From doug at ewellic.org Fri Aug 7 11:26:56 2015 From: doug at ewellic.org (Doug Ewell) Date: Fri, 07 Aug 2015 09:26:56 -0700 Subject: Windows keyboard restrictions Message-ID: <20150807092656.665a7a7059d7ee80bb4d670165c8327d.154138f225.wbe@email03.secureserver.net> Michael Kaplan, author of MSKLC, reports that not only is the limit on UTF-16 code points in a Windows keyboard ligature still 4, it is likely to remain 4 for the foreseeable future: http://www.siao2.com/2015/08/07/8770668856267196989.aspx "People who want input methods capable of handling more than four UTF-16 code points really need to look into IMEs (Input Method Editors) which are all now run through TSF (the Text Services Framework), a completely different system of input that allows such things, admittedly at the price of a lot of complexity." This should settle the matter. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From richard.wordingham at ntlworld.com Fri Aug 7 13:54:15 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 7 Aug 2015 19:54:15 +0100 Subject: Windows keyboard restrictions In-Reply-To: <20150807092656.665a7a7059d7ee80bb4d670165c8327d.154138f225.wbe@email03.secureserver.net> References: <20150807092656.665a7a7059d7ee80bb4d670165c8327d.154138f225.wbe@email03.secureserver.net> Message-ID: <20150807195415.4c725da4@JRWUBU2> On Fri, 07 Aug 2015 09:26:56 -0700 "Doug Ewell" wrote: > Michael Kaplan, author of MSKLC, reports that not only is the limit on > UTF-16 code points in a Windows keyboard ligature still 4, it is > likely to remain 4 for the foreseeable future: > > http://www.siao2.com/2015/08/07/8770668856267196989.aspx It's good to see he's still with us. 
> "People who want input methods capable of handling more than four > UTF-16 code points really need to look into IMEs (Input Method > Editors) which are all now run through TSF (the Text Services > Framework), a completely different system of input that allows such > things, admittedly at the price of a lot of complexity." What we're waiting for is a guide we can follow, or some code we can ape. Such should be, or should have been, available in a Tavultesoft Keyman rip-off. In the mean time, I notice Micha Kaplan's comment: "even if there were, such a keyboard layout would not be compatible with any prior version of Windows;" I think that is exactly what Marcel Schneider encountered. Note further that Micha implied that he got the specification by reading a header file, exactly the sort of documentation you disallowed. The data structure (field cbLgEntry) allows for arbitrary lengths; its precise semantics may have been established by experiment. It is possible that it may have been broken for arbitrary sizes and has now been fixed. > This should settle the matter. MSKLC doesn't seem to be liked by Microsoft. Quite possibly they would like to get rid of the interface its keyboards generate. Supporting such user-defined keyboards may just be an overhead for them. Any comment from the Microsoft employees? Richard. 
From charupdate at orange.fr Fri Aug 7 15:40:55 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Fri, 7 Aug 2015 22:40:55 +0200 (CEST) Subject: Windows keyboard restrictions In-Reply-To: <20150807092656.665a7a7059d7ee80bb4d670165c8327d.154138f225.wbe@email03.secureserver.net> References: <20150807092656.665a7a7059d7ee80bb4d670165c8327d.154138f225.wbe@email03.secureserver.net> Message-ID: <1609214653.16412.1438980055712.JavaMail.www@wwinf1h15> On 07 Aug 2015, at 18:38, Doug Ewell wrote: > Michael Kaplan, author of MSKLC, reports that not only is the limit on > UTF-16 code points in a Windows keyboard ligature still 4, it is likely > to remain 4 for the foreseeable future: > > http://www.siao2.com/2015/08/07/8770668856267196989.aspx > > "People who want input methods capable of handling more than four UTF-16 > code points really need to look into IMEs (Input Method Editors) which > are all now run through TSF (the Text Services Framework), a completely > different system of input that allows such things, admittedly at the > price of a lot of complexity." > > This should settle the matter. I wouldn't have made a "battle" of that. Please note that I'm quoting somebody else; these quotes cannot be mistaken for scare quotes (which BTW would probably have been more appropriate, and thus more expected). And I wouldn't have answered any more. I just don't want to let the Mailing List believe that I agreed to being classified as "fighting the [bad] fight", if not even as a "bad boy" (which isn't quoted from here). So unfortunately I can't help replying again. For all "documentation", this somewhat vulgar blog post that is being shared cites internal references (other blog posts from the same author on the same web site). The header file it refers to remains unquoted and unlinked. Thus, this blog post is biased with the authority bias.
I'm not quite sure whether people are conscious that by contesting the accuracy of the original actual Windows keyboard driver header file (kbd.h), they are insulting the developer(s) who wrote it, as well as the company that stands behind him/them. For not wanting to make anybody lose face, I didn't mention that a copy of the cited and quoted header file is included in the MSKLC. Version 1.4 of the latter dates from Thu, Jan 25, 2007, 23:14:22, whereas the included kbd.h shows "10-Jan-1991 GregoryW" in the file history. Therefore, my supposition (I hadn't looked that up!) that "when the MSKLC was built, Windows did not support more than four characters per ligature. (That's the only straightforward explanation of this point of the MSKLC.)" turns out to be completely wrong (except the parenthesized disclaimer). I could become more explicit, but I just stand away in order not to heat up the discussion with ad personam conclusions. At this unexpected point of the thread, I'm extremely sickened. At the same time, the shared blog post helps me to understand a bit better some asperities of the overall (most of the time) rather sympathetic MSKLC. I often wondered why the description page [https://msdn.microsoft.com/fr-fr/goglobal/bb964665.aspx] and, still less, the download page [https://www.microsoft.com/en-us/download/details.aspx?id=22339] have not been updated (no mention of Windows 8 on the former, no mention even of Windows 7 in the system requirements on the download page), and why there's no 2.0 version of the MSKLC. Most times I answered to myself that the little interest on users' side discouraged Microsoft from investing in such an update. That's now to be revised. I'd never imagined that a limitation in the MSKLC (not the only, but the most striking one) could be justified and defended the way it is. IMO it would have been wise to limit this thread to "Ligature length on Windows".
Now that it extended to all "Windows keyboard limitations", let's extend a bit more to prevent further disruptions. I'm not here to criticize Microsoft. I ask everybody to be honest and to answer for himself one single question: How on earth could I prefer Bing if I were battling against Microsoft? Does anybody really believe that I enjoy finding more bugs? So please remember that over time, the Redmond company got the unlucky reputation of not listening to its users. I've got the strong hope that this tendency has been reversed, but I still believe that as far as Unicode implementation is concerned, the Unicode Mailing List is one of the best places to send the topic. I still believe it today, as this thread has taught me a lot. Hopeful that this will end in a constructive way, Marcel Schneider -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Fri Aug 7 15:59:16 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Fri, 7 Aug 2015 22:59:16 +0200 (CEST) Subject: Windows keyboard restrictions In-Reply-To: <20150807195415.4c725da4@JRWUBU2> References: <20150807092656.665a7a7059d7ee80bb4d670165c8327d.154138f225.wbe@email03.secureserver.net> <20150807195415.4c725da4@JRWUBU2> Message-ID: <1166060353.16668.1438981156446.JavaMail.www@wwinf1h15> On 07 Aug 2015, at 21:04, Richard Wordingham wrote: > On Fri, 07 Aug 2015 09:26:56 -0700 > "Doug Ewell" wrote: > > > Michael Kaplan, author of MSKLC, reports that not only is the limit on > > UTF-16 code points in a Windows keyboard ligature still 4, it is > > likely to remain 4 for the foreseeable future: > > > > http://www.siao2.com/2015/08/07/8770668856267196989.aspx > > It's good to see he's still with us. I had a high opinion of the authors of the MSKLC. And I still had when I learned here that MSKLC is the work of a single author. I won't say more.
> > > "People who want input methods capable of handling more than four > > UTF-16 code points really need to look into IMEs (Input Method > > Editors) which are all now run through TSF (the Text Services > > Framework), a completely different system of input that allows such > > things, admittedly at the price of a lot of complexity." Referring people to complex IMEs while a simple solution is (or could be) available at little expense is symptomatic of user-unfriendly software management. I brought the good news that SIXTEEN UNICODE CODE POINTS can be generated by a single key stroke on Windows six dot one. The only bad news, because of which I e-mailed the List, is that it wasn't working in one single circumstance. It was obvious that the main thing to do is to inform about this fact, so that other people needn't search for a bug in the driver if it's only that. > > What we're waiting for is a guide we can follow, or some code we can > ape. Such should be, or should have been, available in a Tavultesoft > Keyman rip-off. > > In the mean time, I notice Micha Kaplan's comment: > > "even if there were, such a keyboard layout would not be compatible with > any prior version of Windows;" > > I think that is exactly what Marcel Schneider encountered. Not really. We are talking about a ligature feature that was programmed in 1991. So it may be that the same event is likely to occur on Windows Seven and later. But Mr Kaplan is addressing, as "prior", Windows up to Eight (dot one). > Note further that Micha implied that he got the specification by reading a > header file, exactly the sort of documentation you disallowed. > > The data structure (field cbLgEntry) allows for arbitrary lengths; its > precise semantics may have been established by experiment.
Without any false modesty I can tell that I established a limit as far as my machine is concerned, and that this limit is 16 characters per ligature; now I stated some exception but that doesn't invalidate the principle. To say it all, I have actually one ligature with 16 characters, one with 15, about one with 7 and so on. > It is possible that it may have been broken for arbitrary sizes and has now > been fixed. > > > This should settle the matter. > > MSKLC doesn't seem to be liked by Microsoft. Quite possibly they would > like to get rid of the interface its keyboards generate. Supporting > such user-defined keyboards may just be an overhead for them. Any > comment from the Microsoft employees? I'm impatient to read this comment, and I'm joining my expectations to Mr Wordingham's. Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Fri Aug 7 16:34:47 2015 From: doug at ewellic.org (Doug Ewell) Date: Fri, 07 Aug 2015 14:34:47 -0700 Subject: Windows keyboard restrictions Message-ID: <20150807143447.665a7a7059d7ee80bb4d670165c8327d.2c6723ffdc.wbe@email03.secureserver.net> Richard Wordingham wrote: > It's good to see he's still with us. Still out there, just not on this list. > What we're waiting for is a guide we can follow, or some code we can > ape. Such should be, or should have been, available in a Tavultesoft > Keyman rip-off. I guess you mean that such a guide would expose the Windows infrastructure to help explain the inner workings of the third-party software. That would be informative only to the extent it was accurate. > In the mean time, I notice Micha Kaplan's comment: > > "even if there were, such a keyboard layout would not be compatible > with any prior version of Windows;" > > I think that is exactly what Marcel Schneider encountered. Sort of. 
Michael was saying that IF the Windows limit had been increased from 4 to something higher, and someone implemented a keyboard taking advantage of that higher limit, it would not work on older versions. Marcel implemented a keyboard that took advantage of a higher limit which never existed on Windows, so it doesn't work on ANY version.

But wait! Didn't he say it worked some of the time, with some shift states but not others? Sure it did, for the same reason that a buffer overrun in C doesn't always cause a program crash or a security hole. Sometimes, if you're lucky, the memory being overwritten doesn't contain critical data at the time of the overwrite. Sometimes you're not so lucky.

> Note further that Michael implied that he got the specification by
> reading a header file, exactly the sort of documentation you
> disallowed.

I wasn't looking for documentation that the well-known limit of 4 existed in the first place, or had not been changed. I was looking for documentation that it HAD been changed. That's where the burden of proof lies. Michael probably has more extensive expert knowledge of the Windows keyboard subsystem than anyone else, which is why I asked him.

> The data structure (field cbLgEntry) allows for arbitrary lengths; its
> precise semantics may have been established by experiment. It is
> possible that it may have been broken for arbitrary sizes and has now
> been fixed.

I don't know what "has now been fixed" means. I haven't seen any evidence that anything about this has changed since the '90s.

> MSKLC doesn't seem to be liked by Microsoft. Quite possibly they would
> like to get rid of the interface its keyboards generate. Supporting
> such user-defined keyboards may just be an overhead for them.

I doubt they ever have to provide support for user-defined keyboards. I see that MSKLC itself "is distributed 'as is', with no obligations or technical support from Microsoft Corporation."
If we're speculating on Microsoft's intent, my guess is that the move to TSF is some sort of attempt to consolidate desktop, tablet, and phone keyboard behavior into a single framework. I confess I don't know much about TSF.

> Any comment from the Microsoft employees?

--
Doug Ewell | http://ewellic.org | Thornton, CO ????

From doug at ewellic.org Fri Aug 7 17:01:54 2015
From: doug at ewellic.org (Doug Ewell)
Date: Fri, 07 Aug 2015 15:01:54 -0700
Subject: Windows keyboard restrictions
Message-ID: <20150807150154.665a7a7059d7ee80bb4d670165c8327d.43590305ac.wbe@email03.secureserver.net>

Marcel Schneider wrote:

> I just don't want to let the Mailing List believe that I agreed to being
> classified as "fighting the [bad] fight"

And I don't think Michael implied that. I just want to get the technical facts, so that hopefully they can take the place of speculation and presumptions of buggy behavior.

Please note that "bug" is not necessarily a bad word -- we developers create real bugs all the time, and hopefully own up to them -- but it rankles when someone applies the word to software that works as designed, when that person either doesn't understand or doesn't agree with the intended behavior.

> Thus, this blog post is biased with the authority bias.

It's from someone at Microsoft with expert knowledge of the Windows keyboard subsystem, if that's what you mean.

> I'm not quite sure whether people are conscious that by contesting the
> accuracy of the original actual Windows keyboard driver header file
> (kbd.h), they are insulting the developer(s) who wrote it, as well as
> the company that stands behind him/them.

kbd.h contains exactly zero examples of keyboards with ligatures of more than 4 code points. I downloaded and installed the whole DDK just to find this out, not realizing I already had a copy in my MSKLC folder.

> For not wanting to make anybody lose face, I didn't mention that a
> copy of the cited and quoted header file is included in the MSKLC.
Yep, I could have saved a lot of time if I'd noticed that.

> The version 1.4 of which dates from Thu, Jan 25, 2007, 23:14:22,
> whereas the included kbd.h shows "10-Jan-1991 GregoryW" in the file
> history.

My copy says:

@@BEGIN_DDKSPLIT
 * 10-Jan-1991 GregoryW
 * 23-Apr-1991 IanJa VSC_TO_VK _* macros from oemtab.c
@@END_DDKSPLIT

Looks like things have been pretty stable since 1991.

> Therefore, my supposition (I hadn't looked that up!) that "when the
> MSKLC was built, Windows did not support more than four characters per
> ligature. (That's the only straightforward explanation of this point
> of the MSKLC.)" turns out to be completely wrong (except the
> parenthesized disclaimer). I could become more explicit, but I just
> stand back in order not to heat up the discussion with ad personam
> conclusions.

Good idea.

What led you to the conclusion that this limit had been increased, anyway? ("On my machine the maximal length is 16 characters.") I'm still curious about that.

> I often wondered why the description page
> [https://msdn.microsoft.com/fr-fr/goglobal/bb964665.aspx] and even
> more the download page
> [https://www.microsoft.com/en-us/download/details.aspx?id=22339] have
> not been updated (no mention of Windows 8 on the former, no mention
> even of Windows 7 in the system requirements on the download page),
> and why there's no 2.0 version of the MSKLC.

Microsoft simply hasn't dedicated any resources (Michael or anyone else) to updating MSKLC. Michael has blogged about this many, many times in the past few years. Big companies make the decisions that they make, for the reasons they have.

> I'm not here to criticize Microsoft. I ask everybody to be honest and
> to answer for himself one single question: How on earth could I prefer
> Bing if I were battling against Microsoft? Does anybody really believe
> that I'm annoying myself to find more bugs?

I apologize for my tone in this thread.
See my explanation above of when "bug" is an appropriate conclusion to draw, and when it isn't. That got me started.

> Hopeful that this will end in a constructive way,

Agreed.

--
Doug Ewell | http://ewellic.org | Thornton, CO ????

From doug at ewellic.org Fri Aug 7 17:21:16 2015
From: doug at ewellic.org (Doug Ewell)
Date: Fri, 07 Aug 2015 15:21:16 -0700
Subject: Windows keyboard restrictions
Message-ID: <20150807152116.665a7a7059d7ee80bb4d670165c8327d.f3e8563f5d.wbe@email03.secureserver.net>

Marcel Schneider wrote:

> I brought the good news that SIXTEEN UNICODE CODE POINTS can be
> generated by a single key stroke on Windows six dot one. The only bad
> news, because of which I've e-mailed to the List, is that that wasn't
> working in one single circumstance. It was obvious that the main thing
> to do, is to inform about this fact, so that other people mustn't
> search for a bug in the driver if it's only that.

But that's what I've been trying to say. The maximum isn't 16, it's 4. "That wasn't working" is the expected behavior here.

If you were able to create a keyboard layout where 16 code points ever worked on Windows 7 (which reports itself as "6.1"), it was purely by accident -- because Windows 7 did not check for the overrun, and because the overrun did not happen to cause any collateral damage.

If you have a light bulb that's rated for 110 volts, and you apply 220 volts to it and for some reason the bulb doesn't burn out immediately, that doesn't mean 220 volts is the correct operating environment for that bulb. It means you got lucky.

If there's a bug here, it's that Windows didn't detect that the limit had been exceeded, and respond by locking out the key.

--
Doug Ewell | http://ewellic.org | Thornton, CO ????
From Andrew.Glass at microsoft.com Fri Aug 7 19:11:46 2015
From: Andrew.Glass at microsoft.com (Andrew Glass (WINDOWS))
Date: Sat, 8 Aug 2015 00:11:46 +0000
Subject: Windows keyboard restrictions
In-Reply-To: <20150807152116.665a7a7059d7ee80bb4d670165c8327d.f3e8563f5d.wbe@email03.secureserver.net>
References: <20150807152116.665a7a7059d7ee80bb4d670165c8327d.f3e8563f5d.wbe@email03.secureserver.net>
Message-ID: 

Sorry to be late to this thread. I'm the Program Manager responsible for MSKLC at this time. As far as the history here, I can only reiterate Michael's point that making significant changes to user32.dll faces significant, perhaps insurmountable headwinds. There would have to be compelling reasons to make any kind of changes here. If you have specific feedback for Microsoft on this issue, please follow up with me off line.

Thanks,

Andrew Glass

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Doug Ewell
Sent: Friday, August 7, 2015 3:21 PM
To: Unicode Mailing List
Cc: Marcel Schneider
Subject: Re: Windows keyboard restrictions

Marcel Schneider wrote:

> I brought the good news that SIXTEEN UNICODE CODE POINTS can be
> generated by a single key stroke on Windows six dot one. The only bad
> news, because of which I've e-mailed to the List, is that that wasn't
> working in one single circumstance. It was obvious that the main thing
> to do, is to inform about this fact, so that other people mustn't
> search for a bug in the driver if it's only that.

But that's what I've been trying to say. The maximum isn't 16, it's 4. "That wasn't working" is the expected behavior here.

If you were able to create a keyboard layout where 16 code points ever worked on Windows 7 (which reports itself as "6.1"), it was purely by accident -- because Windows 7 did not check for the overrun, and because the overrun did not happen to cause any collateral damage.
If you have a light bulb that's rated for 110 volts, and you apply 220 volts to it and for some reason the bulb doesn't burn out immediately, that doesn't mean 220 volts is the correct operating environment for that bulb. It means you got lucky.

If there's a bug here, it's that Windows didn't detect that the limit had been exceeded, and respond by locking out the key.

--
Doug Ewell | http://ewellic.org | Thornton, CO ????

From lang.support at gmail.com Sat Aug 8 02:05:26 2015
From: lang.support at gmail.com (Andrew Cunningham)
Date: Sat, 8 Aug 2015 17:05:26 +1000
Subject: Windows keyboard restrictions
In-Reply-To: <20150807195415.4c725da4@JRWUBU2>
References: <20150807092656.665a7a7059d7ee80bb4d670165c8327d.154138f225.wbe@email03.secureserver.net> <20150807195415.4c725da4@JRWUBU2>
Message-ID: 

On Saturday, 8 August 2015, Richard Wordingham <richard.wordingham at ntlworld.com> wrote:

Michael did do a series of blog posts on building TSF-based input methods years ago. Something I tinkered with off and on.

> What we're waiting for is a guide we can follow, or some code we can
> ape. Such should be, or should have been, available in a Tavultesoft
> Keyman rip-off.

I don't believe in rip-offs, especially when there are free versions and the enhanced version doesn't cost much.

But that said, there is KMFL on Linux, which handles a subset of the Keyman definition files. And Keith Stribley, before he died, did a port of the KMFL lib to Windows. But I doubt anyone is maintaining it.

The reality is that the use cases discussed in this and related threads do not need particularly complex or sophisticated layouts. So KMFL and its derivatives should be fine, despite how limited I consider them. Alternatively, there is a range of input frameworks developed in SE Asia that would be easy to work with as well.

Alternative input frameworks have been around for years. It's up to users to adopt them or not. I don't see much point bleating about the limitations of the win32 keyboard model.
Just use an alternative input framework, whether it is TSF table-based input, Keyman, the KMFL port to Windows, or any of the large number of input frameworks that are available out there.

Andrew

--
Andrew Cunningham
Project Manager, Research and Development (Social and Digital Inclusion)
Public Libraries and Community Engagement
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000
Australia

Ph: +61-3-8664-7430
Mobile: 0459 806 589
Email: acunningham at slv.vic.gov.au
lang.support at gmail.com

http://www.openroad.net.au/
http://www.mylanguage.gov.au/
http://www.slv.vic.gov.au/

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From richard.wordingham at ntlworld.com Sat Aug 8 05:05:06 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sat, 8 Aug 2015 11:05:06 +0100
Subject: Windows keyboard restrictions
In-Reply-To: 
References: <20150807092656.665a7a7059d7ee80bb4d670165c8327d.154138f225.wbe@email03.secureserver.net> <20150807195415.4c725da4@JRWUBU2>
Message-ID: <20150808110506.58ec0cc8@JRWUBU2>

On Sat, 8 Aug 2015 17:05:26 +1000
Andrew Cunningham <lang.support at gmail.com> wrote:

> Michael did do a series of blog posts on building TSF based input
> methods years ago. Something I tinkered with off and on.

Does this mean that one can put it all together from his reconstituted blog? I don't know how much was salvaged. Michael has publicly complimented Marc Durdin on being able to find his way through the published Microsoft documentation to make TSF work for him once Microsoft had fixed the bugs he had identified.

> > What we're waiting for is a guide we can follow, or some code we can
> > ape. Such should be, or should have been, available in a
> > Tavultesoft Keyman rip-off.

> I don't believe in rip-offs, especially when there are free versions and the
> enhanced version doesn't cost much.
> But that said, there is KMFL on Linux, which handles a subset of the
> Keyman definition files. And Keith Stribley, before he died, did a
> port of the KMFL lib to Windows. But I doubt anyone is maintaining it.

I was thinking that, at the very least, his package was working code that one could study. While the porting was morally questionable, I'm not aware of any issues with the code obtaining the keyboard input, discovering the current text context or delivering the text changes once derived.

> The reality is that the use cases discussed in this and related
> threads do not need particularly complex or sophisticated layouts. So KMFL
> and its derivatives should be fine, despite how limited I consider them.

Do very recent systems allow ibus input for one's password when logging in? On Ubuntu 12.04 I only see the keyboards defined via X, which only guarantee codepoint-by-codepoint input.

Application compatibility with KMFL has increased, but sophisticated layouts are liable to break. I have seen regressions. For example, when using an XSAMPA-inspired NFC-generating IPA keyboard layout that changes the characters sent (it uses backslash to cycle through sets of characters), rescinding characters has failed and the application has stored both sets of characters. Admittedly, the last time the problem came and went, the setup was a bit complex - I was using Ubuntu as the X client, Windows 7 as the X server, and using the X client to provide the IME. I should be thankful it ever works. I suspect the problem was in the application. Last month Google Docs wasn't working with the same IPA keyboard on Firefox on Ubuntu, though I don't know if it has ever worked - I don't have much occasion to type IPA in Google Docs.

> I don't see much point bleating about the limitations of the win32
> keyboard model. Just use an alternative input framework ..
> whether it is
> TSF table-based input, Keyman, the KMFL port to Windows or any of the
> large number of input frameworks that are available out there.

The interface structure used by the win32 DLL supports arbitrary (well, up to 60 at least) ligature lengths. Therefore it isn't obvious that 4 should be the maximum length, especially as I have seen code around that implies that the maximum length is extended by 3 in 'FE' versions.

4 *characters* isn't an unreasonable limit. However, we are now getting minor scripts in modern use that are encoded in the SMP, and for them the limit drops to 2 characters. They also lose the deadkey capability from MSKLC.

Richard.

From charupdate at orange.fr Sat Aug 8 05:56:40 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Sat, 8 Aug 2015 12:56:40 +0200 (CEST)
Subject: Windows keyboard restrictions
In-Reply-To: <20150807150154.665a7a7059d7ee80bb4d670165c8327d.43590305ac.wbe@email03.secureserver.net>
References: <20150807150154.665a7a7059d7ee80bb4d670165c8327d.43590305ac.wbe@email03.secureserver.net>
Message-ID: <159876547.3316.1439031400773.JavaMail.www@wwinf2232>

On 08 Aug 2015, at 00:18, Doug Ewell wrote:

> Marcel Schneider wrote:
>
> > I just don't want to let the Mailing List believe that I agreed to being
> > classified as "fighting the [bad] fight"
>
> And I don't think Michael implied that. I just want to get the technical
> facts, so that hopefully they can take the place of speculation and
> presumptions of buggy behavior.
>
> Please note that "bug" is not necessarily a bad word -- we developers
> create real bugs all the time, and hopefully own up to them -- but it
> rankles when someone applies the word to software that works as
> designed, when that person either doesn't understand or doesn't agree
> with the intended behavior.

Please excuse me for having referred to the phenomenon as a bug.
I acquired strong habits with that word at the time I was writing down my observations and sending a few of them to the Microsoft Community Answers Forum. Indeed, some concerned user-unfriendly design limitations, like the disabling of Ctrl+A in the Formula Bar of Excel, and others real bugs that have been fixed in the following version of Word, like the disabling of automatic word break when an apostrophe other than U+0027 is included, or the undue application of autocorrect after an automatic word break. But that's in the past. It's just to explain that the banned word was always the first thing that came to my mind. Well, sometimes even when the problem was that I didn't know how to use the software :-|

> > Thus, this blog post is biased with the authority bias.
>
> It's from someone at Microsoft with expert knowledge of the Windows
> keyboard subsystem, if that's what you mean.

Not at all... The problem is the bias, not the authority. I don't take an authority-contesting posture.

Let's take the example of somebody giving a talk with a presentation about keyboards on Windows. Depending on how the topic is labelled, if it's a general outline of the whole keyboard UI, he must speak about all possible modifiers, that is Shift, Ctrl, Alt, Alt+Ctrl (which is AltGr), and Kana, to quote just the easiest to implement. There is also Oyayubi (Right Oyayubi, Left Oyayubi), but that was for Fujitsu terminals and I don't see how it could work on Western keyboards; perhaps it does. But Kana is so obvious it is implementable in KbdEdit. It's very useful, along with its toggle KanaLock (VK_KANA). This is why the Help Glossary of the MSKLC cites the WDK and provides a download link. Then he must speak about chained dead keys, because that's a Windows-supported feature (which you can equally implement using KbdEdit). When it comes to dead keys, he is forbidden to make his audience believe that a dead key is just a combination of two keys, the first of which shows no effect.
He may say this to introduce the topic, but he mustn't end without mentioning that dead keys can be chained on Windows. Imagine the man giving such a talk not in Prague but in Hanoi. Will he wait for the Q&A to mention that Windows allows pressing two dead keys before a letter key to get letters with two diacritics, as they are used in Vietnamese and encoded in Unicode? If he does, it would be wise not to post the PowerPoint on the internet.

> > I'm not quite sure whether people are conscious that by contesting the
> > accuracy of the original actual Windows keyboard driver header file
> > (kbd.h), they are insulting the developer(s) who wrote it, as well as
> > the company that stands behind him/them.
>
> kbd.h contains exactly zero examples of keyboards with ligatures with
> more than 4 code points. I downloaded and installed the whole DDK just
> to find this out, not realizing I already had a copy in my MSKLC folder.

What exactly did you find out? That there are no examples of keyboards with ligatures? That's accurate. In the actual Windows Driver Kit (WDK), there are zero examples of keyboards with ligatures. This point is noteworthy, as it says much about the support Microsoft grants developers of keyboard layouts. Tell me what use that poor sample collection is, leaving you alone to program a ligature table from scratch! Fortunately, I managed that job. But that's not the topic.

Now about what kbd.h contains. It contains a define for a ligature table with two characters, then it contains a define for a ligature table with three characters, then one for a table with four characters, then one for five. Oh what? Yes, for a ligature table containing ligatures of five whole Unicode characters. This define has been quoted in this thread, so there's nothing new. Further, we know that kbd.h is not the only header file of a given keyboard layout. Each driver has its own dedicated header file. To put what in?
Scan code to virtual key undefines and new defines, but also all other needed defines, among which the define of a longer ligature table, which can also be inserted just before the table. With all that, I mean that the developer must look for himself. He is given a number of hints and pieces of advice in the comments, but that's roughly all. And unfortunately it isn't complete. At least not about keyboard drivers.

> > For not wanting to make anybody lose face, I didn't mention that a
> > copy of the cited and quoted header file is included in the MSKLC.
>
> Yep, I could have saved a lot of time if I'd noticed that.

Sorry.

> > The version 1.4 of which dates from Thu, Jan 25, 2007, 23:14:22,
> > whereas the included kbd.h shows "10-Jan-1991 GregoryW" in the file
> > history.
>
> My copy says:
>
> @@BEGIN_DDKSPLIT
> * 10-Jan-1991 GregoryW
> * 23-Apr-1991 IanJa VSC_TO_VK _* macros from oemtab.c
> @@END_DDKSPLIT

That's what my copy says too, but I focused on the author of the biggest part, as IanJa only added the macros from oemtab.c. And on the date, which [MSKLC]\inc\kbd.h is the only file to provide, the History in [WDK]\inc\kbd.h being empty.

> Looks like things have been pretty stable since 1991.
>
> > Therefore, my supposition (I hadn't looked that up!) that "when the
> > MSKLC was built, Windows did not support more than four characters per
> > ligature. (That's the only straightforward explanation of this point
> > of the MSKLC.)" turns out to be completely wrong (except the
> > parenthesized disclaimer). I could become more explicit, but I just
> > stand back in order not to heat up the discussion with ad personam
> > conclusions.
>
> Good idea.

Objectively, we must conclude that there was a decision to limit ligature support to four characters despite Windows being built to support far more, and so on. You know, when people invoke hell when making assertions, I'm quite doubtful.
> What led you to the conclusion that this limit had been increased,
> anyway? ("On my machine the maximal length is 16 characters.") I'm still
> curious about that.

The limit being increased was not a conclusion of mine; it was advice I got on a web page somewhere. There was a conclusion of mine, the history of which can be read in one of my previous replies. In the archive it's all wrecked: http://www.unicode.org/mail-arch/unicode-ml/y2015-m08/0024.html I must resend it. In the meantime, it may be quoted:

>>> Following some advice I programmed a ligature of 35 characters on Sat Feb 28, 2015; thus I had the opportunity of seeing that 16 characters were inserted, while the overflow was discarded. Windows worked normally. The mapped virtual key was VK_OEM_COMMA, and the modification number was 8, that is Shift+AltGr+Kana (which it is still today).

> > I often wondered why the description page
> > [https://msdn.microsoft.com/fr-fr/goglobal/bb964665.aspx] and even
> > more the download page
> > [https://www.microsoft.com/en-us/download/details.aspx?id=22339] have
> > not been updated (no mention of Windows 8 on the former, no mention
> > even of Windows 7 in the system requirements on the download page),
> > and why there's no 2.0 version of the MSKLC.
>
> Microsoft simply hasn't dedicated any resources (Michael or anyone else)
> to updating MSKLC. Michael has blogged about this many, many times in
> the past few years. Big companies make the decisions that they make, for
> the reasons they have.

I'm sorry, I hadn't read Michael's blog posts about not updating the MSKLC. He must be angry. It's a pity for everybody. But I also understand the point of view of the company it depends on. To invest in free software that allows users to become more independent of charmaps and autocorrect and IMEs may be somewhat outside the business model. -- But the main reason may be that the need is already catered for, notably by Tavultesoft Keyman.
However, if the 2.0 MSKLC had stuck with four-character ligatures...

> > I'm not here to criticize Microsoft. I ask everybody to be honest and
> > to answer for himself one single question: How on earth could I prefer
> > Bing if I were battling against Microsoft? Does anybody really believe
> > that I'm annoying myself to find more bugs?
>
> I apologize for my tone in this thread. See my explanation above of when
> "bug" is an appropriate conclusion to draw, and when it isn't. That got
> me started.

It's all right. I apologize again on my behalf.

> > Hopeful that this will end in a constructive way,
>
> Agreed.

Best regards,

Marcel

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From charupdate at orange.fr Sat Aug 8 06:06:58 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Sat, 8 Aug 2015 13:06:58 +0200 (CEST)
Subject: Windows keyboard restrictions
Message-ID: <1793223358.3384.1439032018720.JavaMail.www@wwinf2232>

I suspect another "bug" in my mailbox, such that this mail, which I sent in plain text on Thu, 6 Aug 2015, landed all wrecked in the Archive. Please allow me to resend it. (I have already ended up replacing all < and > with single angle quotation marks.) If you are a Mailing List subscriber, please take no notice :-|

___________________________________________________________________

A part of the documentation you request is available:

"Download Windows Driver Kit Version 7.1.0 from Official Microsoft Download Center." N. p., 1 Dec. 2014. Web. 1 Dec. 2014. https://www.microsoft.com/en-us/download/details.aspx?id=11800

C:\WinDDK\7600.16385.1\inc\api\kbd.h Line 469, and preceding.

---

Note: This file can be accessed online on third-party websites. You may wish to try a Bing* search for "kbd.h".

* My preferred internet search engine is Bing ☐ [This ballot box U+2610 is on my keyboard at Shift+Kana+L.]
---

To circumvent the issues arising from the word "bug", we may simply ban it and focus on a few comments:

 * Ligature is an internal name for Wchar sequences that are
 * generated when a specified key is pressed.
 *
 * The ligatures are all in *one* table.
 * The number of characters of the longest ligature
 * determines the number replacing the "n" in
 *     static ALLOC_SECTION_LDATA LIGATURE"n"
 * and in
 *     "n",
 *     sizeof(aLigature[0]),
 *     (PLIGATURE1)aLigature
 * below in the static ALLOC_SECTION_LDATA KBDTABLES.
 *
 * The maximum length of ligatures is 16 characters.
 * Characters from the 17th on are discarded.
 *
 * The ligature table must be defined for "n" characters,
 * whether in kbd.h, or kbdfrfre.h, or here before,
 * using the following define:
 *     TYPEDEF_LIGATURE("n")
 * For clarification, a trailing comment is added:
 *     // LIGATURE"n", *PLIGATURE"n"
 * Tables for up to 5 characters length are already defined in
 * C:\WinDDK\7600.16385.1\inc\api\kbd.h.
 *
 * The remaining Wchar fields of each ligature that is shorter than
 * the maximum length defined may be filled up with 0xF000, or with
 * WCH_NONE as defined in kbd.h, or NONE if defined in the custom header.
 * These entries may be shortened, especially when the ligature table
 * is not edited in a single spreadsheet table.

What's new for me is that "sometimes" [scare quotes], the ligature length must not exceed four characters. I already knew what's written in the MSKLC Help about this topic, and I explained in my previous e-mail that, when the MSKLC was built, Windows did not support more than four characters per ligature. (That's the only straightforward explanation of this point of the MSKLC.) As this proved to be insufficient, Microsoft must have decided to raise the limit to sixteen.

Following some advice I programmed a ligature of 35 characters on Sat Feb 28, 2015; thus I had the opportunity of seeing that 16 characters were inserted, while the overflow was discarded. Windows worked normally.
The mapped virtual key was VK_OEM_COMMA, and the modification number was 8, that is Shift+AltGr+Kana (which it is still today).

After the recent event, we may add the following:

 * CAUTION: THE INITIAL MAXIMUM LENGTH OF LIGATURES WAS 4 CHARACTERS.
 * DEPENDING ON THE SHIFT STATE, IT MAY HAPPEN THAT LONGER LIGATURES
 * CAUSE THE KEYBOARD DRIVER TO FAIL.
 * IF THE FAILING DRIVER IS THE ONE OF THE DEFAULT KEYBOARD LAYOUT,
 * THE SYSTEM MAY NOT WORK AS EXPECTED.

I'm hopeful that you will agree upon this formulation, and I hope that helps.

Best regards,

Marcel Schneider

On 03 Aug 2015, at 20:02, Doug Ewell wrote:

> Marcel Schneider wrote:
>
> > The bug on Windows I encountered at the end of July has been
> > definitely identified and reconstructed. After ninety-five drivers
> > compiled since the bug appeared, I can tell so much as that the
> > problem is related to the length of the so-called ligatures. When the
> > MSKLC was built, they were limited to four characters on Windows (see
> > glossary in the MSKLC Help). On my machine the maximal length is 16
> > characters. The problem is that this is not equal on all shift states
> > and perhaps keys. Roughly, I can put five characters on modification
> > number three, that is normally AltGr, but not on #4 (Shift+AltGr).
>
> As far as I can tell, the limit for a ligature on a Windows keyboard
> layout is four UTF-16 code points:
>
> MSKLC help, under "Validation Reference":
> "Ligatures cannot contain more than four UTF-16 code points"
>
> Presentation from IUC 23 by Michael Kaplan (author of MSKLC) and Cathy
> Wissink:
> http://tinyurl.com/o49r4bz
>
> KbdEdit:
> http://www.kbdedit.com/
>
> MUFI:
> http://folk.uib.no/hnooh/mufi/keyboards/WinXPkeyboard.html
>
> I understand that there are some tools (such as Keyboard Layout Manager)
> that claim a higher limit, and it may even be possible in some cases to
> assign more than four, but the DOCUMENTED limit appears to be four.
> (If
> you claim that it is not, please provide a link to the relevant official
> documentation, and note that C++ code showing 16 fields is not
> documentation.)
>
> It is not a bug for software to fail to perform BEYOND its documented
> limits.
>
> Since you are so very eager to declare this a bug, or a collection of
> bugs, rather than a design limitation, I strongly recommend you get in
> touch with Microsoft Technical Support and express your concerns to
> them. Make sure to let them know just how certain you are that these are
> bugs. See if they'll send you a T-shirt.
>
> --
> Doug Ewell | http://ewellic.org | Thornton, CO ????

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From charupdate at orange.fr Sat Aug 8 07:05:17 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Sat, 8 Aug 2015 14:05:17 +0200 (CEST)
Subject: Windows keyboard restrictions
In-Reply-To: 
References: <20150807152116.665a7a7059d7ee80bb4d670165c8327d.f3e8563f5d.wbe@email03.secureserver.net>
Message-ID: <1149930762.7711.1439035518032.JavaMail.www@wwinf1j11>

On 08 Aug 2015, at 02:19, Andrew Glass (WINDOWS) wrote:

> Sorry to be late to this thread. I'm the Program Manager responsible for MSKLC at this time. As far as the history here, I can only reiterate Michael's point that making significant changes to user32.dll faces significant, perhaps insurmountable headwinds. There would have to be compelling reasons to make any kind of changes here. If you have specific feedback for Microsoft on this issue, please follow up with me off line.

Thank you. While *one* dimension of this thread is to get minor changes made in order to ensure ligature support for 16 characters uniformly in Windows keyboard drivers, the main concern at this point of the thread is to learn something about the actual support, as well as the support at the time of MSKLC:

1. On Windows, up to how many characters may be inserted with one single keystroke:

1.1.
At the time of MSKLC 1.0? 1.2. When MSKLC was updated from version 1.3 to 1.4? 1.3. At the time of Windows Seven, that is 6.1, Build 7601 (SP1)? 1.4. Today, that is on Windows 10? It is supposed that a keyboard driver is used in whose source a ligature table is defined for whatever number of characters (2, 3, 4, 5, 6, ... 16, ... 32, ... 60, ... 100, ...). 2. Supposed that Windows supported more than four characters per ligature: 2.1. Why has the MSKLC been limited to four characters per ligature? 2.2. Who or what body made the demand of the limitation to four characters? 2.3. Why does the MSKLC Help state (Glossary - Ligature) that the maximum number supported by Windows is four characters? 2.4. How Microsoft dealt with user demands for support of longer ligatures? Best regards, Marcel Schneider -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Sat Aug 8 07:51:40 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 8 Aug 2015 13:51:40 +0100 Subject: Windows keyboard restrictions In-Reply-To: <1149930762.7711.1439035518032.JavaMail.www@wwinf1j11> References: <20150807152116.665a7a7059d7ee80bb4d670165c8327d.f3e8563f5d.wbe@email03.secureserver.net> <1149930762.7711.1439035518032.JavaMail.www@wwinf1j11> Message-ID: <20150808135140.4d0dc8d7@JRWUBU2> On Sat, 8 Aug 2015 14:05:17 +0200 (CEST) Marcel Schneider wrote: > 2. Supposed that Windows supported more than four characters per > ligature: > 2.1. Why has the MSKLC been limited to four characters per > ligature? Because that was believed to be the architectural limit. Note however, that it isn't 4 *characters* that is the limit, but 4 UTF-16 code units. Richard. 
From charupdate at orange.fr Sat Aug 8 08:26:31 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Sat, 8 Aug 2015 15:26:31 +0200 (CEST) Subject: Windows keyboard restrictions In-Reply-To: <20150807152116.665a7a7059d7ee80bb4d670165c8327d.f3e8563f5d.wbe@email03.secureserver.net> References: <20150807152116.665a7a7059d7ee80bb4d670165c8327d.f3e8563f5d.wbe@email03.secureserver.net> Message-ID: <971687439.5732.1439040391883.JavaMail.www@wwinf2202> On 08 Aug 2015, at 00:30, Doug Ewell wrote: > Marcel Schneider wrote: > > > I brought the good news that SIXTEEN UNICODE CODE POINTS can be > > generated by a single key stroke on Windows six dot one. The only bad > > news, because of which I've e-mailed to the List, is that that wasn't > > working in one single circumstance. It was obvious that the main thing > > to do, is to inform about this fact, so that other people mustn't > > search for a bug in the driver if it's only that. > > But that's what I've been trying to say. The maximum isn't 16, it's 4. > "That wasn't working" is the expected behavior here. > > If you were able to create a keyboard layout where 16 code points ever > worked on Windows 7 (which reports itself as "6.1"), Indeed I didn't check on Wikipedia that Windows 7 has the same version number, build number and service pack as Windows 7 Starter which was delivered with the netbooks. So my Windows is the true Windows 7 except some limitations, but *not* the one that one couldn't open more than three applications simultaneously. This limitation we've been saved from, appears to me as a paradigm of all limitations of that sort: a useless worsening of the usability and of the usefulness of a product. There is a use but it's economical, to allow manufacturers to buy the OS a bit cheaper, with respect to the overall price of netbooks. Now, question: What's the advantage of being limited to four characters, even if you're a professional and corporate user? Or a scholar? 
> it was purely by accident -- because Windows 7 did not check for the overrun, > and because the overrun did not happen to cause any collateral damage. Windows *did* check for the overflow! This is why *sixteen* characters *only* were inserted, *not* thirty-five. > > If you have a light bulb that's rated for 110 volts, and you apply 220 > volts to it and for some reason the bulb doesn't burn out immediately, > that doesn't mean 220 volts is the correct operating environment for > that bulb. It means you got lucky. That seems a good reasoning. I'm just not quite sure whether limiting ligatures to four instead of sixteen may be compared to electrotechnics. > > If there's a bug here, it's that Windows didn't detect that the limit > had been exceeded, and respond by locking out the key. Again, Windows did detect that the ligature was far too long, and consequently limited it to 16. And it did so *without* any collateral damage: no app blocked, no keyboard disabled, just a handful of characters not inserted while they were programmed. That's not worth mentioning except for the case study. Sixteen on one single keystroke is IMHO largely enough. But four is *not*. That is what Microsoft knows, and that is why Microsoft asked its Windows developers to raise the limit, IMHO. Best regards, Marcel From charupdate at orange.fr Sat Aug 8 09:22:57 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Sat, 8 Aug 2015 16:22:57 +0200 (CEST) Subject: Windows keyboard restrictions In-Reply-To: <20150808135140.4d0dc8d7@JRWUBU2> References: <20150807152116.665a7a7059d7ee80bb4d670165c8327d.f3e8563f5d.wbe@email03.secureserver.net> <1149930762.7711.1439035518032.JavaMail.www@wwinf1j11> <20150808135140.4d0dc8d7@JRWUBU2> Message-ID: <2027666553.10656.1439043777467.JavaMail.www@wwinf1j19> On 08 Aug 2015, at 15:01, Richard Wordingham wrote: > On Sat, 8 Aug 2015 14:05:17 +0200 (CEST) > Marcel Schneider wrote: > > > 2. 
Supposed that Windows supported more than four characters per > ligature: > > 2.1. Why has the MSKLC been limited to four characters per > ligature? > > Because that was believed to be the architectural limit. Note however, > that it isn't 4 *characters* that is the limit, but 4 UTF-16 code units. I'm very puzzled about this being UTF-16 code units, as stated also in the MSKLC Help. In the driver source kbd*****.c, each of those entities is referred to as WCHAR, which is meant to mean (^^) "UNICODE CHARACTER". Indeed we can write 0x1234 for a given WCHAR in a driver source, but also 0x101234 for another given WCHAR if that's its code point. Nowhere is there any UTF appearing. Despite having looked up the Unicode FAQs about Unicode transformation formats, I'm unable to make the link. > Because that was believed to be the architectural limit. I'm urged not to speculate, and to stick with facts and documentation. Now look here an authoritative expert who is reduced to stand upon his beliefs about keyboard limitations while working as an employee of the company where the same keyboard limitations were designed, implemented, compiled, released and shipped from-----or NOT. At his place I would have asked my boss for access to the Windows keyboard layout framework source files and development roadmaps. Turning it the other way round, Microsoft must not ask somebody to write some keyboard creating software without granting him full access to all documentation. At least that's my opinion. BTW, I would like to have everybody note that a Help section of another software is *not* documentation (with the meaning the word has in this thread). Nor is a PowerPoint. Nor are third party keyboard software websites. Nor is anything that is not a comment in a source file, or a technical document issued by the department that really worked out the discussed software; or a code line, because in my belief, this is strong evidence. Thank you for your comment.
Further, we're awaiting the responses from Mr Glass at Microsoft. Best regards, Marcel Schneider -------------- next part -------------- An HTML attachment was scrubbed... URL: From eliz at gnu.org Sat Aug 8 09:31:30 2015 From: eliz at gnu.org (Eli Zaretskii) Date: Sat, 08 Aug 2015 17:31:30 +0300 Subject: Windows keyboard restrictions In-Reply-To: <2027666553.10656.1439043777467.JavaMail.www@wwinf1j19> References: <20150807152116.665a7a7059d7ee80bb4d670165c8327d.f3e8563f5d.wbe@email03.secureserver.net> <1149930762.7711.1439035518032.JavaMail.www@wwinf1j11> <20150808135140.4d0dc8d7@JRWUBU2> <2027666553.10656.1439043777467.JavaMail.www@wwinf1j19> Message-ID: <834mk9rgvx.fsf@gnu.org> > Date: Sat, 8 Aug 2015 16:22:57 +0200 (CEST) > From: Marcel Schneider > > I'm very puzzled about this being UTF-16 code units, as stated also in the > MSKLC Help. In the driver source kbd*****.c, each of those entities is referred > to as WCHAR, which is meant to mean (^^) "UNICODE CHARACTER". Indeed we can > write 0x1234 for a given WCHAR in a driver source, but also 0x101234 for > another given WCHAR if that's its code point. Nowhere there is any UTF > appearing. Despite of having looked up the Unicode FAQs about Unicode > transformation formats, I'm unable to make the link. The Windows WCHAR is a 16-bit data type. What Windows documentation calls "Unicode characters" are Unicode codepoints encoded in UTF-16.
From charupdate at orange.fr Sat Aug 8 10:10:58 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Sat, 8 Aug 2015 17:10:58 +0200 (CEST) Subject: Windows keyboard restrictions In-Reply-To: <834mk9rgvx.fsf@gnu.org> References: <20150807152116.665a7a7059d7ee80bb4d670165c8327d.f3e8563f5d.wbe@email03.secureserver.net> <1149930762.7711.1439035518032.JavaMail.www@wwinf1j11> <20150808135140.4d0dc8d7@JRWUBU2> <2027666553.10656.1439043777467.JavaMail.www@wwinf1j19> <834mk9rgvx.fsf@gnu.org> Message-ID: <646624248.11997.1439046658425.JavaMail.www@wwinf1m26> On 08 Aug 2015, at 16:39, Eli Zaretskii wrote: > > The Windows WCHAR is a 16-bit data type. What Windows documentation > calls "Unicode characters" are Unicode codepoints encoded in UTF-16. > Thanks a lot! Best regards, Marcel Schneider -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Sat Aug 8 10:44:55 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 8 Aug 2015 16:44:55 +0100 Subject: Windows keyboard restrictions In-Reply-To: <2027666553.10656.1439043777467.JavaMail.www@wwinf1j19> References: <20150807152116.665a7a7059d7ee80bb4d670165c8327d.f3e8563f5d.wbe@email03.secureserver.net> <1149930762.7711.1439035518032.JavaMail.www@wwinf1j11> <20150808135140.4d0dc8d7@JRWUBU2> <2027666553.10656.1439043777467.JavaMail.www@wwinf1j19> Message-ID: <20150808164455.7b244c50@JRWUBU2> On Sat, 8 Aug 2015 16:22:57 +0200 (CEST) Marcel Schneider wrote: > Further, we're awaiting the responses from Mr Glass at Microsoft. See http://unicode.org/pipermail/unicode/2015-August/002465.html . More information would take time. Richard.
From doug at ewellic.org Sat Aug 8 12:36:02 2015 From: doug at ewellic.org (Doug Ewell) Date: Sat, 8 Aug 2015 11:36:02 -0600 Subject: Windows keyboard restrictions In-Reply-To: <1149930762.7711.1439035518032.JavaMail.www@wwinf1j11> References: <20150807152116.665a7a7059d7ee80bb4d670165c8327d.f3e8563f5d.wbe@email03.secureserver.net> <1149930762.7711.1439035518032.JavaMail.www@wwinf1j11> Message-ID: Now that I know Andrew is the PM for MSKLC ?, and can answer Marcel's questions (publicly or privately) with authority, I'll duck out of this thread. ? I'm glad to hear that there is such a person. I was afraid the project had been left to die. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From: Marcel Schneider Sent: Saturday, August 8, 2015 6:05 To: Andrew Glass (WINDOWS) Cc: Doug Ewell ; Unicode Mailing List Subject: RE: Windows keyboard restrictions On 08 Aug 2015, at 02:19, Andrew Glass (WINDOWS) wrote: > Sorry to be late to this thread. I'm the Program Manager responsible > for MSKLC at this time. As far as the history here, I can only > reiterate Michael's point that making significant changes to > user32.dll faces significant, perhaps insurmountable headwinds. There > would have to be compelling reasons to make any kind of changes here. > If you have specific feedback for Microsoft on this issue, please > follow up with me off line. Thank you. While *one* dimension of this thread is to get minor changes performed in order to asset ligatures support for 16 characters uniformly in Windows keyboard drivers, the main concern at the actual point of the thread is to know something about the actual support as well as at the time of MSKLC: 1. On Windows, up to how many characters may be inserted with one single key stroke: 1.1. At the time of MSKLC 1.0? 1.2. When MSKLC was updated from version 1.3 to 1.4? 1.3. At the time of Windows Seven, that is 6.1, Build 7601 (SP1)? 1.4. Today, that is on Windows 10? 
It is supposed that a keyboard driver is used in whose source a ligature table is defined for whatever number of characters (2, 3, 4, 5, 6, ... 16, ... 32, ... 60, ... 100, ...). 2. Supposed that Windows supported more than four characters per ligature: 2.1. Why has the MSKLC been limited to four characters per ligature? 2.2. Who or what body made the demand of the limitation to four characters? 2.3. Why does the MSKLC Help state (Glossary - Ligature) that the maximum number supported by Windows is four characters? 2.4. How Microsoft dealt with user demands for support of longer ligatures? Best regards, Marcel Schneider From asmus-inc at ix.netcom.com Sat Aug 8 14:09:12 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Sat, 8 Aug 2015 12:09:12 -0700 Subject: Windows keyboard restrictions In-Reply-To: <971687439.5732.1439040391883.JavaMail.www@wwinf2202> References: <20150807152116.665a7a7059d7ee80bb4d670165c8327d.f3e8563f5d.wbe@email03.secureserver.net> <971687439.5732.1439040391883.JavaMail.www@wwinf2202> Message-ID: <55C653D8.4010007@ix.netcom.com> An HTML attachment was scrubbed... URL: From marc at keyman.com Sat Aug 8 15:46:14 2015 From: marc at keyman.com (Marc Durdin) Date: Sat, 8 Aug 2015 20:46:14 +0000 Subject: Windows keyboard restrictions In-Reply-To: <20150808110506.58ec0cc8@JRWUBU2> References: <20150807092656.665a7a7059d7ee80bb4d670165c8327d.154138f225.wbe@email03.secureserver.net> <20150807195415.4c725da4@JRWUBU2> <20150808110506.58ec0cc8@JRWUBU2> Message-ID: <1CEDD746887FFF4B834688E7AF5FDA5A821B2135@federation.tavultesoft.local> Richard Wordingham wrote: > On Sat, 8 Aug 2015 17:05:26 +1000 > Andrew Cunningham wrote: > > > > Michael did do a series of blog posts on building TSF based input > > methods years ago. Something I tinkered with off and on. > > Does this mean that one can put it all together from his reconstituted blog? I > don't know how much was salvaged. 
Michael has publicly complimented Marc > Durdin on being able to find his way through the published Microsoft > documentation to make TSF work for him once Microsoft had fixed the bugs he > had identified. The TSF documentation is definitely sparse and there are some real challenges in getting up to speed with what is a very complex API. But the real challenges in input method implementation come more from the vast array of interpretations of how keyboard input should be consumed by Windows apps, and the extremely patchy support for TSF (and "foreign" language input) in apps. That's where I've killed thousands of hours over the last 15 years; once you get your head around the TSF model it's not too hard to code to. Clearly, you've seen some of the same compatibility problems with KMFL. And our experiences on Mac OS X, Android and iOS are much the same. For example, Norbert Lindenberg's excellent blog on developing keyboards for iOS details much that is missing from the API docs: http://norbertlindenberg.com/2014/12/developing-keyboards-for-ios/ There is a massive cost to developing -- and maintaining -- a native code input method for each language and each OS. I'm really trying to minimize this cost with Keyman Developer 9. Keyman Developer 9 is a free product (http://keyman.com/developer/). It is currently in beta but is relatively stable. From charupdate at orange.fr Sat Aug 8 15:57:25 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Sat, 8 Aug 2015 22:57:25 +0200 (CEST) Subject: Windows keyboard restrictions In-Reply-To: References: <20150807152116.665a7a7059d7ee80bb4d670165c8327d.f3e8563f5d.wbe@email03.secureserver.net> <1149930762.7711.1439035518032.JavaMail.www@wwinf1j11> Message-ID: <2105459356.15150.1439067445761.JavaMail.www@wwinf1k29> On 08 Aug 2015, at 19:45, Doug Ewell wrote: > Now that I know Andrew is the PM for MSKLC ?, Probably Mr Glass wasn't Mr Kaplan's boss, so he has to take over a legacy without having been involved in creating it.
I didn't notice the chronological relationship well, so I asked questions whose answers could require searching the archives.--- > and can answer Marcel's questions (publicly or privately) with authority, Mr Kaplan too is authoritative. The difference might be that Mr Glass actually has access to all the needed documentation. > I'll duck out of this thread. You're not supposed to. But in any case, I would like to thank you for all you brought into this thread. It has been very enriching and brought some insight I wouldn't have got. > > ? I'm glad to hear that there is such a person. I was afraid the project > had been left to die. Indeed there seems to be like a malediction upon the MSKLC. The uppermost problem now is that reputations are linked to the low limit on ligature length. Supposed the low limit is untrue, then Mr Glass can hardly answer these questions publicly. If he agrees to do so privately, I'll be bound by a secret and will be hindered in providing help for my eventual future keyboard drivers. I don't know how to get out of trouble. If I write on a web page that we can have up to 16 UTF-16 code units per ligature, there can always be somebody starting up who's telling that's wrong and my drivers were a hack. Probably I do end up wishing there would never have been an MSKLC. At least we might think that possibly there is no update because 2.0 would have stuck with that low limit. We hope there will be enough solutions for all Unicode implementations. Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed...
URL: From asmus-inc at ix.netcom.com Sat Aug 8 16:07:08 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Sat, 8 Aug 2015 14:07:08 -0700 Subject: Windows keyboard restrictions In-Reply-To: <2105459356.15150.1439067445761.JavaMail.www@wwinf1k29> References: <20150807152116.665a7a7059d7ee80bb4d670165c8327d.f3e8563f5d.wbe@email03.secureserver.net> <1149930762.7711.1439035518032.JavaMail.www@wwinf1j11> <2105459356.15150.1439067445761.JavaMail.www@wwinf1k29> Message-ID: <55C66F7C.6080908@ix.netcom.com> An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Sun Aug 9 05:58:20 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 9 Aug 2015 11:58:20 +0100 Subject: ZWJ as a Ligature Suppressor Message-ID: <20150809115820.67e0eead@JRWUBU2> According to the text just after TUS 7.0.0 Figure 23-3 (http://www.unicode.org/versions/Unicode7.0.0/ch23.pdf#G25237), ZWJ suppresses ligatures in Arabic script. Does this rule apply to other normally cursive joined scripts, e.g. Syriac and Mongolian? Am I right in thinking that for an OpenType font for other scripts, the font writer must take precautions to prevent ZWJ accidentally suppressing ligatures that would be better suppressed by ZWNJ or ? From richard.wordingham at ntlworld.com Sun Aug 9 06:09:19 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 9 Aug 2015 12:09:19 +0100 Subject: Standardised Encoding of Text Message-ID: <20150809120919.1adacf7c@JRWUBU2> Is there any mechanism to standardise the encoding of text that is composed of encoded characters that are all from a specific script or the common script? Richard. 
From eik at iki.fi Sun Aug 9 06:46:31 2015 From: eik at iki.fi (Erkki I Kolehmainen) Date: Sun, 9 Aug 2015 14:46:31 +0300 Subject: Standardised Encoding of Text In-Reply-To: <20150809120919.1adacf7c@JRWUBU2> References: <20150809120919.1adacf7c@JRWUBU2> Message-ID: <000001d0d299$0929fbc0$1b7df340$@fi> Sorry, but I find myself having a serious problem in understanding what this is about. Sincerely, Erkki I. Kolehmainen Tilkankatu 12 A 3, 00300 Helsinki, Finland Mob: +358400825943, Tel: +358943682643, Fax: +35813318116 -----Original message----- From: Unicode [mailto:unicode-bounces at unicode.org] On behalf of Richard Wordingham Sent: 9 August 2015 14:09 To: unicode at unicode.org Subject: Standardised Encoding of Text Is there any mechanism to standardise the encoding of text that is composed of encoded characters that are all from a specific script or the common script? Richard. From richard.wordingham at ntlworld.com Sun Aug 9 08:58:11 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 9 Aug 2015 14:58:11 +0100 Subject: Standardised Encoding of Text In-Reply-To: <000001d0d299$0929fbc0$1b7df340$@fi> References: <20150809120919.1adacf7c@JRWUBU2> <000001d0d299$0929fbc0$1b7df340$@fi> Message-ID: <20150809145811.3d1296d2@JRWUBU2> On Sun, 9 Aug 2015 14:46:31 +0300 "Erkki I Kolehmainen" wrote: > Sorry, but I find myself having a serious problem in understanding > what this is about. In some cases the TUS lays down in detail the order of characters and their interpretation. While Europeans have canonical combining classes to standardise the order of combining marks, lesser breeds tend not to receive them. It gets even worse when combining marks are defined by the combination of control character(s) and what appears to be a base character. For example, the order for the Khmer script was laid down in great detail. Similarly, the order for Burmese was laid out in great detail.
However, as support for other languages was added to the 'Myanmar' script, the ordering rules to cover the new characters were not promptly laid down. So the question is, how does one rectify the situation where the text in the Unicode Standard for a script is woefully inadequate? Richard. From mark at macchiato.com Sun Aug 9 10:10:01 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Sun, 9 Aug 2015 17:10:01 +0200 Subject: Standardised Encoding of Text In-Reply-To: <20150809145811.3d1296d2@JRWUBU2> References: <20150809120919.1adacf7c@JRWUBU2> <000001d0d299$0929fbc0$1b7df340$@fi> <20150809145811.3d1296d2@JRWUBU2> Message-ID: While it would be good to document more scripts, and more language options per script, that is always subject to getting experts signed up to develop them. What I'd really like to see instead of documentation is a data-based approach. For example, perhaps the addition of real data to CLDR for a "basic-validity-check" on a language-by-language basis. It might be possible to use a BNF grammar for the components, for which we are already set up. For example, something like (this was a quick and dirty transcription):

$word := $syllable+;
$syllable := $B [R C] (S R?)* (Z? V)? $O? $S?;
# UnicodeSets
$R := [\u17CC];
$C := [];
$S := [];
$V := []
$Z := [:joiner:]
$O := [...]
$B := [[:sc=khmer:]&[:L:]-$R-$C-$S-$V-$Z-$O]

The more these could use existing properties, like Indic_Positional_Category or IndicSyllabicCategory, the better. Doing this would have far more of an impact than just a textual description, in that it could be executed by code, for at least a reference implementation. Mark *« Il meglio è l'inimico del bene »* On Sun, Aug 9, 2015 at 3:58 PM, Richard Wordingham < richard.wordingham at ntlworld.com> wrote: > On Sun, 9 Aug 2015 14:46:31 +0300 > "Erkki I Kolehmainen" wrote: > > > Sorry, but I find myself having a serious problem in understanding > > what this is about.
> > In some cases the TUS lays down in detail the order of characters and > their interpretation. While Europeans have canonical combining classes > to standardise the order of combining marks, lesser breeds tend not to > receive them. It gets even worse when combining marks are defined by > the combination of control character(s) and what appears to be a base > character. For example, the order for the Khmer script was laid > down in great detail. Similarly, the order for Burmese was laid out in > great detail. However, as support for other languages was added to > the 'Myanmar' script, the ordering rules to cover the new characters > were not promptly laid down. > > So the question is, how does one rectify the situation where the text > in the Unicode Standard for a script is woefully inadequate. > > Richard. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Sun Aug 9 12:10:14 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 9 Aug 2015 18:10:14 +0100 Subject: Standardised Encoding of Text In-Reply-To: References: <20150809120919.1adacf7c@JRWUBU2> <000001d0d299$0929fbc0$1b7df340$@fi> <20150809145811.3d1296d2@JRWUBU2> Message-ID: <20150809181014.232da747@JRWUBU2> On Sun, 9 Aug 2015 17:10:01 +0200 Mark Davis ?? wrote: > While it would be good to document more scripts, and more language > options per script, that is always subject to getting experts signed > up to develop them. > > What I'd really like to see instead of documentation is a data-based > approach. > > For example, perhaps the addition of real data to CLDR for a > "basic-validity-check" on a language-by-language basis. CLDR is currently not useful. Are you really going to get Mayan time formats when the script is encoded? Without them, there will be no CLDR data. 
I would like to add data to a Pali in Thai script locale (or two - there are two Thai-script Pali writing systems, one with an implicit vowel and another without) to get proper word- and line-breaking. However, I'm stymied because the basic requirements for a locale are beyond me. It's telling that, the last time I looked, there was no Latin locale. I don't know the usage of the administration of the Church of Rome, which appears to be what CLDR wants for Latin. (My first degree was conferred in Latin, and it wasn't conferred in Rome.) Fortunately, one doesn't need that for a Latin spell-checker, and the default word- and line-breaking work well enough. Until someone sets up locale data for Tai Khuen (or Tai Lue), we probably won't have a locale to store Lanna script rules with. > It might be > possible to use a BNF grammar for the components, for which we are > already set up. Are you sure? Microsoft's Universal Script Engine (USE) intended design has a rule for well-formed syllables which essentially contains a fragment, when just looking at dependent vowels: [:InPC=Top:]*[:InPC=Bottom:]* Are you set up to say whether the following NFD Tibetan fragment conforms to it? Example: The sequence of InPC values is . There are other examples around, but this is a pleasant one to think about. (The USE definition got more complicated when confronted with harsh reality. That confrontation may have happened very early in the design.) > For example, something like (this was a quick and > dirty transcription): > > $word := $syllable+; Martin Hosken put something like that together for the Lanna script. On careful inspection: (a) It seemed to allow almost anything; (b) It was not too lax. Much later, I have realised that (c) It was too strict if read as it was meant to be read, i.e. not literally. (d) It overlooked a logogram for 'elephant' that contains a marginally dependent vowel.
Though it might indeed be useful in general, the formal description would need to be accompanied by an explanation of what was happening. The problem with the Lanna script is that it allows a lot of abbreviation, and it makes sense to store the undeleted characters in their normal order. The result of this is that one often can't say a sequence is non-standard unless you know roughly how to pronounce it. > Doing this would have far more of an impact than just a textual > description, in that it could executed by code, for at least a > reference implementation. I don't like the idea of associating the description with language rather than script. Imagine the trouble you'll have with Tamil purists. They'll probably want to ban several consonants. You'll end up needing a locale for Sanskrit in the Tamil script. Richard. From richard.wordingham at ntlworld.com Sun Aug 9 13:38:45 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 9 Aug 2015 19:38:45 +0100 Subject: Standardised Encoding of Text In-Reply-To: References: <20150809120919.1adacf7c@JRWUBU2> <000001d0d299$0929fbc0$1b7df340$@fi> <20150809145811.3d1296d2@JRWUBU2> Message-ID: <20150809193845.5931b6cf@JRWUBU2> On Sun, 9 Aug 2015 17:10:01 +0200 Mark Davis wrote: > While it would be good to document more scripts, and more language > options per script, that is always subject to getting experts signed > up to develop them. > > What I'd really like to see instead of documentation is a data-based > approach. > > For example, perhaps the addition of real data to CLDR for a > "basic-validity-check" on a language-by-language basis. One aspect this would not help with is with letter forms that do not resemble their forms in the code charts. The code charts usually broadly answer the question "What does this code represent?". They don't answer the question, "What code points represent this glyph?". 
Problems I've seen in Tai Tham are the use of U+1A57 TAI THAM CONSONANT SIGN LA TANG LAI for the sequence and of for . The problem is that the subscript forms for U+1A43 and U+1A3F are only documented in the proposals. The subscript consonant signs probably add to the confusion of anyone working from the code chart. The people making the errors were far from ignorant of the script. Richard. From mark at macchiato.com Sun Aug 9 14:14:38 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Sun, 9 Aug 2015 21:14:38 +0200 Subject: Standardised Encoding of Text In-Reply-To: <20150809181014.232da747@JRWUBU2> References: <20150809120919.1adacf7c@JRWUBU2> <000001d0d299$0929fbc0$1b7df340$@fi> <20150809145811.3d1296d2@JRWUBU2> <20150809181014.232da747@JRWUBU2> Message-ID: Mark *« Il meglio è l'inimico del bene »* On Sun, Aug 9, 2015 at 7:10 PM, Richard Wordingham < richard.wordingham at ntlworld.com> wrote: > On Sun, 9 Aug 2015 17:10:01 +0200 > Mark Davis ☕️ wrote: > > > While it would be good to document more scripts, and more language > > options per script, that is always subject to getting experts signed > > up to develop them. > > > > What I'd really like to see instead of documentation is a data-based > > approach. > > > > For example, perhaps the addition of real data to CLDR for a > > "basic-validity-check" on a language-by-language basis. > > CLDR is currently not useful. Are you really going to get Mayan time > formats when the script is encoded? Without them, there will be no CLDR > data. That is a misunderstanding. CLDR provides both locale (language) specific data for formatting, collation, etc., but also data about languages. It is not limited to the first. > > It might be > > possible to use a BNF grammar for the components, for which we are > > already set up. > > Are you sure? I said "might be possible". That normally indicates a degree of uncertainty. That is, "no, I'm not sure".
There is no reason to be unnecessarily argumentative; it doesn't exactly encourage people to explore solutions to a problem. > Microsoft's Universal Script Engine (USE) intended design > has a rule for well-formed syllables which essentially contains a > fragment, when just looking at dependent vowels: > > [:InPC=Top:]*[:InPC=Bottom:]* > > Are you set up to say whether the following NFD Tibetan fragment > conforms to it? > > Example: > > The sequence of InPC values is . There are other examples > around, but this is a pleasant one to think about. > > (The USE definition got more complicated when confronted with harsh > reality. That confrontation may have happened very early in the > design.) > > > For example, something like (this was a quick and > > dirty transcription): > > > > $word := $syllable+; > > > Martin Hosken put something like that together for the Lanna script. > On careful inspection: > > (a) It seemed to allow almost anything; > (b) It was not too lax. > > Much later, I have realised that > > (c) It was too strict if read as it was meant to be read, i.e. not > literally. > (d) It overlooked a logogram for 'elephant' that contains a marginally > dependent vowel. > > Though it might indeed be useful in general, the formal description > would need to be accompanied by an explanation of what was happening. > The problem with the Lanna script is that it allows a lot of > abbreviation, and it makes sense to store the undeleted characters in > their normal order. The result of this is that one often can't say a > sequence is non-standard unless you know roughly how to pronounce it. > I don't think any algorithmic description would get all and only those strings that would be acceptable to writers of the language. What you'd end up with is a mechanism that had three values: clearly ok (eg, cat), clearly bogus (eg, a\u0308\u0308\u0308\u0308), and somewhere in between.
> > Doing this would have far more of an impact than just a textual > > description, in that it could be executed by code, for at least a > > reference implementation. > > I don't like the idea of associating the description with language > rather than script. Imagine the trouble you'll have with Tamil > purists. They'll probably want to ban several consonants. You'll end > up needing a locale for Sanskrit in the Tamil script. > Someone was just saying "However, as support for other languages was added to the 'Myanmar' script, the ordering rules to cover the new characters were not promptly laid down." If the goal for the script rules is to cover all languages customarily written with that script, one way to do that is to develop the language rules as they come, and make sure that the script rules are broadened if necessary for each language. But there is also utility to having the language rules, especially for high-frequency languages. > > Richard. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Sun Aug 9 16:03:37 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 9 Aug 2015 22:03:37 +0100 Subject: Standardised Encoding of Text In-Reply-To: References: <20150809120919.1adacf7c@JRWUBU2> <000001d0d299$0929fbc0$1b7df340$@fi> <20150809145811.3d1296d2@JRWUBU2> <20150809181014.232da747@JRWUBU2> Message-ID: <20150809220337.147e0d72@JRWUBU2> On Sun, 9 Aug 2015 21:14:38 +0200 Mark Davis ☕️ wrote: > Mark > > *« Il meglio è l'inimico del bene »* > > On Sun, Aug 9, 2015 at 7:10 PM, Richard Wordingham < > richard.wordingham at ntlworld.com> wrote: > > > On Sun, 9 Aug 2015 17:10:01 +0200 > > Mark Davis ☕️ wrote: > > > For example, perhaps the addition of real data to CLDR for a > > > "basic-validity-check" on a language-by-language basis. > > CLDR is currently not useful. Are you really going to get Mayan > > time formats when the script is encoded?
Without them, there will > > be no CLDR data. > That is a misunderstanding. CLDR provides not only locale (language) > specific data for formatting, collation, etc., but also data about > languages. It is not limited to the first. I'm basing my statement on the 'minimal data commitment' listed in http://cldr.unicode.org/index/cldr-spec/minimaldata . If there is a sustained failure to provide 4 main date/time formats, the locale may be removed. > > > It might be > > > possible to use a BNF grammar for the components, for which we are > > > already set up. > > Are you sure? > I said "might be possible". That normally indicates a degree of > uncertainty. That is, "no, I'm not sure". > There is no reason to be unnecessarily argumentative; it doesn't > exactly encourage people to explore solutions to a problem. I was responding to the 'for which we are already set up'. The problem is that canonical equivalence can make it very difficult to specify a syntax. The text segmentation appendices suggest that you have already hit trouble with canonical equivalence; I suspect you have tools set up to prevent such problems recurring. With a view to analysing the requirements of the USE, I investigated the effects of canonical equivalence on regular expressions. I eventually discovered the relevant mathematical theory - it replaces strings by 'traces', which for our purposes are fully decomposed character strings modulo canonical equivalence. I found very little interest in the matter on this list. I gave the example of the regular expression [:InPC=Top:]*[:InPC=Bottom:]* Usefully converting that expression to specify NFD equivalents in accordance with UTS #18 Version 17 Section 2.1 is non-trivial, though it is doable. I have a feeling that some have claimed that an expression like that is already in NFD. > I don't think any algorithmic description would get all and only those > strings that would be acceptable to writers of the language.
What > you'd end up with is a mechanism that had three values: clearly ok > (eg, cat), clearly bogus (eg, a\u0308\u0308\u0308\u0308), and > somewhere in between. What have you got against 8th derivatives? -:) You are looking at a different issue to me. One of the issues is rather that for a word of one syllable, there should only be one order per meaning, appearance and pronunciation for a pair of non-commuting combining marks. For non-Indic scripts, that is generally handled by ensuring that different orders of non-commuting combining marks render differently. > If the goal for the script rules is to cover all languages customarily > written with that script, one way to do that is to develop the > language rules as they come, and make sure that the script rules are > broadened if necessary for each language. But there is also utility > to having the language rules, especially for high-frequency languages. The language rules serve a different function. The sequence "xxxxlttttuuupppp" is clearly not English, but it is a perfectly acceptable string for sorting, searching and rendering. Richard. From charupdate at orange.fr Mon Aug 10 06:08:11 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 10 Aug 2015 13:08:11 +0200 (CEST) Subject: The role of documentation in implementation (was: Re: Windows keyboard limitations) Message-ID: <2027820082.7245.1439204891699.JavaMail.www@wwinf1m23> On 08 Aug 2015, at 15:01, Richard Wordingham wrote: > On Sat, 8 Aug 2015 14:05:17 +0200 (CEST) > Marcel Schneider wrote: > > > 2. Supposed that Windows supported more than four characters per > > ligature: > > > 2.1. Why has the MSKLC been limited to four characters per > > ligature? > > Because that was believed to be the architectural limit. Note however, > that it isn't 4 *characters* that is the limit, but 4 UTF-16 code units. Richard (be it allowed to use first name to conform to the usage) turned out to be the only person to answer one of my listed questions. 
Thanks again and my apologies for the tone of my reply, which finally has been influenced by the tone of the blog post, when I was talking about its author, but this too I'm sorry about and ask Michael to forgive me as we forgive his value lowering.** Remembering that I presumed that the limit was really four in old Windows versions and that it has been changed, I can now go a step further, admitting that the fundamental limit has always been sixteen since ligatures were implemented, and that lowering it to four in one single application was triggered by a cluster of circumstances that produced a belief. To assess the effects of a lack of documentation, we may cross-reference the situation with two quotations from the same header file [MSKLC]\inc\kbd.h, lines 731 sqq. and 751 sqq.: // There is no documented way to detect the manufacturer of the computer that // is currently running an application. However, a Windows-based application // can detect the type of OEM Windows by using the return value of the // GetKeyboardType() function. // Application programs can use these OEM IDs to distinguish the type of OEM // Windows. Note, however, that this method is not documented, so Microsoft // may not support it in the future version of Windows. May we conclude that whenever documentation is missing, we are allowed to rely on test results and other observations? About ligature support, Andrew Glass and previously Michael Kaplan assure that there will be no major change. Given that on Windows 7, sixteen code units are supported on all** current (tested) shift states, this allows us to conclude that stability must not necessarily rely on documentation (unlike it is suggested in the second quotation above). Thanks to the parent thread, I take notice however that when documentation is missing, unusual behavior of software must not be referred to as a bug.
If this thread, which spins off from a closed thread, is not followed up, we may take these statements for granted and build development policies upon them. Best regards, Marcel ** Note: More than four code units per ligature works now on *all* tested shift states. The reason why the unexpected behavior is now eliminated seems to be (in my belief) that VK_OEM_PA1 has been replaced with VK_OEM_AX. To map KBDKANA (the Kana modifier key) to Left Alt, I had redefined scan code T38 as VK_OEM_PA1, found in kbd.h. Now on Sun Aug 09, 2015, 23:16 (yesterday in the evening) I found in "WINUSER.H" (named in capitals) that VK_OEM_AX is far more obvious, as its default scan code stands between two that are actually used on the default French keyboard and are mapped to VK_OEM_8 and VK_OEM_102. I believe that the presence of the less usual VK_OEM_PA1 (part of "Nokia/Ericsson definitions") made Windows more sensitive. I'm happy that this issue is so appeasingly and gratifyingly resolved. I present my apologies for the trouble it has made, as well as my thanks for the many pieces of information it has brought up. Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Mon Aug 10 06:10:34 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 10 Aug 2015 13:10:34 +0200 (CEST) Subject: Implementing SMP on a UTF-16 OS (was: Windows keyboard limitations) Message-ID: <1473422952.7279.1439205034095.JavaMail.www@wwinf1m23> On 08 Aug 2015, at 16:39, Eli Zaretskii wrote: > The Windows WCHAR is a 16-bit data type. What Windows documentation > calls "Unicode characters" are Unicode codepoints encoded in UTF-16. I turn out to be unable to use Unicode codepoints encoded in UTF-16 in the C source of the keyboard driver. When in the static ALLOC_SECTION_LDATA VK_TO_WCHARS9 aVkToWch9[], I use [...],(0xd835,0xdcea),(0xd835,0xdcd0),[...]
I get the error "initializer is not a constant", and when I simply use [...],0xd835 0xdcea,0xd835 0xdcd0,[...] I get the syntax error "constant" with a cascade of comma errors. I note that the MSKLC converts any SMP character mapped on a key to a ligature of a surrogate pair, and that it cannot admit any SMP character in a dead list. Such an MSKLC layout with U+1D4EA and U+1D4D0 works on the built-in Notepad, while Word displays .notdef boxes that convert to code points and then to glyphs using Alt+C twice. About LibreOffice and Notepad++, they are unable to display these characters even when pasted from Word or Notepad. Please don't dismiss this issue towards other mailing lists or fora, because on such topics it is very hard out there to get any useful answer. And please don't lock it out of the Unicode Mailing List, because it's a Unicode implementation topic. Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Mon Aug 10 06:13:20 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 10 Aug 2015 13:13:20 +0200 (CEST) Subject: The role of documentation in implementation (was: Re: Windows keyboard restrictions) Message-ID: <1784631313.7359.1439205200198.JavaMail.www@wwinf1m23> I confused the parent thread labelling. Please read: The role of documentation in implementation (was: Re: Windows keyboard restrictions) On 08 Aug 2015, at 15:01, Richard Wordingham wrote: > On Sat, 8 Aug 2015 14:05:17 +0200 (CEST) > Marcel Schneider wrote: > > > 2. Supposed that Windows supported more than four characters per > > ligature: > > > 2.1. Why has the MSKLC been limited to four characters per > > ligature? > > Because that was believed to be the architectural limit. Note however, > that it isn't 4 *characters* that is the limit, but 4 UTF-16 code units.
Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Mon Aug 10 10:02:44 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 10 Aug 2015 17:02:44 +0200 (CEST) Subject: The role of documentation in implementation (was: Re: Windows keyboard restrictions) Message-ID: <336224144.13348.1439218964690.JavaMail.www@wwinf1m19> I'm brought to draw your attention to the fact that presumably my "buggy" mailbox inverted the order of the Copy Addressees, which I find reversed in both mailboxes where I can follow (but not answer in both, under my name associated with a fitting custom mail address). This order is mainly determined by the intervention chronology.
It may not be important, but formally I'm expected to stick with it. Further, as I was very short of time, I forgot to check that I had added everybody. My apologies. It may be customary to put a person's name and address in Copy instead of as Main Addressee, in order not to seem to urge him to reply, as a more informal sending. Please retrieve the list as I ordered it, taken from my outbox and completed: "Doug Ewell"; "Richard Wordingham"; "Julian Bradfield"; "Andrew Glass (WINDOWS)"; "Andrew Cunningham"; "Eli Zaretskii"; "Asmus Freytag (t)"; "Marc Durdin" On 08 Aug 2015, at 15:01, Richard Wordingham wrote: > On Sat, 8 Aug 2015 14:05:17 +0200 (CEST) > Marcel Schneider wrote: > > > 2. Supposed that Windows supported more than four characters per > > ligature: > > > 2.1. Why has the MSKLC been limited to four characters per > > ligature? > > Because that was believed to be the architectural limit. Note however, > that it isn't 4 *characters* that is the limit, but 4 UTF-16 code units.
Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From petercon at microsoft.com Mon Aug 10 11:38:56 2015 From: petercon at microsoft.com (Peter Constable) Date: Mon, 10 Aug 2015 16:38:56 +0000 Subject: bang mail Message-ID: I don't think it's helpful or even polite to send bang (high priority) mail to this list. Cheers, Peter -------------- next part -------------- An HTML attachment was scrubbed... URL: From petercon at microsoft.com Mon Aug 10 11:51:01 2015 From: petercon at microsoft.com (Peter Constable) Date: Mon, 10 Aug 2015 16:51:01 +0000 Subject: Standardised Encoding of Text In-Reply-To: <20150809193845.5931b6cf@JRWUBU2> References: <20150809120919.1adacf7c@JRWUBU2> <000001d0d299$0929fbc0$1b7df340$@fi> <20150809145811.3d1296d2@JRWUBU2> <20150809193845.5931b6cf@JRWUBU2> Message-ID: Richard, you can always submit a document to UTC with proposed text to add to the Tai Tham block description in a future version. Peter -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Richard Wordingham Sent: Sunday, August 9, 2015 11:39 AM To: Unicode Public Subject: Re: Standardised Encoding of Text On Sun, 9 Aug 2015 17:10:01 +0200 Mark Davis wrote: > While it would be good to document more scripts, and more language > options per script, that is always subject to getting experts signed > up to develop them.
> > What I'd really like to see instead of documentation is a data-based > approach. > > For example, perhaps the addition of real data to CLDR for a > "basic-validity-check" on a language-by-language basis. One aspect this would not help with is with letter forms that do not resemble their forms in the code charts. The code charts usually broadly answer the question "What does this code represent?". They don't answer the question, "What code points represent this glyph?". Problems I've seen in Tai Tham are the use of U+1A57 TAI THAM CONSONANT SIGN LA TANG LAI for the sequence and of for . The problem is that the subscript forms for U+1A43 and U+1A3F are only documented in the proposals. The subscript consonant signs probably add to the confusion of anyone working from the code chart. The people making the errors were far from ignorant of the script. Richard. From petercon at microsoft.com Mon Aug 10 11:52:56 2015 From: petercon at microsoft.com (Peter Constable) Date: Mon, 10 Aug 2015 16:52:56 +0000 Subject: bang mail In-Reply-To: References: Message-ID: Possible exception: you've sent mail with a URL that points to something you learned was malicious and want to advise people not to click on that link. From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Peter Constable Sent: Monday, August 10, 2015 9:39 AM To: unicode at unicode.org Subject: bang mail I don't think it's helpful or even polite to send bang (high priority) mail to this list. Cheers, Peter -------------- next part -------------- An HTML attachment was scrubbed... URL: From dzo at bisharat.net Mon Aug 10 12:00:18 2015 From: dzo at bisharat.net (dzo at bisharat.net) Date: Mon, 10 Aug 2015 17:00:18 +0000 Subject: bang mail In-Reply-To: References: Message-ID: <1855182813-1439226019-cardhu_decombobulator_blackberry.rim.net-1709997657-@b14.c2.bise6.blackberry> Agreed. Thank you, Peter. 
Basic list netiquette IMO (though it seems some people are passing on the "high importance" tags inadvertently when replying). Sent via BlackBerry by AT&T -----Original Message----- From: Peter Constable Sender: "Unicode" Date: Mon, 10 Aug 2015 16:38:56 To: unicode at unicode.org Subject: bang mail I don't think it's helpful or even polite to send bang (high priority) mail to this list. Cheers, Peter -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Mon Aug 10 12:08:10 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 10 Aug 2015 19:08:10 +0200 (CEST) Subject: Implementing SMP on a UTF-16 OS (was: Windows keyboard restrictions) Message-ID: <1348099338.17889.1439226490802.JavaMail.www@wwinf1m19> On 10 Aug 2015, at 13:21, I wrote: > I note that the MSKLC converts any SMP character mapped on a key to a ligature of a surrogate pair, and that it cannot admit any SMP character in a dead list. > Such an MSKLC layout with U+1D4EA and U+1D4D0 works on the built-in Notepad, while Word displays .notdef boxes that convert to code points and then to glyphs using Alt+C twice. I should mention too that my OS is Windows 7 Starter (same version, build number and service pack as Windows Seven). Word is Word Starter 2010. Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Mon Aug 10 12:16:51 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 10 Aug 2015 19:16:51 +0200 (CEST) Subject: bang mail In-Reply-To: References: Message-ID: <898142779.18218.1439227011781.JavaMail.www@wwinf1m19> On 10 Aug 2015, at 18:48, Peter Constable wrote: > I don't think it's helpful or even polite to send bang (high priority) mail to this list. Being the one who did, I apologize (once more) for this pushiness. Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed...
URL: From Andrew.Glass at microsoft.com Mon Aug 10 12:58:24 2015 From: Andrew.Glass at microsoft.com (Andrew Glass (WINDOWS)) Date: Mon, 10 Aug 2015 17:58:24 +0000 Subject: ZWJ as a Ligature Suppressor In-Reply-To: <20150809115820.67e0eead@JRWUBU2> References: <20150809115820.67e0eead@JRWUBU2> Message-ID: Hi Richard, To ligate or not to ligate is up to the font designer. Normally, GSUB lookups that perform ligation will be broken by the presence of ZWJ or ZWNJ. If a font designer wishes to ligate in the presence of a ZWJ or ZWNJ then they could choose to include appropriate glyph sequences in their ligation lookups. For example: glyphA glyphB -> glyphC glyphA ZWJ glyphB -> glyphC Cheers, Andrew Andrew Glass Ph.D. Program Manager Shell Text Input Group | Windows | Microsoft -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Richard Wordingham Sent: Sunday, August 9, 2015 3:58 AM To: unicode at unicode.org Subject: ZWJ as a Ligature Suppressor According to the text just after TUS 7.0.0 Figure 23-3 (http://www.unicode.org/versions/Unicode7.0.0/ch23.pdf#G25237), ZWJ suppresses ligatures in Arabic script. Does this rule apply to other normally cursive joined scripts, e.g. Syriac and Mongolian? Am I right in thinking that for an OpenType font for other scripts, the font writer must take precautions to prevent ZWJ accidentally suppressing ligatures that would be better suppressed by ZWNJ or ? 
From richard.wordingham at ntlworld.com Mon Aug 10 13:26:03 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 10 Aug 2015 19:26:03 +0100 Subject: ZWJ as a Ligature Suppressor In-Reply-To: References: <20150809115820.67e0eead@JRWUBU2> Message-ID: <20150810192603.5dafd0a7@JRWUBU2> On Mon, 10 Aug 2015 17:58:24 +0000 "Andrew Glass (WINDOWS)" wrote: I had asked: >> According to the text just after TUS 7.0.0 Figure 23-3 >> (http://www.unicode.org/versions/Unicode7.0.0/ch23.pdf#G25237), ZWJ >> suppresses ligatures in Arabic script. Does this rule apply to other >> normally cursive joined scripts, e.g. Syriac and Mongolian? > To ligate or not to ligate is up to the font designer. Normally, GSUB > lookups that perform ligation will be broken by the presence of ZWJ > or ZWNJ. If a font designer wishes to ligate in the presence of a ZWJ > or ZWNJ then they could choose to include appropriate glyph sequences > in their ligation lookups. For example: > glyphA glyphB -> glyphC > glyphA ZWJ glyphB -> glyphC So, any rule as to what ZWJ means is not implemented in the OpenType engine, but rather in the font. (As is the rule that 'a' does not look like 'b'.) For which scripts may a font designer defensibly omit the duplicate with ZWJ? The TUS says Arabic is one. Are there any others? Richard. From khaledhosny at eglug.org Mon Aug 10 14:00:41 2015 From: khaledhosny at eglug.org (Khaled Hosny) Date: Mon, 10 Aug 2015 21:00:41 +0200 Subject: ZWJ as a Ligature Suppressor In-Reply-To: References: <20150809115820.67e0eead@JRWUBU2> Message-ID: <20150810190041.GA29473@khaled-laptop> This is not always true, some rendering engines (like HarfBuzz) try to follow the Unicode rules so ZWJ does not break ligatures except in Arabic where the standard says it should be interpreted as . Regards, Khaled On Mon, Aug 10, 2015 at 05:58:24PM +0000, Andrew Glass (WINDOWS) wrote: > Hi Richard, > > To ligate or not to ligate is up to the font designer. 
Normally, GSUB lookups that perform ligation will be broken by the presence of ZWJ or ZWNJ. If a font designer wishes to ligate in the presence of a ZWJ or ZWNJ then they could choose to include appropriate glyph sequences in their ligation lookups. For example: > > glyphA glyphB -> glyphC > glyphA ZWJ glyphB -> glyphC > > Cheers, > > Andrew > > > Andrew Glass Ph.D. > Program Manager > Shell Text Input Group | Windows | Microsoft > > -----Original Message----- > From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Richard Wordingham > Sent: Sunday, August 9, 2015 3:58 AM > To: unicode at unicode.org > Subject: ZWJ as a Ligature Suppressor > > According to the text just after TUS 7.0.0 Figure 23-3 (http://www.unicode.org/versions/Unicode7.0.0/ch23.pdf#G25237), ZWJ suppresses ligatures in Arabic script. Does this rule apply to other normally cursive joined scripts, e.g. Syriac and Mongolian? > > Am I right in thinking that for an OpenType font for other scripts, the font writer must take precautions to prevent ZWJ accidentally suppressing ligatures that would be better suppressed by ZWNJ or ? From unicode at maxtruxa.com Mon Aug 10 14:08:55 2015 From: unicode at maxtruxa.com (Max Truxa) Date: Mon, 10 Aug 2015 21:08:55 +0200 Subject: Implementing SMP on a UTF-16 OS Message-ID: Hi Marcel, from what I can see in the short piece of code you posted, it looks like you are trying to somehow "group" the surrogate pairs (which does not make any sense to me). Correct syntax would be: [...] 0xD835, 0xDCEA, 0xD835, 0xDCD0, [...] IMO this mailing list is not the right place for questions about C syntax, is it not? 
Best regards, Max From charupdate at orange.fr Mon Aug 10 14:46:39 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 10 Aug 2015 21:46:39 +0200 (CEST) Subject: Implementing SMP on a UTF-16 OS (was: Windows keyboard restrictions) In-Reply-To: References: Message-ID: <2079848060.15640.1439235999068.JavaMail.www@wwinf1d19> Hi Max, On 10 Aug 2015, at 20:25, Max Truxa wrote: > IMO this mailing list is not the right place for questions about C syntax, is it not? Indeed it isn't. If it were not about Unicode implementation, I wouldn't have sent it to the Unicode Mailing List. Whatever the language, it's about getting SMP characters into the place where the OS stores the keyboard layout. I stick with the idea that this mustn't stay limited to the BMP. IMHO the ligatures made of a surrogate pair in MSKLC keyboards are a sort of workaround, while something isn't really fit for UTF-16. The idea wasn't to throttle the OS down to the BMP, was it? The SMP simply didn't exist yet. Now it does, and things get screwed up. > Correct syntax would be: [...] 0xD835, 0xDCEA, 0xD835, 0xDCD0, [...] The problem with the commas here is that they don't only separate, they increment the modification number. The trailing surrogate must stay together with the leading one on the same shift state. As I say, it's got screwed up. However, I'll try with just removing the parentheses, hoping that the surrogates will be automatically grouped together. By contrast, I have the good news that the test SMP keyboard layout works on Word 2013. When I press the key with U+1D4EA and U+1D4D0, the glyphs are directly inserted. So there's one less problem. Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed...
URL: From richard.wordingham at ntlworld.com Mon Aug 10 15:25:03 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 10 Aug 2015 21:25:03 +0100 Subject: Implementing SMP on a UTF-16 OS (was: Windows keyboard restrictions) In-Reply-To: <2079848060.15640.1439235999068.JavaMail.www@wwinf1d19> References: <2079848060.15640.1439235999068.JavaMail.www@wwinf1d19> Message-ID: <20150810212503.58cc24d0@JRWUBU2> On Mon, 10 Aug 2015 21:46:39 +0200 (CEST) Marcel Schneider wrote: > > Correct syntax would be: [...] 0xD835, 0xDCEA, 0xD835, 0xDCD0, [...] > The problem with the commas here is that they don't only separate, > they increment the modification number. The trailing surrogate must > stay together with the leading one on the same shift state. Non-BMP characters must be entered as 'ligatures'. > By contrast, I've the good news to bring in that the test SMP > keyboard layout works on Word 2013. When I press the key with U+1D4EA > and U+1D4D0, the glyphs are directly inserted. So there's one less > problem. Curiously, I think it would work for my cuneiform keyboard with Word 2002 if I hadn't chosen the only Mesopotamian locale available at the time, Iraqi Arabic. I think I'm suffering from Word being too clever by half - the font and keyboard kept changing when I claimed the text was left-to-right. Perhaps Word 2002 has received better patches. The first time I used it on Windows 7 some Thai was rendered with black boxes. I installed the extensions to handle OpenDocument, and the problems went away. If I change the font back to a Hittite font I have, the text appears as it should. I created the keyboard for doing Babylonian maths, so I don't think a Turkish locale would have been appropriate, despite my using a Hittite font. It would probably work better, though. Richard.
From richard.wordingham at ntlworld.com Mon Aug 10 15:25:20 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 10 Aug 2015 21:25:20 +0100 Subject: Implementing SMP on a UTF-16 OS In-Reply-To: References: Message-ID: <20150810212520.53544f3e@JRWUBU2> On Mon, 10 Aug 2015 21:08:55 +0200 Max Truxa wrote: > Correct syntax would be: [...] 0xD835, 0xDCEA, 0xD835, 0xDCD0, [...] > IMO this mailing list is not the right place for questions about C > syntax, is it not? View it as a gripe about UTF-16 and how confusing things get when code units are referred to as characters. From charupdate at orange.fr Mon Aug 10 15:53:11 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 10 Aug 2015 22:53:11 +0200 (CEST) Subject: Implementing SMP on a UTF-16 OS In-Reply-To: <20150810212503.58cc24d0@JRWUBU2> References: <2079848060.15640.1439235999068.JavaMail.www@wwinf1d19> <20150810212503.58cc24d0@JRWUBU2> Message-ID: <1761086697.16696.1439239991108.JavaMail.www@wwinf1d19> On Mon, 10 Aug 2015, at 22:33, Richard Wordingham wrote: > Non-BMP characters must be entered as 'ligatures'. This is bad news for a universal Latin keyboard layout, where a number of SMP characters should be available through dead keys, or Compose. We can implement Compose as a dead key chaining tree, but it seems to be limited to the BMP. The mathematical letters are part of the symbols, and it would be handy to get them too with dead keys, as Compose, &, &, for the script alphabet. But the deadtrans combined character argument must be one code unit, not one character. So there seems to be no place for SMP. This is clearly a Unicode implementation problem. C and C++ should be standardized for handling of UTF-16. IMO we cannot consider that Windows supports UTF-16 for internal use, if it does not support surrogate pairs except with workarounds using ligatures. I may be wrong, but that's how I see the problem now. 
Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Mon Aug 10 15:58:32 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 10 Aug 2015 21:58:32 +0100 Subject: Implementing SMP on a UTF-16 OS In-Reply-To: <1761086697.16696.1439239991108.JavaMail.www@wwinf1d19> References: <2079848060.15640.1439235999068.JavaMail.www@wwinf1d19> <20150810212503.58cc24d0@JRWUBU2> <1761086697.16696.1439239991108.JavaMail.www@wwinf1d19> Message-ID: <20150810215832.5c3246cf@JRWUBU2> On Mon, 10 Aug 2015 22:53:11 +0200 (CEST) Marcel Schneider wrote: > On Mon, 10 Aug 2015, at 22:33, Richard Wordingham wrote: > > Non-BMP characters must be entered as 'ligatures'. > This is clearly a Unicode implementation problem. C and C++ should be > standardized for handling of UTF-16. IMO we cannot consider that > Windows supports UTF-16 for internal use, if it does not support > surrogate pairs except with workarounds using ligatures. Perhaps this is why Windows offers a new method of keyboard mapping, via the Text Services Framework (TSF). > I may be wrong, but that's how I see the problem now. I think you're not looking hard enough. Richard. From unicode at maxtruxa.com Tue Aug 11 01:27:09 2015 From: unicode at maxtruxa.com (Max Truxa) Date: Tue, 11 Aug 2015 08:27:09 +0200 Subject: Implementing SMP on a UTF-16 OS In-Reply-To: <1761086697.16696.1439239991108.JavaMail.www@wwinf1d19> References: <2079848060.15640.1439235999068.JavaMail.www@wwinf1d19> <20150810212503.58cc24d0@JRWUBU2> <1761086697.16696.1439239991108.JavaMail.www@wwinf1d19> Message-ID: On Aug 10, 2015 10:53 PM, "Marcel Schneider" wrote: > > This is clearly a Unicode implementation problem. C and C++ should be standardized for handling of UTF-16. IMO we cannot consider that Windows supports UTF-16 for internal use, if it does not support surrogate pairs except with workarounds using ligatures. 
C and C++ *are* "standardized for handling of UTF-16"... and UTF-8... and UTF-32. If you are interested in this topic just search for "C++ Unicode string literals" and "C++ Unicode character literals" which are standardized since C11/C++11 (with the exception of UTF-8 character literals which will follow in C++11; don't know about C though). The reason you won't be able to easily use these features is because the compiler shipping with the WDK is still only supporting C89/C90. And sadly for us driver developers Microsoft will not change this. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at maxtruxa.com Tue Aug 11 01:29:59 2015 From: unicode at maxtruxa.com (Max Truxa) Date: Tue, 11 Aug 2015 08:29:59 +0200 Subject: Implementing SMP on a UTF-16 OS In-Reply-To: References: <2079848060.15640.1439235999068.JavaMail.www@wwinf1d19> <20150810212503.58cc24d0@JRWUBU2> <1761086697.16696.1439239991108.JavaMail.www@wwinf1d19> Message-ID: On Aug 11, 2015 8:27 AM, "Max Truxa" wrote: > > (with the exception of UTF-8 character literals which will follow in C++11; don't know about C though). Sorry for that typo. UTF-8 character literals will follow in C++17. -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Tue Aug 11 02:35:38 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Tue, 11 Aug 2015 09:35:38 +0200 Subject: Bogus glyphs for halfwidth characters Message-ID: For halfwidth characters like U+FF70 HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK I'm getting bogus glyphs from Kaiti* and Songti* (and STKaiti, STSongti). See screenshot below. Anyone else see that, or know what is happening? Unfortunately, these fonts are getting picked up first in the fallback chain for my browser, so they are pretty apparent! [image: Inline image 1] Mark -------------- next part -------------- An HTML attachment was scrubbed... 
URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Screen Shot 2015-08-11 at 09.32.26.png Type: image/png Size: 27906 bytes Desc: not available URL: From A.Schappo at lboro.ac.uk Tue Aug 11 05:07:13 2015 From: A.Schappo at lboro.ac.uk (Andre Schappo) Date: Tue, 11 Aug 2015 10:07:13 +0000 Subject: Bogus glyphs for halfwidth characters In-Reply-To: References: Message-ID: <71A7E9CE-3BC3-4B75-98C4-8CD672C8647A@lboro.ac.uk> Yes. Ditto. Mac OSX 10.10.4 Broken CMAPs? André Schappo On 11 Aug 2015, at 08:35, Mark Davis ☕️ wrote: For halfwidth characters like U+FF70 HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK I'm getting bogus glyphs from Kaiti* and Songti* (and STKaiti, STSongti). See screenshot below. Anyone else see that, or know what is happening? Unfortunately, these fonts are getting picked up first in the fallback chain for my browser, so they are pretty apparent! Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From A.Schappo at lboro.ac.uk Tue Aug 11 11:00:24 2015 From: A.Schappo at lboro.ac.uk (Andre Schappo) Date: Tue, 11 Aug 2015 16:00:24 +0000 Subject: Bogus glyphs for halfwidth characters In-Reply-To: References: Message-ID: The bug is consistent. All the below fonts are by Changzhou SinoType Technology and U+FF70 is at font glyph 147 André Schappo On 11 Aug 2015, at 08:35, Mark Davis ☕️ wrote: For halfwidth characters like U+FF70 HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK I'm getting bogus glyphs from Kaiti* and Songti* (and STKaiti, STSongti). See screenshot below. Anyone else see that, or know what is happening? Unfortunately, these fonts are getting picked up first in the fallback chain for my browser, so they are pretty apparent! Mark -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From tom at bluesky.org Tue Aug 11 11:10:35 2015 From: tom at bluesky.org (Tom Gewecke) Date: Tue, 11 Aug 2015 12:10:35 -0400 Subject: Bogus glyphs for halfwidth characters In-Reply-To: References: Message-ID: <08FF27D2-971B-4A40-8D1E-F6DA202AB8E8@bluesky.org> It looks like glyphs 132-194 are all mislabeled as halfwidth katakana-hiragana. On Aug 11, 2015, at 12:00 PM, Andre Schappo wrote: > > The bug is consistent. All the below fonts are by Changzhou SinoType Technology and U+FF70 is at font glyph 147 > > André Schappo > > On 11 Aug 2015, at 08:35, Mark Davis ☕️ wrote: > >> For halfwidth characters like U+FF70 HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK I'm getting bogus glyphs from Kaiti* and Songti* (and STKaiti, STSongti). See screenshot below. >> >> Anyone else see that, or know what is happening? Unfortunately, these fonts are getting picked up first in the fallback chain for my browser, so they are pretty apparent! >> >> >> >> Mark >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Tue Aug 11 13:49:08 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Tue, 11 Aug 2015 20:49:08 +0200 (CEST) Subject: bang mail In-Reply-To: <000001d0d3a6$696939c0$3c3bad40$@fi> References: <898142779.18218.1439227011781.JavaMail.www@wwinf1m19> <000001d0d3a6$696939c0$3c3bad40$@fi> Message-ID: <1160363707.13272.1439318948241.JavaMail.www@wwinf1f27> I should not bring more explanations about a behavior of mine that has been identified as inappropriate. However, in this particular circumstance, I'd like to outline briefly why I banged my e-mails: 1 - I observed that at normal priority, an e-mail takes much more time to be received. 2 - As I often take much time and pains to write accurate e-mails, I looked for a way to get the addressee to at least take a glance among the mass of e-mails that is said to be constantly received. 
3 - When I'd e-mailed the List while forgetting to set the bang, I thought, "argh, I hope the addressees won't notice that I made a difference in treatment." Thanks to the Unicode Mailing List, I have now learned that an e-mail is better viewed when the prioritization tool hasn't been used. That's very useful, and I'd like to personally thank Peter for having taken the initiative of warning me, as well as dzo and Erkki for having answered in this thread. Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Tue Aug 11 13:51:33 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Tue, 11 Aug 2015 20:51:33 +0200 (CEST) Subject: The role of documentation in implementation (was: Re: Windows keyboard restrictions) In-Reply-To: <336224144.13348.1439218964690.JavaMail.www@wwinf1m19> References: <336224144.13348.1439218964690.JavaMail.www@wwinf1m19> Message-ID: <1679568348.13334.1439319093909.JavaMail.www@wwinf1f27> I've got a driver with a five-code-unit ligature on Shift+Ctrl+Alt, and with which Word (and Excel) opened. As I was in a hurry and wrote in English, I didn't notice that the dead keys were disabled. That was the driver I was writing about when I spun off this thread. Now I've compiled a driver where the only difference is that two complete lines (the ones with the ellipses) are swapped, at a place that isn't sort-sensitive. Word and Excel are blocked. I tried the driver above again: Firefox and Zotero are blocked. It's very hard for me to write this to the Mailing List, but honestly I must admit that, not having enough time, I didn't work properly. All that, and some more, leads me to the conclusion that when Windows was built, there was often not enough time to write up the documentation; or there was a fear that such documentation could be copied and carried away. So the teams were told not to waste any time on it. 
These are suppositions, but as nobody at Microsoft may disclose any information about how things were done (for the same reason that there's little documentation), we're reduced to building our own views, to get at least some working idea. So now I believe that when Michael Kaplan did his own tests, he found out that there's a problem when he put on Shift+Ctrl+Alt a ligature that exceeded four code units, and that he asked some colleagues but nobody knew anything about it, so he remembered the header file he had seen (but that perhaps he couldn't find again because it hadn't been documented). Really, when 16 units work on all shift states except one, an official keyboard layout tool must equalize the limit at the lowest level. If a user read that he could put up to 16 on all shift states except on Shift+AltGr, where he could put up to 4, he would get a curious feeling. It's like Liebig's rule: the lowest level determines the overall limit. But when developing ready-to-use keyboard layouts with the WDK, as Michael Kaplan seems to suggest in the MSKLC glossary, we aren't bound to stick with the safe limit and may feel free to place as many units as we find actually working. Well, when I place fewer than five on Shift+Ctrl+Alt, I'm not forced to divulge that more wouldn't work there. I'm not meant to write up the documentation Microsoft didn't :-) Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmus-inc at ix.netcom.com Tue Aug 11 14:08:48 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Tue, 11 Aug 2015 12:08:48 -0700 Subject: The role of documentation in implementation (was: Re: Windows keyboard restrictions) In-Reply-To: <1679568348.13334.1439319093909.JavaMail.www@wwinf1f27> References: <336224144.13348.1439218964690.JavaMail.www@wwinf1m19> <1679568348.13334.1439319093909.JavaMail.www@wwinf1f27> Message-ID: <55CA4840.8090109@ix.netcom.com> An HTML attachment was scrubbed... 
URL: From charupdate at orange.fr Tue Aug 11 14:27:27 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Tue, 11 Aug 2015 21:27:27 +0200 (CEST) Subject: Implementing SMP on a UTF-16 OS Message-ID: <362174827.4688.1439321247508.JavaMail.www@wwinf1j19> On 10 Aug 2015, at 21:45, Max Truxa wrote: > from what I can see in the short piece of code you posted, it looks > like you are trying to somehow "group" the surrogate pairs (which does > not make any sense to me). > Correct syntax would be: [...] 0xD835, 0xDCEA, 0xD835, 0xDCD0, [...] On 10 Aug 2015, at 23:06, Richard Wordingham wrote: > On Mon, 10 Aug 2015 22:53:11 +0200 (CEST) > Marcel Schneider wrote: > > > On Mon, 10 Aug 2015, at 22:33, Richard Wordingham wrote: > > > > Non-BMP characters must be entered as 'ligatures'. > > > This is clearly a Unicode implementation problem. C and C++ should be > > standardized for handling of UTF-16. IMO we cannot consider that > > Windows supports UTF-16 for internal use, if it does not support > > surrogate pairs except with workarounds using ligatures. > > Perhaps this is why Windows offers a new method of keyboard > mapping, via the Text Services Framework (TSF). > > > I may be wrong, but that's how I see the problem now. > > I think you're not looking hard enough. I've tried to just remove the parentheses and let the string. This was compiled, but the keyboard test showed that in the keyboard driver DLL, UTF-16 strings with SMP characters aren't handled as such. Each surrogate code unit is considered as a single character even when it's followed by a trailing one. Only the code unit corresponding to the shift state (modification number) is taken, no matter if it's only a surrogate and the other half comes next. Windows can handle 32-bit code units. I found evidence in C:\WinDDK\7600.16385.1\inc\api\functiondiscoverykeys.h. 
So I tried this in the driver source: {'A' /*T10 D01*/ ,0x01 ,'a' ,'A' ,NONE ,0xd835dcea ,0xd835dcd0 ,0x00e6 ,0x00c6 ,NONE ,NONE }, // ,0x0061 ,0x0041 But the compiler returned: warning C4305: 'initializing' : truncation from 'unsigned int' to 'WCHAR' and: error C2220: warning treated as error - no 'object' file generated I understand that the compiler correctly read the first of the 32-bit integers, but as it expected a WCHAR here, it dropped 16 bits and wouldn't go on. On 11 Aug 2015, at 8:27, Max Truxa wrote [corrected typo following your next e-mail]: > On Aug 10, 2015 10:53 PM, "Marcel Schneider" wrote: > > > > This is clearly a Unicode implementation problem. C and C++ should be standardized for handling of UTF-16. IMO we cannot consider that Windows supports UTF-16 for internal use, if it does not support surrogate pairs except with workarounds using ligatures. > C and C++ *are* "standardized for handling of UTF-16"... and UTF-8... and UTF-32. > If you are interested in this topic just search for "C++ Unicode string literals" and "C++ Unicode character literals" which are standardized since C11/C++11 (with the exception of UTF-8 character literals which will follow in C++17; don't know about C though). > The reason you won't be able to easily use these features is because the compiler shipping with the WDK is still only supporting C89/C90. And sadly for us driver developers Microsoft will not change this. Is this the reason why a Unicode character cannot be represented alternatively as a 32-bit integer on Windows? Being UTF-16, the OS could handle a complete surrogate pair in one single 32-bit integer. Couldn't this be performed at driver level by modifying a program and updating this when the driver is installed? If yes, we must modify the interface so that keyboard driver DLLs are really read in UTF-16. And/or we must find another compiler. Must the Windows driver be compiled by a Microsoft compiler? 
Meanwhile, the only workaround I see for getting SMP characters in the deadtrans list is that these must be programmed on two entries, so that a user must type for example Compose, &, &, &, A, 1, and then Compose, &, &, &, A, 2, to get 𝓐 (bold script, when normal script is with two ampersands, and 'with curl', one ampersand). (Instead of 1 and 2 we can also choose l for leading, and t for trailing.) Normally a user should be able to get this letter with five keystrokes, not ten. In Word we've already got an autocorrect for script letters (??,???), so we should add another series for bold script (which is bolder than 'bold' 'script'). But as that works in Office, not in Notepad and elsewhere, a keyboard driver or TSF based solution is preferable, also because typing \ s c r i p t a Space Backspace is already ten keystrokes, too! (A trailing backslash would save one.) Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Tue Aug 11 16:06:51 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 11 Aug 2015 22:06:51 +0100 Subject: Implementing SMP on a UTF-16 OS In-Reply-To: <362174827.4688.1439321247508.JavaMail.www@wwinf1j19> References: <362174827.4688.1439321247508.JavaMail.www@wwinf1j19> Message-ID: <20150811220651.7ed376c1@JRWUBU2> On Tue, 11 Aug 2015 21:27:27 +0200 (CEST) Marcel Schneider wrote: > I've tried to just remove the parentheses and let the string. This > was compiled, but the keyboard test showed that in the keyboard > driver DLL, UTF-16 strings with SMP characters aren't handled as > such. Each surrogate code unit is considered as a single character > even when it's followed by a trailing one. Only the code unit > corresponding to the shift state (modification number) is taken, no > matter if it's only a surrogate and the other half comes next. This is exactly what one should expect. 
The data is an array of UTF-16 code units rather than a UTF-16 string. Moreover, it was probably written as UCS-2. I believe it is the application that has the job of stitching the surrogate pairs together. > Is this the reason why a Unicode character cannot be represented > alternatively as a 32-bit integer on Windows? They are, from time to time. There's a Windows message that delivers a supplementary character rather than a UTF-16 code unit, and fairly obviously they have to be handled as such when performing font lookups. I've a suspicion that this message hit an interoperability problem. A program that can handle pairs of surrogates but predates the message will not work with the more recent message. Therefore using the message type is deferred until applications can handle it. Therefore applications don't need to handle it, and don't. Therefore the message type doesn't get used. > Being UTF-16, the OS > could handle a complete surrogate pair in one single 32-bit integer. > Couldn't this be performed at driver level by modifying a program and > updating this when the driver is installed? You're really talking about a parallel set of routines. I suspect the answer is that Microsoft don't want to work on extending a primitive keyboarding system when TSF is available. You want to use dead keys. Why? Is it not that they are the only mechanism you have experience of? Better systems can be built, in which one sees what one is doing. Is it not much better to type 'e' and then a circumflex, and see the 'e' and then the 'e' with a circumflex? Dead keys are an imitation of a limitation of typewriter technology. If I were typing cuneiform, I'd much rather type 'bi⟨COMMIT⟩' and see the growing sequence 'b', 'bi', '⟨CUNEIFORM SIGN BI⟩' as I typed. (What you have for a ⟨COMMIT⟩ key is your choice.) TSF lets one do this. A simple extension of the keyboard definition DLLs generated by MSKLC does not. What you should be pressing for is a usable tutorial on how to do this in TSF. 
> If yes, we must modify the interface so that keyboard driver DLLs are > really read in UTF-16. And/or we must find another compiler. > > Must the Windows driver be compiled by a Microsoft compiler? The compiler is not the issue. The point is that the 16-bit code exists, and programs that use the 16-bit API exist. Language upgrades may make supplementary characters easier to use in programs, but that is all. They don't change existing binary interfaces. Richard. From marc at keyman.com Tue Aug 11 16:47:26 2015 From: marc at keyman.com (Marc Durdin) Date: Tue, 11 Aug 2015 21:47:26 +0000 Subject: Michael Kaplan leaves Microsoft Message-ID: <1CEDD746887FFF4B834688E7AF5FDA5A821BF3CA@federation.tavultesoft.local> This is a little off topic, but I would just like to pay some respect to Michael who has been on disability leave from Microsoft for some time and has just been RIF'ed (see his blog titled "Something Happened" http://www.siao2.com/2015/08/11/8770668856267197009.aspx). Thank you to Michael for your tireless work over many years in Unicode and i18n work in Windows, for writing MSKLC and for the vast array of knowledge you collected and shared in your blog over many years. I wish you all the best in the future. Marc CEO, Tavultesoft Pty Ltd Keyman: Type to the world in your language PO Box 550 Sandy Bay TAS 7006 AUSTRALIA ph: +61 3 6225 1665 mobile: +61 400 737 106 fax: +61 3 9923 6047 email: marc at keyman.com web: keyman.com skype: mcdurdin twitter: @MarcDurdin (personal) @keyman facebook: keymanapp google+: keymanapp -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From charupdate at orange.fr Wed Aug 12 03:09:36 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Wed, 12 Aug 2015 10:09:36 +0200 (CEST) Subject: Michael Kaplan leaves Microsoft In-Reply-To: <1CEDD746887FFF4B834688E7AF5FDA5A821BF3CA@federation.tavultesoft.local> References: <1CEDD746887FFF4B834688E7AF5FDA5A821BF3CA@federation.tavultesoft.local> Message-ID: <177879370.5993.1439366976157.JavaMail.www@wwinf1c20> I'm very sad. This news hits hard. I hasten (having e-mailed what I have) to join my thanks and wishes to Marc's. Thank you, Marc, for having brought it to us. No, I feel this isn't off topic :-(| Marcel On 11 Aug 2015, at 23:56, Marc Durdin wrote: This is a little off topic, I would just like to pay some respect to Michael who has been on disability leave from Microsoft for some time and has just been RIF'ed (see his blog titled "Something Happened" http://www.siao2.com/2015/08/11/8770668856267197009.aspx). Thank you to Michael for your tireless work over many years in Unicode and i18n work in Windows, for writing MSKLC and for the vast array of knowledge you collected and shared in your blog over many years. I wish you all the best in the future. Marc CEO, Tavultesoft Pty Ltd > Keyman: Type to the world in your language > PO Box 550 Sandy Bay TAS 7006 AUSTRALIA > ph: +61 3 6225 1665 mobile: +61 400 737 106 fax: +61 3 9923 6047 > email: marc at keyman.com web: keyman.com skype: mcdurdin > twitter: @MarcDurdin (personal) @keyman > facebook: keymanapp google+: keymanapp -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From Andrew.Glass at microsoft.com Wed Aug 12 11:54:26 2015 From: Andrew.Glass at microsoft.com (Andrew Glass (WINDOWS)) Date: Wed, 12 Aug 2015 16:54:26 +0000 Subject: ZWJ as a Ligature Suppressor In-Reply-To: <20150810192603.5dafd0a7@JRWUBU2> References: <20150809115820.67e0eead@JRWUBU2> <20150810192603.5dafd0a7@JRWUBU2> Message-ID: [Speaking for Uniscribe] >So, any rule as to what ZWJ means is not implemented in the OpenType engine, but rather in the font. (As is the rule that 'a' does not look like 'b'.) Our Arabic and Universal Shaping Engines understand that ZWJ invokes a joining form for joining scripts. Ligation is handled by the font. The presence of ZWJ invokes the joining forms, but since it is not included in the ligature lookup, the ligature does not form. The ZWNJ does not invoke a joining form. Thus we can achieve the forms illustrated in Figure 23-3. The fi case is different because Latin is not a joining script. Furthermore, the ligated form ﬁ, when supported, is usually a discretionary ligature. Therefore to achieve the Latin forms in 23-3, I would attach a lookup for the fi substitution to , and specify a substitution that includes ZWJ under . As Chapter 23 states (TUS 7.0, p. 804), there is no way to request a discretionary ligature in plain text for Arabic (and other joining scripts). >For which scripts may a font designer defensibly omit the duplicate with ZWJ? The TUS says Arabic is one. Are there any others? In general I would say that a designer can omit the lookup with ZWJ for joining scripts that include ligated forms. Our Mongolian font has ligatures that behave in the same way as Arabic, in that they can be blocked, but the components still join. Our Syriac fonts have lookups but these are cosmetic and don't produce a visually distinct ligated form, so the impact of ZWJ is negligible, but effectively still the same as Arabic. Our other joining scripts don't have ligatures. 
For Indic scripts see TUS (7.0) Figure 12.7: http://www.unicode.org/versions/Unicode7.0.0/ch12.pdf Cheers, Andrew From petercon at microsoft.com Wed Aug 12 11:55:51 2015 From: petercon at microsoft.com (Peter Constable) Date: Wed, 12 Aug 2015 16:55:51 +0000 Subject: Implementing SMP on a UTF-16 OS In-Reply-To: References: <2079848060.15640.1439235999068.JavaMail.www@wwinf1d19> <20150810212503.58cc24d0@JRWUBU2> <1761086697.16696.1439239991108.JavaMail.www@wwinf1d19> Message-ID: I'm no expert on driver development, but Max's comments got me curious. "Windows Driver Kit (WDK) 10 is integrated with Microsoft Visual Studio 2015..." https://msdn.microsoft.com/en-us/library/windows/hardware/ff557573(v=vs.85).aspx "In Visual Studio 2015, the C++ compiler and standard library have been updated with enhanced support for C++11 and initial support for certain C++14 features. They also include preliminary support for certain features expected to be in the C++17 standard." https://msdn.microsoft.com/en-us/library/hh409293.aspx From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Max Truxa Sent: Monday, August 10, 2015 11:27 PM To: Marcel Schneider Cc: Unicode Mailing List Subject: Re: Implementing SMP on a UTF-16 OS On Aug 10, 2015 10:53 PM, "Marcel Schneider" > wrote: > > This is clearly a Unicode implementation problem. C and C++ should be standardized for handling of UTF-16. IMO we cannot consider that Windows supports UTF-16 for internal use, if it does not support surrogate pairs except with workarounds using ligatures. C and C++ *are* "standardized for handling of UTF-16"... and UTF-8... and UTF-32. If you are interested in this topic just search for "C++ Unicode string literals" and "C++ Unicode character literals" which are standardized since C11/C++11 (with the exception of UTF-8 character literals which will follow in C++11; don't know about C though). 
The reason you won't be able to easily use these features is because the compiler shipping with the WDK is still only supporting C89/C90. And sadly for us driver developers Microsoft will not change this. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at maxtruxa.com Thu Aug 13 01:54:21 2015 From: unicode at maxtruxa.com (Max Truxa) Date: Thu, 13 Aug 2015 08:54:21 +0200 Subject: Implementing SMP on a UTF-16 OS In-Reply-To: References: <2079848060.15640.1439235999068.JavaMail.www@wwinf1d19> <20150810212503.58cc24d0@JRWUBU2> <1761086697.16696.1439239991108.JavaMail.www@wwinf1d19> Message-ID: On Aug 12, 2015 6:55 PM, "Peter Constable" wrote: > > > ?In Visual Studio 2015, the C++ compiler and standard library have been updated with enhanced support for C++11 and initial support for certain C++14 features. They also include preliminary support for certain features expected to be in the C++17 standard.? > You are right. I have to admit that my statement was not 100% correct. Traditionally drivers for Windows are built using C (not C++). Part of the reason for this is that Microsoft did not officially support C++ in kernel code up to the WDK 8. Nowadays there is the /kernel switch which enables a subset of C++ which Microsoft considers safe to use in kernel mode. The most recent C standard fully supported is still C89 though (plus a few C99 features that were added with VS2013). C99 support is *far* from being complete and I don't know of a single C11 feature being implemented. This means C++11 could be used in a driver but one would need to "convert" the driver to C++ (or at least those sources that make use of modern features). Anyhow, Marcel could certainly declare the mapping in a .cpp (using extern "C" to ensure interoperability with C code) but that wouldn't change that surrogate pairs seem to be unsupported for keyboard drivers. (Like I said I have no experience writing keyboard drivers so I can't confirm this.) 
From doug at ewellic.org Thu Aug 13 09:58:40 2015 From: doug at ewellic.org (Doug Ewell) Date: Thu, 13 Aug 2015 07:58:40 -0700 Subject: Update on flag tags (PRI #299)? Message-ID: <20150813075840.665a7a7059d7ee80bb4d670165c8327d.bbcb98170c.wbe@email03.secureserver.net> The recently posted minutes from UTC #144 include the following: > B.11.1.1.3 PRI 299 feedback and mailing list discussion [Edberg, > L2/15-210] > > Discussion. UTC took no action at this time. and: > [144-A93] Action Item for Rick McGowan: Close PRI #299, saying: All > feedback has been considered and will be part of the deliberations for > possible future extension mechanisms. This sounds like the entire flag-tag proposal described by PRI #299 has been put on indefinite hold. Is that accurate? -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From charupdate at orange.fr Thu Aug 13 10:07:50 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Thu, 13 Aug 2015 17:07:50 +0200 (CEST) Subject: Implementing SMP on a UTF-16 OS Message-ID: <916581674.16210.1439478470453.JavaMail.www@wwinf1c10> [Given the bad news we got, I kept this in my draft folder from 12 Aug 11:12 on.] On 11 Aug 2015, at 23:18, Richard Wordingham wrote [I've replaced < > with ???, as already I've got a disappearance and am not sure whether once <> converted, a second conversion won't happen]: > On Tue, 11 Aug 2015 21:27:27 +0200 (CEST) > Marcel Schneider wrote: > > > I've tried to just remove the parentheses and let the string. This > > was compiled, but the keyboard test showed that in the keyboard > > driver DLL, UTF-16 strings with SMP characters aren't handled as > > such. Each surrogate code unit is considered as a single character > > even when it's followed by a trailing one. Only the code unit > > corresponding to the shift state (modification number) is taken, no > > matter if it's only a surrogate and the other half comes next. > > This is exactly what one should expect. 
The data is an array of > UTF-16 code units rather than a UTF-16 string. Moreover, it was > probably written as UCS-2. I believe it is the application that has > the job of stitching the surrogate pairs together. > > > Is this the reason why a Unicode character cannot be represented > > alternatively as a 32-bit integer on Windows? > > They are, from time to time. There's a Windows message that delivers a > supplementary character rather than a UTF-16 code unit, and fairly obviously > they have to be handled as such when performing font lookups. I've a > suspicion that this message hit an interoperability problem. A program > that can handle pairs of surrogates but predates the message will not > work with the more recent message. Therefore using the message type is > deferred until applications can handle it. Therefore applications don't > need to handle it, and don't. Therefore the message type doesn't get > used. > > > Being UTF-16, the OS > > could handle a complete surrogate pair in one single 32-bit integer. > > Couldn't this be performed at driver level by modifying a program and > > updating this when the driver is installed? > > You're really talking about a parallel set of routines. I suspect the > answer is that Microsoft don't want to work on extending a primitive > keyboarding system when TSF is available. > > You want to use dead keys. Why? Is it not that they are the only > mechanism you have experience of? Yes, along with the allocation table and the ligatures (and modifier key mapping). Dead keys are the only way I found in the driver source to easily input precomposed characters and to work out a Compose functionality. Marc Durdin told us that for most languages, dead keys are not the best way for input. However, we're accustomed to them. About Compose, I found out that preceding diacritics are the only way to efficiently input multiply diacriticized precomposed letters.
When we use combining diacritics, the problem is where to place all the diacritics on a backwards compatible layout. The Compose key idea is to use punctuation keys to input diacritics. Basically we need to hit Compose once only, while generating combining marks out of punctuation needs at least one differentiating keystroke for each one. Given the limited number of keys, we can scarcely have more than one special dead key like Compose in the Base shift state. And as diacritical marks are so numerous that all keyboard punctuation together is not sufficient, we need sequences of punctuation for a number of less common diacritics. This brings the need for a triggering keystroke at the end. Most characters are therefore best input when diacritics come before the triggering letter. But that's my experience only; I wonder how it works on TSF. > > Better systems can be built, in which one sees what one is doing. I read that on Mac OS X, dead-key input, and the Compose functionality built from it, are accompanied by visual feedback, which shows what characters have already been typed. > Is it not much better to type 'e' and then a circumflex, and see the 'e' > and then the 'e' with a circumflex? Yes; in fact the precomposed characters are legacy characters from the beginning of Unicode on. The most up-to-date input of diacriticized characters is with use of combining diacritical marks. This directly produces the string that is generated by the canonical decomposition algorithms. However, on the internet, AFAIK, precomposed characters must be used for a web page to pass W3C validation. > Dead keys are an imitation of a limitation of typewriter technology. > If I was typing cuneiform, I'd much rather type 'bi<COMMIT>' and see > the growing sequence 'b', 'bi', '<CUNEIFORM SIGN BI>' as I typed. > (What you have for a <COMMIT> key is your choice.) TSF lets one do this. > A simple extension of the keyboard definition DLLs generated by MSKLC > does not.
What you should be pressing for is a usable tutorial on how > to do this in TSF. Agreed. I'll look for one. Marc does it all in TSF. But recently he shared how hard it was at the beginning and over 15 years. Now he's got it running, and when we need TSF, let's consider using his software. > > > If yes, we must modify the interface so that keyboard driver DLLs are > > really read in UTF-16. And/or we must find another compiler. > > > > Must the Windows driver be compiled by a Microsoft compiler? > > The compiler is not the issue. The point is that the 16-bit code > exists, and programs that use the 16-bit API exist. Language upgrades > may make supplementary characters easier to use in programs, but that > is all. They don't change existing binary interfaces. Indeed. And if it made sense to use compilers other than those shipping with the WDK, Max would have told us in this thread. So best practice is to stick with the original development environment. Or to use TSF. Thanks, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Thu Aug 13 10:13:03 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Thu, 13 Aug 2015 17:13:03 +0200 (CEST) Subject: Implementing SMP on a UTF-16 OS In-Reply-To: References: <2079848060.15640.1439235999068.JavaMail.www@wwinf1d19> <20150810212503.58cc24d0@JRWUBU2> <1761086697.16696.1439239991108.JavaMail.www@wwinf1d19> Message-ID: <1925503470.16339.1439478783459.JavaMail.www@wwinf1c10> > On 12 Aug 2015 at 18:55, Peter Constable wrote: > > > > "In Visual Studio 2015, the C++ compiler and standard library have been updated with enhanced support for C++11 and initial support for certain C++14 features. They also include preliminary support for certain features expected to be in the C++17 standard." Thanks; I've just downloaded the WDK 10. Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed...
URL: From charupdate at orange.fr Thu Aug 13 10:23:59 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Thu, 13 Aug 2015 17:23:59 +0200 (CEST) Subject: Implementing SMP on a UTF-16 OS In-Reply-To: References: <2079848060.15640.1439235999068.JavaMail.www@wwinf1d19> <20150810212503.58cc24d0@JRWUBU2> <1761086697.16696.1439239991108.JavaMail.www@wwinf1d19> Message-ID: <1667485856.16669.1439479439796.JavaMail.www@wwinf1c10> On 13 Aug 2015 at 09:05, Max Truxa wrote: > Anyhow, Marcel could certainly declare the mapping in a .cpp (using > extern "C" to ensure interoperability with C code) but that wouldn't > change that surrogate pairs seem to be unsupported for keyboard > drivers. (Like I said I have no experience writing keyboard drivers so > I can't confirm this.) Not only are surrogate pairs unsupported, even the possibility of having them in one DEADTRANS entry seems to be definitely blocked: http://www.siao2.com/2004/12/17/323257.aspx I would like to define a DEADTRANSEXT function delivering two code units instead of one; but it seems to me that I'm dreaming. There must be an API behind it that wouldn't recognize this. I hope I'm wrong. Thanks for the information. Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenwhistler at att.net Thu Aug 13 10:47:45 2015 From: kenwhistler at att.net (Ken Whistler) Date: Thu, 13 Aug 2015 08:47:45 -0700 Subject: Update on flag tags (PRI #299)? In-Reply-To: <20150813075840.665a7a7059d7ee80bb4d670165c8327d.bbcb98170c.wbe@email03.secureserver.net> References: <20150813075840.665a7a7059d7ee80bb4d670165c8327d.bbcb98170c.wbe@email03.secureserver.net> Message-ID: <55CCBC21.8090102@att.net> Doug, On 8/13/2015 7:58 AM, Doug Ewell wrote: > The recently posted minutes from UTC #144 > include the following: > >> B.11.1.1.3 PRI 299 feedback and mailing list discussion [Edberg, >> L2/15-210] >> >> Discussion. UTC took no action at this time.
> and: > >> [144-A93] Action Item for Rick McGowan: Close PRI #299, saying: All >> feedback has been considered and will be part of the deliberations for >> possible future extension mechanisms. > This sounds like the entire flag-tag proposal described by PRI #299 has > been put on indefinite hold. Is that accurate? > > No, it is not. No recorded actions were taken by the UTC, but what is missing from the recorded minutes is the hours of discussion that took place both during plenary and during the various ad hoc meetings held during lunch hours. The ad hoc group did not file a written report, but the upshot is basically that the Emoji SC has general direction to take all the feedback and discussion and work up a more detailed proposal that addresses all of the issues involved. At some point that will appear as a new proposal for further discussion and decision. So stay tuned. --Ken From doug at ewellic.org Thu Aug 13 15:14:56 2015 From: doug at ewellic.org (Doug Ewell) Date: Thu, 13 Aug 2015 13:14:56 -0700 Subject: Update on flag tags (PRI #299)? Message-ID: <20150813131456.665a7a7059d7ee80bb4d670165c8327d.20f4c7c6f6.wbe@email03.secureserver.net> Ken Whistler wrote: > but the upshot is basically that the Emoji SC > has general direction to take all the feedback and discussion and > work up a more detailed proposal that addresses all of the issues > involved. At some point that will appear as a new proposal for > further discussion and decision. So stay tuned. Thanks. On Wed, 01 Jul 2015 16:20:08 +0000, Noah Slater wrote: >>? > > Can someone help me understand what this means for my rainbow flag > proposal? I can't speak for Noah, nor for others who might want to propose a non-region, non-subdivision flag emoji, but it might be helpful if the Emoji SC can at least say whether that type of flag is expected to be within the scope of their more detailed proposal.
That might help Noah and others decide whether they need to invest the effort to write up a proposal for a unitary character. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From charupdate at orange.fr Fri Aug 14 06:14:55 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Fri, 14 Aug 2015 13:14:55 +0200 (CEST) Subject: Implementing SMP on a UTF-16 OS Message-ID: <1713579761.8528.1439550895738.JavaMail.www@wwinf1j10> As far as it remained pressing, the issue is now resolved to some extent. Only five high surrogates are used for the 2,413 SMP characters that would most probably be wished to be available on a universal Latin layout, so that five key positions (for example at Shift+Kana) are enough to ensure input efficiency along with streamlined Compose sequences for the low surrogates. This results from examining the NamesList in a spreadsheet, with surrogates generated by Excel formulas. The five leading surrogates in question are: U+D800 (12 Roman symbols, U+10190 sqq.); U+D835 (996 mathematical letters, U+1D400 sqq.); U+D83C (421 mathematical letters, symbols, and emoji, U+1F100 sqq.); U+D83D (821 emoticons and stars, U+1F400 sqq.); U+D83E (163 arrows, U+1F800 sqq.). However, this workaround is far from optimal and requires the user to learn, in addition to the Compose sequences, which leading surrogate must be typed first. For example, U+1F16A RAISED MC SIGN and U+1F16B RAISED MD SIGN (which, being for use in Canada, are supposed to be on every universal Latin layout of any locale) should be input with Compose, m, c, and Compose, m, d, respectively. Now the user must type Shift+Kana+S, Compose, m, c, or Shift+Kana+S, Compose, m, d. It's still better than not having them at all. (Depending on the locale, one might wish to map them to (Shift+) Ctrl+Alt+C and Ctrl+Alt+D.) I'm still hoping that there will be a means to make DEADTRANS deliver two code units alternatively, or to define and use a DEADTRANSEXT function.
Best regards, Marcel Schneider From gwalla at gmail.com Fri Aug 14 13:31:28 2015 From: gwalla at gmail.com (Garth Wallace) Date: Fri, 14 Aug 2015 11:31:28 -0700 Subject: Chess symbol glyphs in code charts Message-ID: Can anyone tell me what font is used for the chess symbols in the code chart for the Miscellaneous Symbols block? It looks a lot like Chess Merida but I can't be certain. From kenwhistler at att.net Fri Aug 14 13:50:50 2015 From: kenwhistler at att.net (Ken Whistler) Date: Fri, 14 Aug 2015 11:50:50 -0700 Subject: Chess symbol glyphs in code charts In-Reply-To: References: Message-ID: <55CE388A.3030000@att.net> Garth, The glyphs for the chess symbols in the 26XX block date from Unicode 3.0. Most of the symbols redesigned for the Unicode 3.0 charts were done by John M. Fiscella. (See the font acknowledgements on p. iv of Unicode 3.0.) I do not know which predecessor designs Fiscella might ultimately have based his designs on. The *actual* font used in the chart production is some in house chart font, possibly tweaked over the years for various specific glyphs, although it doesn't appear to me on first inspection that any of the chess symbol glyphs per se have had any workover since the Unicode 3.0 publication. The chart fonts are in house, used with special licenses specific to Unicode chart production, and with all sorts of chart-specific quirks. So even if I did attempt to track down specifically which font was involved for the current Unicode 26XX block for the 2654..265F range of glyphs, knowing that wouldn't actually help much for your question, I think. --Ken On 8/14/2015 11:31 AM, Garth Wallace wrote: > Can anyone tell me what font is used for the chess symbols in the code > chart for the Miscellaneous Symbols block? It looks a lot like Chess > Merida but I can't be certain. 
> From asmus-inc at ix.netcom.com Fri Aug 14 14:31:10 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Fri, 14 Aug 2015 12:31:10 -0700 Subject: Chess symbol glyphs in code charts In-Reply-To: <55CE388A.3030000@att.net> References: <55CE388A.3030000@att.net> Message-ID: <55CE41FE.5060503@ix.netcom.com> An HTML attachment was scrubbed... URL: From gwalla at gmail.com Fri Aug 14 15:26:45 2015 From: gwalla at gmail.com (Garth Wallace) Date: Fri, 14 Aug 2015 13:26:45 -0700 Subject: Chess symbol glyphs in code charts In-Reply-To: <55CE388A.3030000@att.net> References: <55CE388A.3030000@att.net> Message-ID: Would it be acceptable if I extracted the font from the code chart PDF and used it as the basis for one in a proposal I'm working on? The proposal covers rotated and half-black half-white chess symbols, which should match the shapes of the existing ones, and compound symbols, which should harmonize. On Fri, Aug 14, 2015 at 11:50 AM, Ken Whistler wrote: > Garth, > > The glyphs for the chess symbols in the 26XX block date from > Unicode 3.0. Most of the symbols redesigned for the Unicode 3.0 > charts were done by John M. Fiscella. (See the font acknowledgements > on p. iv of Unicode 3.0.) I do not know which predecessor designs > Fiscella might ultimately have based his designs on. > > The *actual* font used in the chart production is some in house > chart font, possibly tweaked over the years for various specific > glyphs, although it doesn't appear to me on first inspection that > any of the chess symbol glyphs per se have had any workover since the > Unicode 3.0 publication. The chart fonts are in house, used with > special licenses specific to Unicode chart production, and with all > sorts of chart-specific quirks. So even if I did attempt to track down > specifically which font was involved for the current Unicode 26XX > block for the 2654..265F range of glyphs, knowing that wouldn't > actually help much for your question, I think. 
> > --Ken > > > On 8/14/2015 11:31 AM, Garth Wallace wrote: >> >> Can anyone tell me what font is used for the chess symbols in the code >> chart for the Miscellaneous Symbols block? It looks a lot like Chess >> Merida but I can't be certain. >> > From haberg-1 at telia.com Fri Aug 14 17:46:38 2015 From: haberg-1 at telia.com (Hans Åberg) Date: Sat, 15 Aug 2015 00:46:38 +0200 Subject: Chess symbol glyphs in code charts In-Reply-To: References: Message-ID: <97FEE862-58DD-405B-8D57-AF384928CFDB@telia.com> > On 14 Aug 2015, at 20:31, Garth Wallace wrote: > > Can anyone tell me what font is used for the chess symbols in the code > chart for the Miscellaneous Symbols block? It looks a lot like Chess > Merida but I can't be certain. They are quite close to Apple Symbols, but not exactly the same. From asmus-inc at ix.netcom.com Fri Aug 14 19:43:57 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Fri, 14 Aug 2015 17:43:57 -0700 Subject: Chess symbol glyphs in code charts In-Reply-To: References: <55CE388A.3030000@att.net> Message-ID: <55CE8B4D.2000500@ix.netcom.com> An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Sun Aug 16 05:20:24 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 16 Aug 2015 11:20:24 +0100 Subject: Standardised Variation Sequences with Toggles Message-ID: <20150816112024.1760f1a6@JRWUBU2> The view of the Unicode Technical Committee appears to be that the Unicode Character Database (UCD) takes priority over the core text of the Unicode Standard in case of conflict. (Please advise if I have misunderstood; I only have the core text and samples of past behaviour to go on, neither of which appears to be binding.) I am worried that this view may come to cause a redefinition of sequences in which the variation selector is intended to toggle between what are normally contextually determined forms. The clearest example is 'phags-pa letter reversed shaping small a'.
Phags-pa is a 'cursive' script, and this letter is dual-joining. From just the text of StandardizedVariants.txt and the text and pictures of StandardizedVariants.html (the latter are in the process of migrating to the code charts, which will replace the HTML file in Unicode 9.0.0), one could easily imagine that the usual forms of <A856> and <A856, FE00> in authentic continuous text were different. In fact, *careful* reading of the core text shows that the commonest forms of these two sequences in authentic text are identical! The paradox arises because U+A856 PHAGS-PA LETTER SMALL A and several other characters may be mirrored about the reading axis after certain letters or flipped letters, and to avoid complications, the rule is that by default they are mirrored in these extremely rare environments. I believe this mirroring is what is meant by the word 'shaping' in the description of the variant; it is not a reflection of the 'cursive' nature of the script. U+FE00 toggles the mirroring state, and this is what is meant by the word 'reversed', not that the letter is the other way round to the form in the code chart. Unlike the other contextually mirrored characters, it so happens that, more often than not, U+A856 is not actually mirrored in the authentic extant text where the Unicode rules call for mirroring. I believe the Phags-pa code chart should have a normative statement that U+FE00 is acting as a toggle, and refer back to the core text. Now Phags-pa is a relatively clean case - all standardised variants in the block have the same behaviour, so a single sentence in the block's code chart might suffice. However, I do not believe this is always the case. One possibility would be to change the text from ~ A856 FE00 phags-pa letter reversed shaping small a to ~ A856 FE00 phags-pa letter reversed shaping small a ? Toggles between <A856> and <A856, FE00>; see core text for contextual shaping. where text in '<...>' is rendered as a string, not echoed as ASCII. However, that reads clumsily.
Can people suggest improvements? Similar text would be needed for StandardizedVariants.txt in the UCD. The relevant line currently reads "A856 FE00; phags-pa letter reversed shaping small a; # PHAGS-PA LETTER SMALL A" Obviously this potential problem needs to be formally reported, but I would first like to see other people's views. There are other cases where variation selectors were intended as toggles, but the ones I know of are not so clearly documented. Richard. From alexweiner at alexweiner.com Sun Aug 16 09:35:17 2015 From: alexweiner at alexweiner.com (alexweiner at alexweiner.com) Date: Sun, 16 Aug 2015 07:35:17 -0700 Subject: APL Under-bar Characters Message-ID: <20150816073517.e74b0ce91403bfe413f98785c6a226af.37509199d0.wbe@email06.secureserver.net> An HTML attachment was scrubbed... URL: From khaledhosny at eglug.org Sun Aug 16 10:17:06 2015 From: khaledhosny at eglug.org (Khaled Hosny) Date: Sun, 16 Aug 2015 17:17:06 +0200 Subject: APL Under-bar Characters In-Reply-To: <20150816073517.e74b0ce91403bfe413f98785c6a226af.37509199d0.wbe@email06.secureserver.net> References: <20150816073517.e74b0ce91403bfe413f98785c6a226af.37509199d0.wbe@email06.secureserver.net> Message-ID: <20150816151706.GA2553@khaled-laptop> On Sun, Aug 16, 2015 at 07:35:17AM -0700, alexweiner at alexweiner.com wrote: > Hello Unicode Mailing List, > > There is significant discussion about the problems of adding capital letters > with individual under-bars in this mailing list for GNU APL. > > http://lists.gnu.org/archive/html/bug-apl/2015-08/msg00050.html > > Pretty much it adds up to the following problem: > > The string length functionality would view an 'A' code point combined with an > '_' code point as an item that has two elements, while something that looks > like 'A' should be atomic, and return a length of one. I think what you need is better 'character' counting [1], rather than new precomposed characters. Regards, Khaled 1.
http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries From alexweiner at alexweiner.com Sun Aug 16 11:31:25 2015 From: alexweiner at alexweiner.com (alexweiner at alexweiner.com) Date: Sun, 16 Aug 2015 09:31:25 -0700 Subject: APL Under-bar Characters Message-ID: <20150816093125.e74b0ce91403bfe413f98785c6a226af.654c59af0e.wbe@email06.secureserver.net> An HTML attachment was scrubbed... URL: From khaledhosny at eglug.org Sun Aug 16 11:53:52 2015 From: khaledhosny at eglug.org (Khaled Hosny) Date: Sun, 16 Aug 2015 18:53:52 +0200 Subject: APL Under-bar Characters In-Reply-To: <20150816093125.e74b0ce91403bfe413f98785c6a226af.654c59af0e.wbe@email06.secureserver.net> References: <20150816093125.e74b0ce91403bfe413f98785c6a226af.654c59af0e.wbe@email06.secureserver.net> Message-ID: <20150816165351.GA10179@khaled-laptop> On Sun, Aug 16, 2015 at 09:31:25AM -0700, alexweiner at alexweiner.com wrote: > Khaled, > Thank you for the link. The normalization methods were already discussed, > specifically here: > > http://lists.gnu.org/archive/html/bug-apl/2015-08/msg00047.html Grapheme cluster boundary detection is different from normalisation; please read the link I provided. > Where the problem of "how big" 'ä' is, is discussed. The answer being that this is > one symbol, because the Unicode Consortium decided that it is also its own > standalone character. From the thread: > > I'll give you an example. What would you want ⍴,'ä' to be? > > Right now, that could return either 1 or 2 depending on whether the 'ä' was using > the precomposed character (U+00E4) or the combining mark (U+0061, U+0308). > Visually, these are identical, and generally you'd expect them to compare > equal. If you are counting grapheme clusters, then the answer is one in both cases. > In Unicode, the comparison of equivalent (but with different characters) > strings is done by performing a normalisation step prior to comparison. There > are 4 different types of normalisation, with different behaviour.
Quoting from the link I provided: A key feature of default Unicode grapheme clusters (both legacy and extended) is that they remain unchanged across all canonically equivalent forms of the underlying text. Thus the boundaries remain unchanged whether the text is in NFC or NFD. Using a grapheme cluster as the fundamental unit of matching thus provides a very clear and easily explained basis for canonically equivalent matching. This is important for applications from searching to regular expressions. See also: http://unicode.org/faq/char_combmark.html#7 > Now, the 'ä' character has a precomposed form in Unicode, and if you couple that > with the NFC normalisation form, you'd get the above expression to return 1. > > > So I'm not sure why the allowance was made for 'ä' as well as certain other > characters, but not for other things (under-bar characters) that face > similar representation issues. It was encoded for compatibility with pre-existing character sets AFAIK. Regards, Khaled > > > -------- Original Message -------- > Subject: Re: APL Under-bar Characters > From: Khaled Hosny > Date: Sun, August 16, 2015 8:17 am > To: alexweiner at alexweiner.com > Cc: unicode at unicode.org > > On Sun, Aug 16, 2015 at 07:35:17AM -0700, alexweiner at alexweiner.com wrote: > > Hello Unicode Mailing List, > > > > There is significant discussion about the problems of adding capital > letters > > with individual under-bars in this mailing list for GNU APL. > > > > http://lists.gnu.org/archive/html/bug-apl/2015-08/msg00050.html > > > > Pretty much it adds up to the following problem: > > > > The string length functionality would view an 'A' code point combined > with an > > '_' code point as an item that has two elements, while something that > looks > > like 'A' should be atomic, and return a length of one. > > I think what you need is better 'character' counting [1], rather than > new precomposed characters. > > Regards, > Khaled > > 1.
http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries > From richard.wordingham at ntlworld.com Sun Aug 16 13:27:13 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 16 Aug 2015 19:27:13 +0100 Subject: APL Under-bar Characters In-Reply-To: <20150816165351.GA10179@khaled-laptop> References: <20150816093125.e74b0ce91403bfe413f98785c6a226af.654c59af0e.wbe@email06.secureserver.net> <20150816165351.GA10179@khaled-laptop> Message-ID: <20150816192713.56ab0841@JRWUBU2> On Sun, 16 Aug 2015 18:53:52 +0200 Khaled Hosny wrote: > On Sun, Aug 16, 2015 at 09:31:25AM -0700, alexweiner at alexweiner.com > wrote: > > Now, the 'ä' character has a precomposed form in Unicode, and if you > > couple that with the NFC normalisation form, you'd get the above > > expression to return 1. > > So I'm not sure why the allowance was made for 'ä' as well as certain > > other characters, but not for other things (under-bar > > characters) that face similar representation issues. > It was encoded for compatibility with pre-existing character sets AFAIK. Note that compatibility means allowing habits of treating the precomposed characters as single characters to continue. These habits allowed a simple transition, but now cause confusion. Most rules work better in NFD than NFC. For string lengths in NFC, you immediately lose the rule len(a + b) = len(a) + len(b). For NFC, you don't even have len(a + b) <= len(a) + len(b). However, do note that for the corresponding 'string' algebra, the mathematical concept of a string no longer works - and this applies to both NFC and NFD. Instead, you have to allow for pairs of characters commuting, and so you get the concept of a 'trace'. If all combinations of base character and non-spacing marks were encoded, there'd be infinitely many.
Polytonic Greek has 36 *precomposed* combinations of base character and 3 combining marks, and some languages frequently use base characters with 4 combining marks; unexceptional words with 5 combining marks are less frequent. Richard. From kenwhistler at att.net Sun Aug 16 13:37:50 2015 From: kenwhistler at att.net (Ken Whistler) Date: Sun, 16 Aug 2015 11:37:50 -0700 Subject: APL Under-bar Characters In-Reply-To: <20150816165351.GA10179@khaled-laptop> References: <20150816093125.e74b0ce91403bfe413f98785c6a226af.654c59af0e.wbe@email06.secureserver.net> <20150816165351.GA10179@khaled-laptop> Message-ID: <55D0D87E.9020709@att.net> It seems to me that APL has some very deeply embedded (and ancient) assumptions about fixed-width 8-bit characters, dating from ASCII days. It only got as far as it did with the current assumptions because people hacked up 8-bit fonts for all the special characters for the APL syntax, and because IBM implemented those as dedicated special character sets with matching specialized APL keyboards. A built-in function like ⍴, which returns the *size* of data, goes structurally hand-in-hand with the definition of vectors and arrays. There seem to be very deep assumptions in the APL data model that strings are simply an array of *fixed-size* data elements, aka "characters". So requiring ⍴,'ä' and ⍴,'_A_' to "just work" is the moral equivalent of asking the C library call strlen("ä") or strlen("_A_") to "just work", regardless of the representation of the data in the string. It is a nonsensical requirement if applied to general Unicode strings outside the context of a very carefully restricted subset designed to ensure a one-to-one relationship between "character" and "array element".
A Unicode-based APL implementation can (presumably) just up the size of its "character" to 16-bits internally (actually a UTF-16 code *unit*) and carefully restrict itself to the subset of ASCII & Latin-1, the APL symbols and a few other operators needed to fill out the set. Looking at the fonts people seem to actually be using in various implementations, e.g.: http://aplwiki.com/AplCharacters the general choice seems to be to use both uppercase and lowercase Latin letters, and forgo the old convention of underlined uppercase Latin letters. That seems a small adjustment to make to not stay stuck in the 70's, frankly. I can understand Alex's request that Unicode then effectively "solve the problem" by providing a fixed-width 16-bit entity for "_A_" that could then just be added to the restricted subset in the APL implementations. But that isn't going to happen -- because of the normalization stability guarantees for the Unicode Standard. And in any case, if users of APL need something more sophisticated for actual string handling than strictly limited subsets based on the assumption that character=element_of_fixed_data_size_array, then rho and a limited subset aren't going to handle it anyway. At that point, another layer of abstraction would have to be built on top of the basic array and vector processing. And then Khaled's points about character=grapheme_cluster become relevant. --Ken On 8/16/2015 9:53 AM, Khaled Hosny wrote: > On Sun, Aug 16, 2015 at 09:31:25AM -0700, alexweiner at alexweiner.com wrote: > > > So I'm not sure why the allowance was made for ? as well as other certain > characters, but not for other things (under-bar characters) that face > similar representation issues. > It was encoded for compatibility of pre-existing character sets AFAIK. > > Regards, > Khaled > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From kenwhistler at att.net Sun Aug 16 14:08:34 2015 From: kenwhistler at att.net (Ken Whistler) Date: Sun, 16 Aug 2015 12:08:34 -0700 Subject: Standardised Variation Sequences with Toggles In-Reply-To: <20150816112024.1760f1a6@JRWUBU2> References: <20150816112024.1760f1a6@JRWUBU2> Message-ID: <55D0DFB2.7000806@att.net> On 8/16/2015 3:20 AM, Richard Wordingham wrote: > The view of the Unicode Technical committee appears to be that the > Unicode Character Database (UCD) takes priority over the core text of > the Unicode Standard in case of conflict. (Please advise if I have > misunderstood; I only have the core text and samples of past behaviour > to go on, neither of which appears to be binding.) Richard, That means that if a data file states, e.g., 200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;; thus *defining* the General_Category of ZWSP to be Cf, but that we find that due to some oversight in editing (of what is now a very large core specification, plus over a dozen annexes), somebody goofed up and happened to refer to ZWSP as gc=Zs, the data file *wins*. Some editorial oversight or a typo in the text of the core specification cannot be taken as legalistically somehow trumping the data file, just because somebody finds it "written in the standard". Capiche? This should not, IMO, be taken as occasion for general worrying about the status of data files and the core specification. (In most cases, the core specification is simply underspecified because the research, writing and editing for it is under-resourced.) > One possibility would be to change the text from > > ~ A856 FE00 phags-pa letter reversed shaping small a > > to > > ~ A856 FE00 phags-pa letter reversed shaping small a > > ? Toggles between and ; see core text for > contextual shaping. > > where text in '<...>' is rendered as a string, not echoed as ASCII. > > However, that reads clumsily. Can people suggest improvements? 
Yes, a notice at the top: @+ For details about the implementation of variation sequences in Phags-pa, please refer to the Phags-pa section of the core specification. --Ken From alexweiner at alexweiner.com Sun Aug 16 14:36:08 2015 From: alexweiner at alexweiner.com (alexweiner at alexweiner.com) Date: Sun, 16 Aug 2015 12:36:08 -0700 Subject: APL Under-bar Characters Message-ID: <20150816123608.e74b0ce91403bfe413f98785c6a226af.39c870b9bd.wbe@email06.secureserver.net> An HTML attachment was scrubbed... URL: From alexweiner at alexweiner.com Sun Aug 16 14:41:58 2015 From: alexweiner at alexweiner.com (alexweiner at alexweiner.com) Date: Sun, 16 Aug 2015 12:41:58 -0700 Subject: APL Under-bar Characters Message-ID: <20150816124158.e74b0ce91403bfe413f98785c6a226af.6dce0e5645.wbe@email06.secureserver.net> An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Sun Aug 16 17:50:08 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 16 Aug 2015 23:50:08 +0100 Subject: Standardised Variation Sequences with Toggles In-Reply-To: <55D0DFB2.7000806@att.net> References: <20150816112024.1760f1a6@JRWUBU2> <55D0DFB2.7000806@att.net> Message-ID: <20150816235008.43bbd56d@JRWUBU2> On Sun, 16 Aug 2015 12:08:34 -0700 Ken Whistler wrote: > Some editorial oversight or a typo in the text of the core > specification cannot > be taken as legalistically somehow trumping the data file, just > because somebody finds it "written in the standard". > > Capiche? No. What about oversights and typos in the UCD? Indeed, two variation sequences were removed because it was found that their bases were decomposable, which contradicts the core specification. In this case, the UCD did not trump the rules for variation sequences. When there is a contradiction, it needs to be investigated and resolved, with awareness that different people may be relying on different parts of the specification. 
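As a quick sanity check, the data-file line Ken quotes can be parsed and compared against a UCD-derived library; a minimal sketch in Python (the stdlib unicodedata module's tables are generated from the UCD, so they report what the data file defines):

```python
import unicodedata

# The UnicodeData.txt line quoted above, in its semicolon-delimited layout:
# code;Name;General_Category;Canonical_Combining_Class;Bidi_Class;...
line = "200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;"
fields = line.split(";")
cp, name, gc = int(fields[0], 16), fields[1], fields[2]

# Python's tables are generated from the UCD, so they agree with the file:
assert unicodedata.name(chr(cp)) == name
assert unicodedata.category(chr(cp)) == gc  # 'Cf' -- the file, not the prose, wins
print(f"U+{cp:04X} {name} gc={gc}")
```

So a gc=Zs claim for ZWSP anywhere in the prose is immediately falsifiable against the data file.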
> (In most cases, the core specification is simply underspecified because the
> research, writing and editing for it is under-resourced.)

That is also true of much of the UCD. I suspect that much of it relies on intelligent guesswork. Some properties may simply be ignored because nothing readily testable uses them (e.g. line and word-break properties relevant for scriptio continua writing systems), and others appear to be arbitrary. (Is the allocation of digits to L, AN or EN actually anything but an encoding decision?) Fortunately, most errors in the UCD can be corrected when the settings don't work; casing pairs, names, decompositions and canonical combining classes are the main problems. I believe problems arising from codepoint assignments could be fixed by creating singleton decompositions, e.g. to change mere numbers into decimal digits.

As an example of an effectively ignored line break property, I offer the line-break property of the Thai repetition character U+0E46 THAI CHARACTER MAIYAMOK. It is currently of general category Lm, and has the line-break property SA 'South-East Asian line-breaking'. This means that the Unicode line-breaking algorithm calls upon a non-standard algorithm to assign each instance of the character a line-break property. Now I believe that it should have line break property EX. I can find a grammatical description that says it should be separated from the preceding word by a space, and I have found no example in books of U+0E46 starting a line. Giving it line break property EX would prevent a line break between the space and the repetition mark. However, there is little point in trying to have it assigned line break property EX, for the Unicode assignment is irrefutable. My argument has to be addressed to the specifications of the algorithms doing Thai line-breaking.

A historical example of errors in the UCD is U+200B ZERO WIDTH SPACE (ZWSP).
Its primary use is as a word separator in scripts that don't have visible word separators, though I'm currently finding it useful in Word 2010 to split up excessively long path names without visible hyphens being added. When its general category was changed from Zs to Cf, its Unicode word-break property became 'Format'; it no longer had any effect on word-breaking. Its line-breaking behaviour was preserved, so the control of text layout was unaffected. For SE Asian languages, the change had no direct effect, for their word-breaking rules are largely outside the scope of the Unicode text segmentation algorithms.

All went well until someone decided that the TUS text describing it as a word-breaker was an 'editorial oversight'. A corrigendum removed this word-breaking behaviour, and SE Asian word processors started to misbehave as software maintainers caught up with the corrigendum. For details see an email from Javier Solá: http://unicode.org/mail-arch/unicode-ml/y2009-m01/0604.html . The referenced proposal gives the text of the erratum, dated May 2008. Presumably corrigenda did not then have numbers, for there is no trace of its former existence in http://www.unicode.org/versions/corrigenda.html .

A similar process is now in progress for U+2060 WORD JOINER (WJ), which is the opposite of ZWSP. It is intended that WJ will cease to indicate the absence of word boundaries. In scripts that have visible word boundaries, the absence of an effect on word-breaking is of no consequence for sequences of letters, for the mere juxtaposition of letters prevents a word-break between them. By contrast, SE Asian word-boundary detectors largely rely on recognising words, and they can make mistakes, or be given an impossible task. The English analogue is detecting the word boundary in 'humanevents' - is the last word 'events' or 'vents'? A notable challenge is to persuade a Thai spell-checker that a transliteration of 'Hemingway' is actually a single word.
Delimiting the boundaries does not work - one has to join the fragments into which the automatic word-breaker splits it. The language proposed for ISO 10646, in http://www.unicode.org/L2/L2015/15211-word-joiner.pdf , does not actually state that it does not prevent a word break, though stronger text denying that it suppresses word breaks has been proposed for Unicode. By contrast, U+202F NARROW NO-BREAK SPACE (NNBSP) looks set to regain its originally intended purpose, that of a narrow space that does not break words. The script for which it was intended, Mongolian, will be able to use the Unicode word-boundary detection algorithm once NNBSP is allowed as part of a word. However, the fact remains that NNBSP should never have been allowed to break words. The core text has long stated that NNBSP does not break Mongolian words. There remains, however, a possibility that European usage of NNBSP will prevent it from recovering its intended functionality. > Yes, a notice at the top: > > @+ For details about the implementation of variation sequences in > Phags-pa, please refer to the Phags-pa section of the core > specification. a) This is likely to be ignored by someone who is just looking for the *specification*. I think replacing 'implementation' by 'rendering' would be better. I would be inclined to add, 'These sequences are more complicated than they appear at first reading'. Otherwise, someone will just add them to the character to glyph conversion section of a font and think, "Job done". b) This won't work where the effort has not been expended on the core text. As to StandardizedVariants.txt, Section 23.4 needs to refer to the Phags-pa section in the core text. As that file points to the Section 23.4 of TUS, this should then at least suggest that the descriptions in the file do not override the core specification. Richard. 
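The practical consequence of the Zs-to-Cf change that Richard describes is easy to observe in UCD-derived libraries; a minimal sketch in Python (stdlib only):

```python
import re
import unicodedata

zwsp = "\u200b"  # U+200B ZERO WIDTH SPACE

# Its General_Category is Cf (format), not Zs (space separator) ...
assert unicodedata.category(zwsp) == "Cf"

# ... so generic whitespace tests no longer treat it as a space:
assert not zwsp.isspace()              # str.isspace() follows the UCD
assert re.search(r"\s", zwsp) is None  # \s does not match it either

# Any word segmentation that honours ZWSP must therefore special-case it,
# e.g. a naive splitter for ZWSP-delimited text:
print("hello\u200bworld".split(zwsp))  # ['hello', 'world']
```

In other words, once ZWSP stopped being a "space" by property, every word-breaker that still wants it to separate words has to treat it explicitly, which is exactly where the SE Asian tailorings come in.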
From richard.wordingham at ntlworld.com Sun Aug 16 18:06:20 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 17 Aug 2015 00:06:20 +0100 Subject: APL Under-bar Characters In-Reply-To: <20150816073517.e74b0ce91403bfe413f98785c6a226af.37509199d0.wbe@email06.secureserver.net> References: <20150816073517.e74b0ce91403bfe413f98785c6a226af.37509199d0.wbe@email06.secureserver.net> Message-ID: <20150817000620.5cb6b869@JRWUBU2> On Sun, 16 Aug 2015 07:35:17 -0700 wrote: > There is significant discussion about the problems of adding capital > letters with individual under-bars in this mailing list for GNU APL. > > http://lists.gnu.org/archive/html/bug-apl/2015-08/msg00050.html > Is there something I could do to make this addition to the Unicode > standard? There is already a section for APL symbols. A possible compromise would be to use the Private Use Area (PUA). If you need single characters, it may be an appropriate solution. It might even be better to use the PUA (codepoints U+E000 to U+F8FF) than to be assigned a block in the Supplementary Multilingual Plane (SMP) U+1xxxx or the 'deprecated' plane U+Exxxx. Richard. From kenwhistler at att.net Sun Aug 16 19:15:26 2015 From: kenwhistler at att.net (Ken Whistler) Date: Sun, 16 Aug 2015 17:15:26 -0700 Subject: APL Under-bar Characters In-Reply-To: <20150816124158.e74b0ce91403bfe413f98785c6a226af.6dce0e5645.wbe@email06.secureserver.net> References: <20150816124158.e74b0ce91403bfe413f98785c6a226af.6dce0e5645.wbe@email06.secureserver.net> Message-ID: <55D1279E.9020008@att.net> Alex, On 8/16/2015 12:41 PM, alexweiner at alexweiner.com wrote: > > As far as I know, APL definitely predates the Unicode consortium. Do > you think that The Consortium possibly overlooked the pre-existing > under-bar character set? > > The answer to that is no. Initially, Unicode 1.0 attempted to punt the entire APL complex functional symbol problem by encoding U+2300 APL COMPOSE OPERATOR. 
The concept was essentially that any of the combined symbols -- the old rack of stuff that people complained about entering with symbol/backspace/symbol keying -- could simply be represented as sequences of existing symbols. Think of 2300 as an early attempt to introduce an APL "script"-specific conjunct-forming virama, a la much-later artificially introduced script-specific joiners. Cf. U+2D7F TIFINAGH CONSONANT JOINER.

But U+2300 APL COMPOSE OPERATOR was an innovation that failed. It was fiercely opposed *by the APL community*, who wanted it out of 10646 and replaced with an explicit list of pre-formed complex functional symbols. Presumably for the same reason we are talking about here now: essentially that each symbol had to work as a "character", and in an APL context that meant fixed width and the same data size as all the other characters.

The removal of Unicode 1.0 U+2300 APL COMPOSE OPERATOR is documented in Unicode 1.1 as of 1993:

http://www.unicode.org/versions/Unicode1.1.0/

(see page 3)

The addition of APL functional symbols is documented in Section 5.4.8, pp. 39-41.

The exact repertoire that ended up encoded in the standard was the result of meetings between some Unicode representatives and some folks from the APL community. The names escape me at the moment, although it might be possible to recover some information eventually. (Documentation regarding Unicode events in late 1991 is sparse these days.) At any rate the agreed-upon additional repertoire is probably that included in:

X3L2/92-035, Unicode Request for Additional Characters in ISO/IEC 10646-1.2.

And the rest of the consequences and processing can be dug out of the ballot history record for the voting on 10646 in 1992.

At any rate, apropos *this* discussion, we agreed that the repertoire would cover all the complex functional symbols, but *not* the letters with underscores. And it is not that they were simply overlooked. How do I know?
Well, first, there were APL specialists involved in coming up with (and promoting) the repertoire that was carried into the 10646 balloting at the time. It isn't as if a bunch of ignorant Unicoders just grabbed one APL book off the shelf and coded up the table, not noticing that some stuff was missing.

Second, the text that is currently in the core specification about this issue, to wit:

" ... All other APL extensions can be encoded by composition of other Unicode characters. For example, the APL symbol a underbar can be represented by U+0061 LATIN SMALL LETTER A + U+0332 COMBINING LOW LINE." (Unicode 7.0, Section 22.7, p. 772)

is *ancient* text. It was first printed on p. 6-83 of Unicode 2.0 in 1996, with exactly the same wording. And the only reason it took until 1996 to appear, instead of 1993, was that the editing of Unicode 2.0 and its code charts was such a massive task at the time.

So the clear intent in *1993* was to represent any APL letter with underbar as a combining character sequence -- as noted. The only problem I see there is that the text in the core spec mistakenly used U+0061 (the lowercase "a") instead of U+0041 (the uppercase "A") for the exemplification.

Third, I can attest that at least some of us at the time -- as early as 1989 -- had printed copies of IBM EBCDIC code page 293 for APL, which had the EBCDIC uppercase Latin letters with underscores (italicized, by the way), together with the regular EBCDIC upper- and lowercase letters. [Dates from 1984.] *And* IBM EBCDIC code page 310 for APL, which dropped all the regular upper- and lowercase letters but added more symbols. *And* IBM PC code page 907 (with the underscored uppercase Latin letters) and PC code page 909 (CP437 hacked up for APL, without the underscored uppercase Latin letters), which was quickly superseded by PC code page 910, which also did not use the uppercase Latin letters with underscores.

So yeah, we knew about these.
Encoding them as combining character sequences instead of as atomic characters was a deliberate decision taken in 1992. And that decision made it through both UTC and international balloting for publication in 1993.

--Ken

-------------- next part -------------- An HTML attachment was scrubbed... URL: From alexweiner at alexweiner.com Sun Aug 16 19:49:28 2015 From: alexweiner at alexweiner.com (alexweiner at alexweiner.com) Date: Sun, 16 Aug 2015 17:49:28 -0700 Subject: APL Under-bar Characters Message-ID: <20150816174928.e74b0ce91403bfe413f98785c6a226af.05d9ddafcb.wbe@email06.secureserver.net> An HTML attachment was scrubbed... URL: From prosfilaes at gmail.com Sun Aug 16 19:59:28 2015 From: prosfilaes at gmail.com (David Starner) Date: Mon, 17 Aug 2015 00:59:28 +0000 Subject: APL Under-bar Characters In-Reply-To: <20150816174928.e74b0ce91403bfe413f98785c6a226af.05d9ddafcb.wbe@email06.secureserver.net> References: <20150816174928.e74b0ce91403bfe413f98785c6a226af.05d9ddafcb.wbe@email06.secureserver.net> Message-ID:

The standard is set here. The Unicode Consortium has declared that it won't encode precomposed characters that can be created from characters in the standard, because that would be destabilizing and potentially introduce security holes in programs depending on Unicode. If you want, we can have a vote on whether or not APL should use characters with underlines, since I was unfairly locked out of that vote by not being born yet.

On Sun, Aug 16, 2015 at 5:52 PM wrote:

> Ken,
> You pose a very strong, and well-worded response. The historical element
> really helps to illuminate what I thought was lost knowledge: "Why are
> there no under-bars". To this I can only ask one thing:
>
> Can we put this to a vote again? To put things in perspective, I was three
> years old at the time of the ballot in 1993 and had much larger issues to
> deal with (comprehending speech, learning to walk, etc.), and was unable to
> participate in this internationally binding vote.
>
> Perhaps feelings about the under-bar characters have changed since then. I
> know that the APL landscape is *very* different than it was in 1993.
>
> I have a copy of one of those IBM books that has the italicized upper-case
> under-bars. If my proposal for a new vote is well received, maybe we should
> include those as well, for completeness sake.
>
> -Alex
>
> -------- Original Message --------
> Subject: Re: APL Under-bar Characters
>
> From: Ken Whistler
> Date: Sun, August 16, 2015 5:15 pm
> To: alexweiner at alexweiner.com
> Cc: unicode at unicode.org
>
> [...]
>
> --Ken

-------------- next part -------------- An HTML attachment was scrubbed... URL: From alexweiner at alexweiner.com Sun Aug 16 20:16:09 2015 From: alexweiner at alexweiner.com (alexweiner at alexweiner.com) Date: Sun, 16 Aug 2015 18:16:09 -0700 Subject: APL Under-bar Characters Message-ID: <20150816181609.e74b0ce91403bfe413f98785c6a226af.61b5a510b6.wbe@email06.secureserver.net> An HTML attachment was scrubbed...
URL: From prosfilaes at gmail.com Sun Aug 16 20:27:17 2015 From: prosfilaes at gmail.com (David Starner) Date: Mon, 17 Aug 2015 01:27:17 +0000 Subject: APL Under-bar Characters In-Reply-To: <20150816181609.e74b0ce91403bfe413f98785c6a226af.61b5a510b6.wbe@email06.secureserver.net> References: <20150816181609.e74b0ce91403bfe413f98785c6a226af.61b5a510b6.wbe@email06.secureserver.net> Message-ID: http://unicode.org/policies/stability_policy.html , in particular, the Normalization Policy. The way the APL A with underscore is encoded is the way we've been saying, and Unicode has promised its users that there's no other way of writing it. The current precedent is that when users ask for things like this is that they are told they can't have them; for example, the Lithuanians were told that the way to encode LATIN CAPITAL LETTER A WITH OGONEK AND ACUTE is U+0104 U+0301, not any other way. They can be listed in http://www.unicode.org/Public/UCD/latest/ucd/NamedSequences.txt so that there can be a unique name to refer to them, but there will not be any new codepoint. On Sun, Aug 16, 2015 at 6:16 PM wrote: > David, > > I don't understand what you mean by saying that the standard is set. By > Ken's account, The Consortium decided to create a policy specifically > regarding this, by vote of APL (and I assume interested Unicode) users > worldwide. The Standard itself is in version eight. Why does a vote seem so > ridiculous, especially in the case of an addition, rather than a > subtraction? > > What is the current precedent for this sort of thing? > > -Alex > > -------- Original Message -------- > Subject: Re: APL Under-bar Characters > > From: David Starner > Date: Sun, August 16, 2015 5:59 pm > To: alexweiner at alexweiner.com, Ken Whistler > Cc: unicode at unicode.org > > The standard is set here. 
The Unicode Consortium has declared that it
> won't encode precomposed characters that can be created from characters in
> the standard, because that would be destabilizing and potentially introduce
> security holes in programs depending on Unicode.
>
> [...]

-------------- next part -------------- An HTML attachment was scrubbed... URL: From alexweiner at alexweiner.com Sun Aug 16 20:57:43 2015 From: alexweiner at alexweiner.com (alexweiner at alexweiner.com) Date: Sun, 16 Aug 2015 18:57:43 -0700 Subject: APL Under-bar Characters Message-ID: <20150816185743.e74b0ce91403bfe413f98785c6a226af.266e3b9a15.wbe@email06.secureserver.net> An HTML attachment was scrubbed... URL: From asmus-inc at ix.netcom.com Sun Aug 16 21:16:17 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Sun, 16 Aug 2015 19:16:17 -0700 Subject: APL Under-bar Characters In-Reply-To: <20150816185743.e74b0ce91403bfe413f98785c6a226af.266e3b9a15.wbe@email06.secureserver.net> References: <20150816185743.e74b0ce91403bfe413f98785c6a226af.266e3b9a15.wbe@email06.secureserver.net> Message-ID: <55D143F1.9030604@ix.netcom.com> An HTML attachment was scrubbed... URL: From kojiishi at gmail.com Mon Aug 17 01:21:44 2015 From: kojiishi at gmail.com (Koji Ishii) Date: Mon, 17 Aug 2015 15:21:44 +0900 Subject: Why doesn't Ideographic (ID) in UAX#14 have half-width katakana?
In-Reply-To: <4A7F11C4-2F5B-4156-844C-C21E543FCFC7@apple.com> References: <553EEE6D.2020004@ga2.so-net.ne.jp> <553EFB2E.3010808@hiroshima-u.ac.jp> <55403267.9060202@att.net> <55438AFB.6020000@att.net> <55439AE4.4020109@hiroshima-u.ac.jp> <5543AE4B.5020904@att.net> <5543D03B.80603@ix.netcom.com> <55467CAF.4080401@ix.netcom.com> <4A7F11C4-2F5B-4156-844C-C21E543FCFC7@apple.com> Message-ID: Hi all, I'm not in sync with publishing schedule, sorry about that, but is it possible to consider this change for Unicode 9.0 time frame? I believe all concerns were cleared in the discussion, but if any were left, I'd be happy to discuss further. And I hope I'm not too late this time? /koji On Tue, May 5, 2015 at 6:19 AM, Peter Edberg wrote: > I have been checking with various groups at Apple. The consensus here is > that we would like to see the linebreak value for halfwidth katakana > changed to ID. > > - Peter E > > > > On May 3, 2015, at 12:53 PM, Asmus Freytag (t) > wrote: > > On 5/3/2015 9:47 AM, Koji Ishii wrote: > > Thank you so much Ken and Asmus for the detailed guides and histories. > This helps me a lot. > > In terms of time frame, I don't insist on specific time frame, Unicode 9 > is fine if that works well for all. > > I'm not sure how much history and postmortem has to be baked into the > section of UAX#14, hope not much because I'm not familiar with how it was > defined so other than what Ken and Asmus kindly provided in this thread. > But from those information, I feel stronger than before that this was > simply an unfortunate oversight. In the document Ken quoted, F and W are > distinguished, but H and N are not. In '90, East Asian versions of Office > and RichEdit were in my radar and all of them handled halfwidth Katakana as > ID for the line breaking purposes. 
That's quite understandable given the > amount of code points to work on, given the priority of halfwidth Katakana, > and given the difference between "what line breaking should be" and UAX#14 as > Ken noted, but writing it up as a document doesn't look like an easy task. > > > Koji, > > kana are special in that they are not shared among languages. From that > perspective, there's nothing wrong with having a "general purpose" > algorithm support the rules of the target language (unless that would add > undue complexity, which isn't a consideration here). > > Based on the data presented informally here in postings, I find your > conclusion (oversight) quite believable. The task would therefore be to > present the same data in a more organized fashion as part of a formal > proposal. Should be doable. > > I think you'd want to focus on a survey of modern practice in > implementations (and if you have data on some of them going back to the > '90s, so much the better). > > From the historical analysis it's clear that there was a desire to create > assignments that didn't introduce random inconsistencies between LB and EAW > properties, but that kind of self-consistency check just makes sure that > all characters of some group defined by the intersection of property > subsets are treated the same (unless there's an overriding reason to > differentiate within). It seems entirely plausible that this process > misfired for the characters in question — more likely so, given that the > earliest drafts of the tables were based on an implementation also being > created by MS around the same time. That makes any difference from other MS > products even more likely to be an oversight. 
> > I do want to help UTC establish a precedent of getting changes like that > endorsed by a representative sample of implementers and key external > standards (where applicable, in this case that would be CSS), to avoid the > chance of creating undue disruption (and to increase the chance that the > resulting modified algorithm is actually usable off-the-shelf, for example > for "default" or "unknown language" type scenarios). > > Hence my insistence that you go out and drum up support. But it looks like > this should be relatively easy, as there seems to be no strong case for > maintaining the status quo, other than that it is the status quo. > > A./ > > > > I agree that implementers and the CSS WG should be involved, but given that IE and > FF have already tailored, and all MS products as well, I guess it should > not be too hard. I'm on the Chrome team now, and the only problem for me to fix > it in Chrome is to justify why Chrome wants to tailor rather than fixing > UAX#14 (and the bug priority...) > > Either Makoto or I can bring it up to the CSS WG to get back to you. > > /koji > > > On Sat, May 2, 2015 at 4:12 AM, Asmus Freytag (t) > wrote: > >> Thank you, Ken, for your dedicated archeological efforts. >> >> I would like to emphasize that, at the time, UAX#14 reflected observed >> behavior, in particular (but not exclusively) for MS products, some of which >> (at the time) used an LB algorithm that effectively matched an untailored >> UAX#14. >> >> However, recently, the W3C has spent considerable effort looking into >> different layout-related algorithms and specifications. If, in that context, >> a consensus approach is developed that would point to a better "default" >> behavior for untailored UAX#14-style line breaking, I would regard that as >> a critical mass of support to allow UTC to consider tinkering with such a >> long-standing set of property assignments. 
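[Editorial aside: the "tailoring versus changing the default" distinction that runs through this thread can be sketched in code. The table below is a toy illustration, not the real UAX #14 property data: it hardcodes only the two classes under discussion, with halfwidth katakana as AL per the 1999 draft, and shows a tailoring as an override layer consulted before the default lookup.]

```python
# Toy sketch of UAX #14-style tailoring (illustrative; not the real data files).
# Default assignments per the 1999 draft discussed below:
#   halfwidth katakana (U+FF66..U+FF9F) -> lb=AL
#   katakana block     (U+30A0..U+30FF) -> lb=ID

def default_line_break_class(ch: str) -> str:
    cp = ord(ch)
    if 0xFF66 <= cp <= 0xFF9F:   # halfwidth katakana
        return "AL"
    if 0x30A0 <= cp <= 0x30FF:   # (fullwidth) katakana
        return "ID"
    return "XX"                  # everything else is out of scope here

# A tailoring is just an override map consulted before the default table,
# here reclassifying halfwidth katakana as ID (the change proposed in the thread).
TAILOR_HALFWIDTH_TO_ID = {cp: "ID" for cp in range(0xFF66, 0xFFA0)}

def tailored_line_break_class(ch: str, tailoring: dict) -> str:
    return tailoring.get(ord(ch), default_line_break_class(ch))

print(default_line_break_class("\uFF76"))                           # untailored: AL
print(tailored_line_break_class("\uFF76", TAILOR_HALFWIDTH_TO_ID))  # tailored: ID
```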
>> >> This would be true, especially, if it can be demonstrated that (other >> than matching legacy behavior) there's no context that would benefit from >> the existing classification. I note that this was something several posters >> implied. >> >> So, if implementers of the legacy behavior are amenable to achieving this >> by tailoring, and if the change augments the number of situations where >> untailored UAX#14-style line breaking can be used, that would be a win that >> might offset the cost of a disruptive change. >> >> We've heard arguments why the proposed change is technically superior for >> Japanese. We now need to find out whether there are contexts where a change >> would adversely affect users/implementers. Following that, we would look >> for endorsements of the proposal from implementers or other standards >> organizations such as W3C (and, if at all possible, agreement from those >> implementers who use the untailored algorithm now). With these three >> preconditions in place, I would support an effort of the UTC to revisit >> this question. >> >> A./ >> >> >> On 5/1/2015 9:48 AM, Ken Whistler wrote: >> >> Suzuki-san, >> >> On 5/1/2015 8:25 AM, suzuki toshiya wrote: >> >> >> Excuse me, is there any discussion record of how the UAX#14 class for >> halfwidth katakana was decided 15 years ago? If there is, I want to >> see a sample text (of halfwidth katakana) and the expected layout >> result for it. >> >> >> The *founding* document for the UTC discussion of the initial >> Line_Break property values 15 years ago was: >> >> http://www.unicode.org/L2/L1999/99179.pdf >> >> and the corresponding table draft (before approval and conversion >> into the final format that was published with UTR #14 -- later >> *UAX* #14) was: >> >> http://www.unicode.org/L2/L1999/99180.pdf >> >> There is nothing different or surprising in terms of values there. The >> halfwidth katakana were lb=AL and the fullwidth katakana were lb=ID in >> that earliest draft, as of 1999. 
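[Editorial aside: the halfwidth/fullwidth split Ken describes is still visible from Python's unicodedata module, which exposes East_Asian_Width (though not Line_Break). Halfwidth katakana are EAW=H and fullwidth are EAW=W, matching the AL/ID split in the 1999 draft derived from those EAW values.]

```python
import unicodedata

# East_Asian_Width for halfwidth vs. fullwidth katakana KA.
# L2/99-179 derived the initial Line_Break values partly from EAW.
halfwidth_ka = "\uFF76"  # HALFWIDTH KATAKANA LETTER KA
fullwidth_ka = "\u30AB"  # KATAKANA LETTER KA

print(unicodedata.east_asian_width(halfwidth_ka))  # H (Halfwidth) -> was lb=AL
print(unicodedata.east_asian_width(fullwidth_ka))  # W (Wide)      -> was lb=ID
```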
>> >> What is new information, perhaps, is the explicit correlation that can be >> found >> in those documents with the East_Asian_Width properties, and the >> explanation >> in L2/99-179 that the EAW property values were explicitly used to >> make distinctions for the initial LB values. >> >> There is no sample text or expected layout results from that time period, >> because that was not the basis for the original UTC decisions on any of >> this. >> Initial LB values were generated based on existing General_Category >> and EAW values, using general principles. They were not generated by >> examining and specifying in detail the line breaking behavior for >> every single script in the standard, and then working back from those >> detailed specifications to attempt to create a universal specification >> that would replicate all of that detailed behavior. Such an approach >> would have been nearly impossible, given the state of all the data, >> and might have taken a decade to complete. >> >> That said, Japanese line breaking was no doubt considered as part of >> the overall background, because the initial design for UTR #14 was >> informed >> by experience in implementation of line breaking algorithms at Microsoft >> in the 90's. >> >> >> You commented that the UAX#14 class should not be changed but >> the tailoring of the line breaking behaviour would solve >> the problem (as Firefox and IE11 did). However, some developers >> may wonder "there might be a reason why UTC put halfwidth-katakana >> to AL - without understanding it, we could not determine whether >> the proposed tailoring should be enabled always, or enabled >> only for a specific environment (e.g. locale, surrounding text)". >> >> >> See above, in L2/99-179. *That* was the justification. It had nothing >> to do with specific environment, locale, or surrounding text. 
>> >> >> If UTC can supply the "expected layout result for halfwidth- >> katakana (used to define the class in current UAX#14)", it >> would be helpful for the developers to evaluate the proposed >> tailoring algorithm. >> >> >> UAX #14 was never intended to be a detailed, script-by-script >> specification of line layout results. It is a default, generic, universal >> algorithm for line breaking that does a decent, generic job of >> line breaking in generic contexts without tailoring or specific >> knowledge of language, locale, or typographical conventions in use. >> >> UAX #14 is not a replacement for full specification of kinsoku >> rules for Japanese, in particular. Nor is it intended as any kind >> of replacement for JIS X 4051. >> >> Please understand this: UAX #14 does *NOT* tell anyone how >> Japanese text *should* line break. Instead, it is Japanese typographers, >> users and standardizers who tell implementers of line break >> algorithms for Japanese what the expectations for Japanese text should >> be, in what contexts. It is then the job of the UTC and of the >> platform and application vendors to negotiate the details of >> which part of that expected behavior makes sense to try to >> cover by tweaking the default line-breaking algorithm and the >> Line_Break property values for Unicode characters, and which >> part of that expected behavior makes sense to try to cover >> by adjusting commonly accessible and agreed upon tailoring >> behavior (or public standards like CSS), and finally which part of that >> expected behavior should instead be addressed by value-added, proprietary >> implementations of high end publishing software. >> >> Regards, >> >> --Ken >> >> >> >> >> >> > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From andrewcwest at gmail.com Mon Aug 17 03:16:45 2015 From: andrewcwest at gmail.com (Andrew West) Date: Mon, 17 Aug 2015 09:16:45 +0100 Subject: Standardised Variation Sequences with Toggles In-Reply-To: <20150816235008.43bbd56d@JRWUBU2> References: <20150816112024.1760f1a6@JRWUBU2> <55D0DFB2.7000806@att.net> <20150816235008.43bbd56d@JRWUBU2> Message-ID: On 16 August 2015 at 23:50, Richard Wordingham wrote: > >> @+ For details about the implementation of variation sequences in >> Phags-pa, please refer to the Phags-pa section of the core >> specification. > > a) This is likely to be ignored by someone who is just looking for the > *specification*. I think replacing 'implementation' by 'rendering' > would be better. I would be inclined to add, 'These sequences are more > complicated than they appear at first reading'. Otherwise, someone > will just add them to the character to glyph conversion section of a > font and think, "Job done". That's not a plausible scenario. Phags-pa has complex shaping and joining requirements, and it is impossible for someone to create a properly functioning Phags-pa font based on the code charts alone. If anyone did implement Phags-pa in a font based solely on the Unicode or 10646 code charts, with no joining or shaping behaviour, for use as a fallback font or as a code chart font then naively implementing U+A856 + U+FE00 (VS1) as a mirrored glyph is not unreasonable. If they want to produce a Phags-pa font for displaying running Phags-pa text then at a minimum they will need to read the appropriate section of the core specification. 
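[Editorial aside: in plain text, a standardized variation sequence like the one Andrew mentions is simply two code points — the base character followed by a variation selector; whether a font mirrors the glyph is purely a rendering question. A quick inspection of the pieces in Python:]

```python
import unicodedata

# A standardized variation sequence is an ordinary two-code-point sequence:
# base character + variation selector.
seq = "\uA856\uFE00"  # Phags-pa base letter + VARIATION SELECTOR-1 (VS1)

print(len(seq))                  # 2 code points, no special plain-text status
print(unicodedata.name(seq[0]))  # a PHAGS-PA letter
print(unicodedata.name(seq[1]))  # VARIATION SELECTOR-1
```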
Andrew From chris.fynn at gmail.com Mon Aug 17 04:22:35 2015 From: chris.fynn at gmail.com (Christopher Fynn) Date: Mon, 17 Aug 2015 14:52:35 +0530 Subject: Emoji characters for food allergens In-Reply-To: <29292306.26076.1437842589469.JavaMail.defaultUser@defaultHost> References: <29292306.26076.1437842589469.JavaMail.defaultUser@defaultHost> Message-ID: Surely there is already some international standards body or panel which deals with food safety and labelling? (maybe ISO 22000 Food Safety Management Systems) If there is a real need for characters to represent food allergens, wouldn't such a body be the right group to come up with appropriate glyphs and then make a proposal to ISO 10646 / Unicode - Chris On 25 July 2015 at 22:13, William_J_G Overington wrote: > Emoji characters for food allergens > > An interesting document entitled > > Preliminary proposal to add emoji characters for food allergens > > by Hiroyuki Komatsu > > was added into the UTC (Unicode Technical Committee) Document Register > yesterday. > > http://www.unicode.org/L2/L2015/15197-emoji-food-allergens.pdf > > This is a welcome development. > > I suggest that, in view of the importance of precision in conveying > information about food allergens, that the emoji characters for food > allergens should be separate characters from other emoji characters. That > is, encoded in a separate quite distinct block of code points far away in > the character map from other emoji characters, with no dual meanings for > any of the characters: a character for a food allergen should be quite > separate and distinct from a character for any other meaning. > > I opine that having two separate meanings for the same character, one > meaning as an everyday jolly good fun meaning in a text message and one > meaning as a specialist food allergen meaning could be a source of > confusion. 
Far better to encode a separate code block with separate > characters right from the start than risk needless and perhaps medically > dangerous confusion in the future. > > I suggest that for each allergen that there be two characters. > > The glyph for the first character of the pair goes from baseline to > ascender. > > The glyph for the second character of the pair is a copy of the glyph for > the first character of the pair augmented with a thick red line from lower > left descender to higher right a little above the base line, the thick red > line perhaps being at about thirty degrees from the horizontal. Thus the > thick red line would go over the allergen part of the glyph yet just by > clipping it a bit so that clarity is maintained. > > The glyphs are thus for the presence of the allergen and the absence of > the allergen respectively. > > It is typical in the United Kingdom to label food packets not only with an > ingredients list but also with a list of allergens in the food and also > with a list of allergens not in the food. > > For example, a particular food may contain soya yet not gluten. > > Thus I opine that two characters are needed for each allergen. > > I have deliberately avoided a total strike through at forty-five degrees > as I opine that that could lead to problems distinguishing clearly the > glyph for the absence of one allergen from the glyph for the absence of > another allergen. > > I have also wondered whether each glyph for an allergen should include > within its glyph a number, maybe a three-digit number, so that clarity is > precise. > > I opine that two separate characters for each allergen is desirable rather > than some solution such as having one character for each allergen and a > combining strike through character. > > The two separate characters approach keeps the system straightforward to > use with many software packages. 
The matter of expressing food allergens is > far too important to become entangled in problems for everyday users. > > For gluten, it might be necessary to have three distinct code points. > > In the United Kingdom there is a legal difference between "gluten-free" > and "no gluten-containing ingredients". > > To be labelled gluten-free the product must have been tested. This is to > ensure that there has been no cross-contamination of ingredients. For > example, rice has no gluten, but was a particular load of rice transported > in a lorry used for wheat on other days? > > Yet testing is not always possible in a restaurant situation. > > William Overington > > 25 July 2015 > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Mon Aug 17 06:48:48 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 17 Aug 2015 13:48:48 +0200 (CEST) Subject: APL Under-bar Characters Message-ID: <817872584.8325.1439812128893.JavaMail.www@wwinf1j14> On 16 Aug 2015, at 16:35, Alex Weiner wrote: > I have heard that the problem was brought to Unicode consortium before, and the answer was to just use the underline styling, as it is apparently equivalent, but I do not think it is. Underline styling usually connects the line from one letter to another l̲i̲k̲e̲ ̲t̲h̲i̲s̲. The under-bar characters do not do such connecting, and are actually only for capital letters. so it would look more L̲ I̲ K̲ E̲ ̲ T̲ H̲ I̲ S̲ (I added the spaces for dramatic effect). [And I left out the underscore in the middle.] This connecting behavior of the underline formatting can be disabled by checking the "Words only" check box in LibreOffice Writer, and surely in new versions of Microsoft Office Word, or in older versions by selecting "Words" in the underline style dropdown menu. When the Words only feature is enabled, the underline skips all spaces, including no-break spaces. 
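[Editorial aside: the distinction Alex is drawing — styling applied by a word processor versus characters in the text itself — can be made concrete. A combining under-bar is part of the plain-text character stream, and it has no precomposed form for normalization to collapse into; the character choices below are illustrative, not from the thread.]

```python
import unicodedata

# "Underline styling" lives in the rich-text layer; a combining mark is a
# character in the plain-text stream itself.
plain = "LIKE"
underbarred = "".join(ch + "\u0332" for ch in plain)  # U+0332 COMBINING LOW LINE

print(len(plain))        # 4 code points
print(len(underbarred))  # 8 code points: each letter carries its own mark

# There is no precomposed "A with low line", so NFC keeps the two-code-point pair:
print(len(unicodedata.normalize("NFC", "A\u0332")))  # still 2
```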
____ I need to mention that I would have posted this yesterday, but refrained because I had already sent too much these days, particularly in the threads about Michael Kaplan. I repeat that I am very sorry, and I ask Michael to forgive my reactions to the blog post he wrote ad hoc while he was angry about our discussion here. ____ On 17 Aug 2015 at 02:25, Ken Whistler wrote: > It isn't as if a bunch of ignorant Unicoders just grabbed one APL book off the shelf and coded up the table, not noticing that some stuff was missing. The false idea about underline formatting (which "usually connects") that, in this highly technical context (after all, a programming language), was set against the advice of the Unicode consortium made me think immediately that such reasonings are symptomatic of the Unicode-contesting posture that seems to be the first reflex of many people (including myself) from the beginning, historically as well as personally speaking. Since I too, at my start on this mailing list (which was also my first mailing list participation), engaged in such an immature attempt, I seem well placed to state once and for all (I hope) that making trouble for little to no use is a bit like process garbage: it wastes the mental and physical energy that should be saved to save the planet, and the life on it. All the best, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Mon Aug 17 06:51:26 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 17 Aug 2015 13:51:26 +0200 (CEST) Subject: Custom keyboard source samples (was: Re: Windows keyboard restrictions) Message-ID: <749936537.8372.1439812286240.JavaMail.www@wwinf1j14> On 07 Aug 2015, at 20:54, Richard Wordingham wrote: > What we're waiting for is a guide we can follow, or some code we can ape. 
Since yesterday I know a very simple way to get the source code (in C) of any MSKLC layout. While the build is done, we must wait for the four files appearing in an ad hoc created "amd64" subdirectory in the Temporary Files folder, in the hidden Application Data directory: C:\Users\user\AppData\Local\Temp\amd64 As soon as the four files are visible in the Explorer, we can press Ctrl+A, Ctrl+C, Ctrl+V. This must be done rapidly, in order to get a copy before their deletion by MSKLC a few seconds later. If we notice that during the build, three other temporary folders are created by MSKLC and deleted if empty, we may wish to know that the four files are strictly identical in all four folders. This has been verified on a simple layout, using the (very useful) comparison tool of the ConTEXT text editor. Best regards, Marcel From doug at ewellic.org Mon Aug 17 11:23:15 2015 From: doug at ewellic.org (Doug Ewell) Date: Mon, 17 Aug 2015 09:23:15 -0700 Subject: APL Under-bar Characters Message-ID: <20150817092315.665a7a7059d7ee80bb4d670165c8327d.91850dae39.wbe@email03.secureserver.net> wrote: > I have heard that the problem was brought to Unicode consortium > before, and the answer was to just use the underline styling, as it is > apparently equivalent, but I do not think it is. Combining character sequences are not "styling." Combining character sequences are plain text. They are not the same as marking a letter or word or paragraph in your word processor and clicking a button to make that text bold or italic or underlined. In layman's terms, each combining sequence (base character plus any number of combining characters) should be treated as a unit, regardless of whether the sequence has been assigned a name. So these sequences are indeed equivalent to the APL-specific "underlined letter" characters used in non-Unicode systems. > Underline styling usually connects the line from one letter to another > l̲i̲k̲e̲ ̲t̲h̲i̲s̲.̲ 
The under-bar characters do not do such connecting, > and are actually only for capital letters. so it would look more > L̲ I̲ K̲ E̲ ̲ T̲ H̲ I̲ S̲ (I added the spaces for dramatic effect). TUS 7.0, Section 7.9 does say: > > The characters U+0332 COMBINING LOW LINE, U+0333 COMBINING DOUBLE LOW > LINE, U+0305 COMBINING OVERLINE, and U+033F COMBINING DOUBLE OVERLINE > are intended to connect on the left and right. In that case, despite the text in Section 22.7 that Ken quoted, it seems that U+0331 COMBINING MACRON might be a better choice for APL "underlined letters" than U+0332 COMBINING LOW LINE. Compare A̱ḆC̱ with A̲B̲C̲, noting that your font and rendering engine mileage may vary. "Voting again" to change one of the basic rules of Unicode, on the basis that "perhaps feelings about the under-bar characters have changed since then," is not expected to be an option, as David said. > Then maybe we could work off that as a pseudo-standard? Neither named nor unnamed character sequences are a "pseudo-standard." Both are part of the Unicode Standard. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From neil at tonal.clara.co.uk Mon Aug 17 13:02:50 2015 From: neil at tonal.clara.co.uk (Neil Harris) Date: Mon, 17 Aug 2015 19:02:50 +0100 Subject: APL Under-bar Characters In-Reply-To: <20150817092315.665a7a7059d7ee80bb4d670165c8327d.91850dae39.wbe@email03.secureserver.net> References: <20150817092315.665a7a7059d7ee80bb4d670165c8327d.91850dae39.wbe@email03.secureserver.net> Message-ID: <55D221CA.5030807@tonal.clara.co.uk> On 17/08/15 17:23, Doug Ewell wrote: > > In that case, despite the text in Section 22.7 that Ken quoted, it seems > that U+0331 COMBINING MACRON might be a better choice for APL > "underlined letters" than U+0332 COMBINING LOW LINE. Compare A̱ḆC̱ > with A̲B̲C̲, noting that your font and rendering engine mileage may > vary. 
> > "Voting again" to change one of the basic rules of Unicode, on the basis > that "perhaps feelings about the under-bar characters have changed since > then," is not expected to be an option, as David said. > Doug is right. One small correction: U+0331 is COMBINING MACRON BELOW, not COMBINING MACRON. Wikipedia has an excellent article on this topic: https://en.wikipedia.org/wiki/Macron_below -- Neil From doug at ewellic.org Mon Aug 17 13:14:37 2015 From: doug at ewellic.org (Doug Ewell) Date: Mon, 17 Aug 2015 11:14:37 -0700 Subject: APL Under-bar Characters Message-ID: <20150817111437.665a7a7059d7ee80bb4d670165c8327d.edd06ccc55.wbe@email03.secureserver.net> Neil Harris wrote: > One small correction: U+0331 is COMBINING MACRON BELOW, not COMBINING > MACRON. Yes, thank you. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From charupdate at orange.fr Mon Aug 17 15:02:04 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 17 Aug 2015 22:02:04 +0200 (CEST) Subject: APL Under-bar Characters Message-ID: <2040787231.15099.1439841724457.JavaMail.www@wwinf1g37> On 17 Aug 2015 at 18:34, Doug Ewell wrote [included subsequent exchange]: > > I have heard that the problem was brought to Unicode consortium > > before, and the answer was to just use the underline styling, as it is > > apparently equivalent, but I do not think it is. > > Combining character sequences are not "styling." Combining character > sequences are plain text. They are not the same as marking a letter or > word or paragraph in your word processor and clicking a button to make > that text bold or italic or underlined. > > In layman's terms, each combining sequence (base character plus any > number of combining characters) should be treated as a unit, regardless > of whether the sequence has been assigned a name. So these sequences are > indeed equivalent to the APL-specific "underlined letter" characters > used in non-Unicode systems. 
> > > Underline styling usually connects the line from one letter to another > > l̲i̲k̲e̲ ̲t̲h̲i̲s̲.̲ The under-bar characters do not do such connecting, > > and are actually only for capital letters. so it would look more > > L̲ I̲ K̲ E̲ ̲ T̲ H̲ I̲ S̲ (I added the spaces for dramatic effect). > > TUS 7.0, Section 7.9 does say: > > > > The characters U+0332 COMBINING LOW LINE, U+0333 COMBINING DOUBLE LOW > > LINE, U+0305 COMBINING OVERLINE, and U+033F COMBINING DOUBLE OVERLINE > > are intended to connect on the left and right. > > In that case, despite the text in Section 22.7 that Ken quoted, it seems > that U+0331 COMBINING MACRON BELOW might be a better choice for APL > "underlined letters" than U+0332 COMBINING LOW LINE. Compare A̱ḆC̱ > with A̲B̲C̲, noting that your font and rendering engine mileage may > vary. 
What I've done in Temp is not really working there; it is rather fetching the data from where MSKLC stores it for a few seconds. This is independent of the working directory. Creating a new folder in Temp in advance helps in getting the data. We can also wait for the amd64 folder, or the i386 or ia64 folders (which appear first), or wow64 (which comes last), and select it before copying and pasting somewhere else. I just fear we haven't enough time for this procedure, so creating the folder beforehand is a way to ensure that we get the files within the allotted time. These custom samples help to complete the WDK keyboard layout source samples: the WDK samples being for current keyboard layouts (US English, Greek, French, German, Japanese), they don't include ligature tables. A working practice is IMHO to pack the maximum into an MSKLC layout, get the C sources and the installation package, then edit the sources and recompile using the WDK. Eventually we need to install the MSKLC layout first, then replace the driver in the System32 folder and reboot. This is a way to develop enhanced layouts, with chained dead keys, increased numerical keypad mapping with more code units and a numpad accessible also while Fn is pressed (on compact keyboards), more or different modifiers, and so on. To do this, not much programming knowledge is needed. As Richard calls it, we can ape the code, looking up kbd.h and winuser.h in the WDK or the MSKLC for scan codes and virtual key names. However, aping the split allocation tables and the reduced numpad digits (without shift states) is strongly discouraged. The best way is to unify the allocation tables, with the modification of some other code lines which that implies. I'm hopeful that this helps implement more of Unicode at the input level. But this is only *one* way to do so, and today it certainly isn't any longer the most performant one. 
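[Editorial aside: the race against MSKLC's cleanup described above can be automated rather than won by hand with Ctrl+A/C/V. The sketch below is a generic snapshot helper meant to be called in a polling loop; the MSKLC Temp\amd64 path is taken from the post, so the demo simply exercises the helper on a throwaway local directory.]

```python
import shutil
import tempfile
from pathlib import Path

def snapshot_files(watch_dir: Path, dest_dir: Path) -> list:
    """Copy every file currently present in watch_dir into dest_dir.

    Intended to be polled repeatedly while MSKLC builds, e.g. on
    %LOCALAPPDATA%\\Temp\\amd64 (path per the post above), so the
    sources are captured before MSKLC deletes them.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for f in sorted(watch_dir.glob("*")):
        if f.is_file():
            shutil.copy2(f, dest_dir / f.name)
            copied.append(f.name)
    return copied

# Demo on a throwaway directory standing in for Temp\amd64:
work = Path(tempfile.mkdtemp())
(work / "amd64").mkdir()
(work / "amd64" / "layout.c").write_text("/* driver source */")
print(snapshot_files(work / "amd64", work / "saved"))  # ['layout.c']
```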
Marcel On 17 Aug 2015 at 13:59, I wrote: > On 07 Aug 2015, at 20:54, Richard Wordingham wrote: > > > What we're waiting for is a guide we can follow, or some code we can ape. > > Since yesterday I know a very simple way to get the source code (in C) of any MSKLC layout. > > While the build is done, we must wait for the four files appearing in an ad hoc created "amd64" subdirectory in the Temporary Files folder, in the hidden Application Data directory: C:\Users\user\AppData\Local\Temp\amd64 > As soon as the four files are visible in the Explorer, we can press Ctrl+A, Ctrl+C, Ctrl+V. This must be done rapidly, in order to get a copy before their deletion by MSKLC a few seconds later. > > If we notice that during the build, three other temporary folders are created by MSKLC and deleted if empty, we may wish to know that the four files are strictly identical in all four folders. This has been verified on a simple layout, using the (very useful) comparison tool of the ConTEXT text editor. > > Best regards, > > Marcel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From alexweiner at alexweiner.com Mon Aug 17 17:32:37 2015 From: alexweiner at alexweiner.com (alexweiner at alexweiner.com) Date: Mon, 17 Aug 2015 15:32:37 -0700 Subject: APL Under-bar Characters Message-ID: <20150817153237.e74b0ce91403bfe413f98785c6a226af.d64e5e188d.wbe@email06.secureserver.net> An HTML attachment was scrubbed... 
URL: From olopierpa at gmail.com Mon Aug 17 17:48:16 2015 From: olopierpa at gmail.com (Pierpaolo Bernardi) Date: Tue, 18 Aug 2015 00:48:16 +0200 Subject: APL Under-bar Characters In-Reply-To: <20150817153237.e74b0ce91403bfe413f98785c6a226af.d64e5e188d.wbe@email06.secureserver.net> References: <20150817153237.e74b0ce91403bfe413f98785c6a226af.d64e5e188d.wbe@email06.secureserver.net> Message-ID: On Tue, Aug 18, 2015 at 12:32 AM, wrote: > Hi Doug, > > I think I am going to suggest that GNUAPL use > http://www.unicode.org/Public/UCD/latest/ucd/NamedSequences.txt > as previously suggested as it seems like it may provide a way for GNUAPL to > support characters with under-bars, and ease all our parsing problems. How can giving a name to a sequence allow you to change your parser in ways that you couldn't without an official name to the sequence? From alexweiner at alexweiner.com Mon Aug 17 21:57:22 2015 From: alexweiner at alexweiner.com (alexweiner at alexweiner.com) Date: Mon, 17 Aug 2015 19:57:22 -0700 Subject: APL Under-bar Characters Message-ID: <20150817195722.e74b0ce91403bfe413f98785c6a226af.489356c3b1.wbe@email06.secureserver.net> An HTML attachment was scrubbed... URL: From prosfilaes at gmail.com Mon Aug 17 23:45:23 2015 From: prosfilaes at gmail.com (David Starner) Date: Tue, 18 Aug 2015 04:45:23 +0000 Subject: APL Under-bar Characters In-Reply-To: <20150817195722.e74b0ce91403bfe413f98785c6a226af.489356c3b1.wbe@email06.secureserver.net> References: <20150817195722.e74b0ce91403bfe413f98785c6a226af.489356c3b1.wbe@email06.secureserver.net> Message-ID: On Mon, Aug 17, 2015 at 8:03 PM wrote: > Pierpaolo, > > You make a very good observation. You are essentially asking the question > that began the whole discussion. This is covered in depth in the gnuapl > mailing list. 
You can go their archive, and just search my name :) > > Since it seems that all hope of adding characters is lost, I think the > next best goal would be to try an reach some sort of semblance between the > Unicode Consortium and a nebulous group of people (APLers) who really > believe that the uppercase under-bar letters are atomic and different than > an underlined uppercase letters. > There are many languages, particularly Native American languages, given written form in the typewriter era that use letters with under-bar as part of their alphabet. And the underbar is no different from the cedilla, the acute and grave accents, the umlaut or many other modifiers used to make new characters in languages across the globe. There are single code-point versions of characters like ?, but that's historical coincidence, and they are equivalent to the two code-point versions. Arguing atomicity is missing the point; A? is as atomic as ? in Unicode's eyes. -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Tue Aug 18 02:32:01 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Tue, 18 Aug 2015 09:32:01 +0200 (CEST) Subject: APL Under-bar Characters In-Reply-To: References: <20150817195722.e74b0ce91403bfe413f98785c6a226af.489356c3b1.wbe@email06.secureserver.net> Message-ID: <2034236760.3262.1439883121516.JavaMail.www@wwinf1f13> On 18 Aug 2015 at 06:56, David Starner wrote: > There are many languages, particularly Native American languages, given written form in the typewriter era that use letters with under-bar as part of their alphabet. And the underbar is no different from the cedilla, the acute and grave accents, the umlaut or many other modifiers used to make new characters in languages across the globe. There are single code-point versions of characters like ?, but that's historical coincidence, and they are equivalent to the two code-point versions. Arguing atomicity is missing the point; A? 
is as atomic as ? in Unicode's eyes. IMHO the problem arose from GNU APL implementing Unicode while still hesitating (and seemingly even being about to abandon it). I just picked one e-mail out of the archives (following Alex Weiner's invitation) http://lists.gnu.org/archive/html/bug-apl/2015-08/msg00047.html and have no time to browse them all, but as I must implement APL on the keyboard along with universal Latin, I'm interested in deciphering how GNU APL views characters. IMO the way Unicode worked out to feasibly encode all the characters in the world, with decomposition sequences and with precomposed characters retained only for backward compatibility's sake, is at odds with GNU APL sticking with the inherited model. This antagonism may be exacerbated by GNU being part of the Free Software Movement, as opposed to the business model of the companies of which Unicode is a consortium. This may partly explain the tone of one part of this thread (except for my own comment). So it could really be a good idea to put GNU APL at ease with Unicode. If underbar letters are for the sole use of GNU APL, their implementation and font support will be catered for by that organization, and it would be enough to discourage their use outside of APL to address the security issues. However, Ken Whistler explained clearly [http://www.unicode.org/mail-arch/unicode-ml/y2015-m08/0122.html] that today APL would benefit from updating to the current character model. To make this plausible, I suggest considering that free software and proprietary software, rather than being antagonistic, are complementary. I hope this (like other people's contributions on this thread) is a constructive view that helps resolve the differences, given that particular requests cannot be dealt with fully as long as the underlying philosophy isn't satisfactorily taken into account. Marcel -------------- next part -------------- An HTML attachment was scrubbed...
URL: From verdy_p at wanadoo.fr Tue Aug 18 03:09:43 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 18 Aug 2015 10:09:43 +0200 Subject: Custom keyboard source samples (was: Re: Windows keyboard restrictions) In-Reply-To: <749936537.8372.1439812286240.JavaMail.www@wwinf1j14> References: <749936537.8372.1439812286240.JavaMail.www@wwinf1j14> Message-ID: It helps if you reduce the processor frequency (if you don't have a tool for that, use the energy control panel and set the power profile to maximum energy saving) just before clicking the button to build the package. I don't know why these C source files need to be deleted so fast when they could just remain in the same folder as the saved .klc file. Note that the tool builds several packages for several processor types, not just amd64. On 17 August 2015 at 13:59, "Marcel Schneider" wrote: > On 07 Aug 2015, at 20:54, Richard Wordingham wrote: > > > What we're waiting for is a guide we can follow, or some code we can ape. > > Since yesterday I know a very simple way to get the source code (in C) of > any MSKLC layout. > > While the build runs, we must wait for the four files to appear in an > "amd64" subdirectory created ad hoc in the Temporary Files folder, in the > hidden Application Data directory: C:\Users\user\AppData\Local\Temp\amd64 > As soon as the four files are visible in the Explorer, we can press > Ctrl+A, Ctrl+C, Ctrl+V. This must be done rapidly, in order to get a copy > before their deletion by MSKLC a few seconds later. > > If we notice that during the build, three other temporary folders are > created by MSKLC and deleted if empty, we may wish to know that the four > files are strictly identical in all four folders. This has been verified on > a simple layout, using the (very useful) comparison tool of the ConTEXT > text editor. > > Best regards, > > Marcel > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From charupdate at orange.fr Tue Aug 18 04:42:01 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Tue, 18 Aug 2015 11:42:01 +0200 (CEST) Subject: Custom keyboard source samples (was: Re: Windows keyboard restrictions) In-Reply-To: References: <749936537.8372.1439812286240.JavaMail.www@wwinf1j14> Message-ID: <1050352786.6768.1439890921356.JavaMail.www@wwinf1f13> On 18 Aug 2015 at 10:09, Philippe Verdy wrote: > It helps if you reduce the processor frequency (if you don't have a tool for that, use the energy control panel and set the power profile to maximum energy saving) just before clicking the button to build the package. That's a very good idea. I've currently throttled down the CPU to half performance (my computer being a netbook, and the processor heats up everything around it). This is very important to note for users of desktop machines working at high performance. One really needs to slow down for this operation. > I don't know why these C source files need to be deleted so fast when they could just remain in the same folder as the saved .klc file. That is, again, a very good question, and I have often asked it myself. The short answer is: there's no need. The long answer is: as emerges from decrypting* the blog http://www.siao2.com, Michael Scott Kaplan, the author of MSKLC, tried to meet the users' needs as fully as possible* and what users were expected to wish to have on their keyboard, and to create a smooth UI with maximum security.* Well, editing the C source and the header file oneself is about the opposite. So Michael didn't want his users to bother with that (which would inevitably have occurred if he had put the files within reach of the end user). Now, since they're meant to stay hidden, the best is to delete them right after use; but nevertheless, for a long time I couldn't help thinking something I won't post here anymore because it's about software policies.* > Note that the tool builds several packages for several processor types, not just amd64.
I had to choose one folder. amd64 comes first in the alphabet, and it comes third in the MSKLC build process, so we have enough time to switch back to the Explorer window where the desired files are awaited. * Like many other people who wished that MSKLC supported chained dead keys, I often wished that Windows supported chained dead keys (as GNU/Linux does), thinking that it didn't, given that MSKLC doesn't offer this option. When I learned that Windows does, I deduced that MSKLC was purposely restricted to prevent users from making any "too useful" keyboard layouts, until, thanks to Doug Ewell drawing our attention to it, I learned of the existence of Michael Kaplan's blog, and finally found this blog post on it: http://www.siao2.com/2004/12/17/323257.aspx I still wonder why one should not like to type two dead keys to get double-diacriticized letters, but I agree that the new way of typing text is with combining diacritics. Now I see that it would have been enough to type "who is the author of MSKLC" into the Bing search bar to learn the name from the second result, and the story on his own blog from the fifth result... http://www.siao2.com/2013/10/04/10454264.aspx I missed that! October 2013 was before the time I began to really bother with keyboard layouts. But it was the time I should have begun to. Sorry, Michael! Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From eik at iki.fi Tue Aug 18 05:55:53 2015 From: eik at iki.fi (Erkki I Kolehmainen) Date: Tue, 18 Aug 2015 13:55:53 +0300 Subject: APL Under-bar Characters In-Reply-To: <2034236760.3262.1439883121516.JavaMail.www@wwinf1f13> References: <20150817195722.e74b0ce91403bfe413f98785c6a226af.489356c3b1.wbe@email06.secureserver.net> <2034236760.3262.1439883121516.JavaMail.www@wwinf1f13> Message-ID: <000001d0d9a4$740e42e0$5c2ac8a0$@fi> Mr. Schneider Free Software Movement or not makes no difference.
Furthermore, please consult the membership roster of Unicode before making statements on what Unicode is a consortium of. You also state: If underbar letters are for the sole use of GNU APL, their implementation and font support will be catered for by this organization, and it would be enough to discourage their use outside of APL to meet the security issues. If composed letters are not acceptable, for whatever and however incomprehensible reason, there is a perfect solution: PUA. Sincerely, Erkki I. Kolehmainen Tilkankatu 12 A 3, 00300 Helsinki, Finland Mob: +358400825943, Tel: +358943682643, Fax: +35813318116 -------------- next part -------------- An HTML attachment was scrubbed... URL: From alexweiner at alexweiner.com Tue Aug 18 08:18:47 2015 From: alexweiner at alexweiner.com (alexweiner at alexweiner.com) Date: Tue, 18 Aug 2015 06:18:47 -0700 Subject: APL Under-bar Characters Message-ID: <20150818061847.e74b0ce91403bfe413f98785c6a226af.fada311151.mailapi@mailapi06.secureserver.net> "PUA"?
-------- Original Message -------- Subject: RE: APL Under-bar Characters From: "Erkki I Kolehmainen" Date: Aug 18, 2015 6:55 AM To: "'Marcel Schneider'" ,"'Unicode Mailing List'" CC: alexweiner at alexweiner.com Mr. Schneider Free Software Movement or not makes no difference. From leob at mailcom.com Tue Aug 18 09:38:42 2015 From: leob at mailcom.com (Leo Broukhis) Date: Tue, 18 Aug 2015 16:38:42 +0200 Subject: APL Under-bar Characters In-Reply-To: <20150818061847.e74b0ce91403bfe413f98785c6a226af.fada311151.mailapi@mailapi06.secureserver.net> References: <20150818061847.e74b0ce91403bfe413f98785c6a226af.fada311151.mailapi@mailapi06.secureserver.net> Message-ID: http://www.acronymfinder.com/Information-Technology/PUA.html On Tue, Aug 18, 2015 at 3:18 PM, wrote: > "PUA"? From alexweiner at alexweiner.com Tue Aug 18 09:55:59 2015 From: alexweiner at alexweiner.com (alexweiner at alexweiner.com) Date: Tue, 18 Aug 2015 07:55:59 -0700 Subject: APL Under-bar Characters Message-ID: <20150818075559.e74b0ce91403bfe413f98785c6a226af.1587c769a1.mailapi@mailapi06.secureserver.net> ah yes. I believe the "private use area" was also suggested and may provide a route to take -Alex -------- Original Message -------- Subject: Re: APL Under-bar Characters From: Leo Broukhis Date: Aug 18, 2015 10:38 AM To: alexweiner at alexweiner.com CC: eik at iki.fi, charupdate at orange.fr, "unicode Unicode Discussion" http://www.acronymfinder.com/Information-Technology/PUA.html

On Tue, Aug 18, 2015 at 3:18 PM, wrote:
> "PUA"?
>
From doug at ewellic.org Tue Aug 18 10:11:33 2015 From: doug at ewellic.org (Doug Ewell) Date: Tue, 18 Aug 2015 08:11:33 -0700 Subject: APL Under-bar Characters Message-ID: <20150818081133.665a7a7059d7ee80bb4d670165c8327d.d3a7f92947.wbe@email03.secureserver.net> wrote: > Since it seems that all hope of adding characters is lost, I think the > next best goal would be to try and reach some sort of understanding between > the Unicode Consortium and a nebulous group of people (APLers) who > really believe that the uppercase under-bar letters are atomic and > different from underlined uppercase letters. This reminds me of an argument that has occasionally been made that Unicode should encode Latin "majuscules" separately from "capital letters" because the semantic functions of the two are different (typographical choice vs. orthographic rule). Unicode does not provide distinct encodings of "a" based on pronunciation (chaos, cat, star). It does not provide distinct encodings of "uppercase A" based on the reason for using the uppercase instead of the lowercase (HAPPY, Adam). And it also does not provide distinct encodings of "uppercase A with underline" based on the reason for underlining the letter. > Some sort of list, no matter how "unofficial", is better than no list > at all, right? Wouldn't the Unicode Consortium be the place for such a > list, such as in NamedSequences.txt ? It is NOT necessary for a combining sequence to be assigned a name, either by the Unicode Technical Committee or by anyone else, in order to use it. Note that of the two sequences: A̱ <0041, 0331> A̲ <0041, 0332> neither sequence is listed in NamedSequences.txt, yet I can use them without limitation in this email and in plain text generally. I'm not sure the general concept of combining sequences is well understood in this thread. -- Doug Ewell | http://ewellic.org | Thornton, CO ????
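Doug's point, that a combining sequence needs no registered name to be usable, can be checked mechanically. The following Python sketch (illustrative only, not part of the thread; variable names are invented) verifies that neither <0041, 0331> nor <0041, 0332> composes to a single precomposed code point under NFC, and that both are ordinary two-code-point plain-text strings:

```python
import unicodedata

# The two sequences cited above: neither appears in NamedSequences.txt,
# yet both are valid combining sequences usable in any plain text.
a_macron_below = "\u0041\u0331"  # A + COMBINING MACRON BELOW
a_low_line = "\u0041\u0332"      # A + COMBINING LOW LINE

for seq in (a_macron_below, a_low_line):
    # Two code points, and NFC leaves them alone: Unicode encodes no
    # precomposed "capital A with line below", so the sequence is
    # already its own canonical form.
    assert len(seq) == 2
    assert unicodedata.normalize("NFC", seq) == seq
    # The second code point is a nonspacing combining mark (gc=Mn).
    assert unicodedata.category(seq[1]) == "Mn"
```

No lookup in any registry is involved; the sequences work simply because the combining marks are defined characters.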
From asmus-inc at ix.netcom.com Tue Aug 18 10:34:52 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Tue, 18 Aug 2015 08:34:52 -0700 Subject: APL Under-bar Characters In-Reply-To: <20150818061847.e74b0ce91403bfe413f98785c6a226af.fada311151.mailapi@mailapi06.secureserver.net> References: <20150818061847.e74b0ce91403bfe413f98785c6a226af.fada311151.mailapi@mailapi06.secureserver.net> Message-ID: <55D3509C.3050404@ix.netcom.com> An HTML attachment was scrubbed... URL: From tom at bluesky.org Tue Aug 18 10:45:01 2015 From: tom at bluesky.org (Tom Gewecke) Date: Tue, 18 Aug 2015 11:45:01 -0400 Subject: APL Under-bar Characters In-Reply-To: <20150818081133.665a7a7059d7ee80bb4d670165c8327d.d3a7f92947.wbe@email03.secureserver.net> References: <20150818081133.665a7a7059d7ee80bb4d670165c8327d.d3a7f92947.wbe@email03.secureserver.net> Message-ID: <3A4F5254-7C02-4ACC-BF4C-E975A20BB45D@bluesky.org> On Aug 18, 2015, at 11:11 AM, Doug Ewell wrote: > > It is NOT necessary for a combining sequence to be assigned a name, > either by the Unicode Technical Committee or by anyone else, in order to > use it. Note that of the two sequences: > > A̱ <0041, 0331> > A̲ <0041, 0332> > > neither sequence is listed in NamedSequences.txt, yet I can use them > without limitation in this email and in plain text generally. I guess the question is whether having a named sequence would somehow make it easier for the gnu apl folks to add something to their system so that their string length function sees such a sequence as having a length of "1"?
From kenwhistler at att.net Tue Aug 18 11:22:53 2015 From: kenwhistler at att.net (Ken Whistler) Date: Tue, 18 Aug 2015 09:22:53 -0700 Subject: APL Under-bar Characters In-Reply-To: <20150817092315.665a7a7059d7ee80bb4d670165c8327d.91850dae39.wbe@email03.secureserver.net> References: <20150817092315.665a7a7059d7ee80bb4d670165c8327d.91850dae39.wbe@email03.secureserver.net> Message-ID: <55D35BDD.2020006@att.net> Returning to a historical note on the glyphic forms and the question of combining low lines or combining macrons below... admittedly a side note on this thread, the *original* identification of these APL uppercase Latin letters, at least in their IBM implementations, was clearly as uppercase (italicized) Latin letters with *underscores* -- not with macrons below. The identification of the entity we have been talking about, for example, in IBM documentation is *LA480000*, described as: "A Line Below Capital/A Underscore (APL)" It is shown in the documentation with an *underscore*, with the scoring reaching to match the outside serifs of the "A", and clearly not with a macron below. Furthermore, the glyph character identification system from that era (late 1980s) contained a value for a "Line below" diacritic (that's the "48" in the glyph identifier above), but no provision for macrons *below* a letter. Consider also that the keyboards and character sets in use at the time had underscores (low lines), but macrons below were rare diacritics, and not in anybody's character set at the time. The appearance of underscored characters in printed material at the time would typically involve a gap between the underscoring on adjacent characters, but that was typically a result of discrete type elements in the print trains. It wasn't because conceptually the underscores were being treated as deliberately short diacritics that should *not* connect.
The underscores were more likely to connect on screens, but that was typically the result of the very limited scale of the character generator pixel rasters for the characters. You just turned on all the pixels in the bottom row of the box -- and there you had your underscore! The documentation that Doug cites from Section 7.9 of TUS was written to clarify that the *general* intent, when people use underscoring or overscoring diacritics, is that they should connect laterally. That is to contrast with macron diacritics, above *or* below, which of course do not connect laterally to adjacent macrons. But without a very specialized font, it is very difficult to do lateral connection correctly for variable width fonts. See examples below for Helvetica, Times, and Courier (although your mileage may differ, depending on your email client fonts, as Doug noted): A?B?O?M?I?N?A?T?I?O?N? vs. _ABOMINATION_ A?B?O?M?I?N?A?T?I?O?N? vs. _ABOMINATION_ ?A?B?O?M?I?N?A?T?I?O?N vs. _ABOMINATION_ Sequences of combining low lines after letters on the left, styling with underscoring on the right. Only for fixed width fonts does this really work "as designed", so to speak. Hence, the general recommendation that if what you are trying to do is underscore (or overscore) a sequence of text, by all means do it with styling, and not with sequences of individual diacritics on letters. But the fact that underscores used as diacritics on letters are basically a 20th century typewriter hack that persisted into early computer character sets -- and the fact that they don't work very well, or look very elegant with most modern, digital, variable width fonts, interestingly has led to the rise of the macron below, very much along the lines of Doug's suggestion cited below. 
What used to be a rare diacritic is gaining in popularity in actual use, precisely because it "looks like" a diacritic on the letter in most fonts nowadays, and because it *doesn't* connect, more or less randomly, with neighboring diacritics on adjacent letters, the way the low line diacritic can. And while I am generally sympathetic to this changeover, and suspect it is probably the best outcome for cases like the use of line below diacritics on Latin letters in Semitic transliteration, I don't think it is the best recommendation for this particular case of a legacy usage for APL. The APL letters with underscores clearly *are* historically connected precisely to the underscore, and should probably stay represented accordingly. If APL aficionados don't prefer the underscores visually connecting between adjacent capital Latin letters in APL text material presented that way, then that can be addressed in the APL specialist fonts. After all, such fonts already exist, precisely to provide best display for all the other specialized symbols of APL. --Ken On 8/17/2015 9:23 AM, Doug Ewell wrote: > TUS 7.0, Section 7.9 does say: >> The characters U+0332 COMBINING LOW LINE, U+0333 COMBINING DOUBLE LOW >> LINE, U+0305 COMBINING OVERLINE, and U+033F COMBINING DOUBLE OVERLINE >> are intended to connect on the left and right. > In that case, despite the text in Section 22.7 that Ken quoted, it seems > that U+0331 COMBINING MACRON [BELOW] might be a better choice for APL > "underlined letters" than U+0332 COMBINING LOW LINE. Compare A̱ḆC̱ > with A̲B̲C̲, noting that your font and rendering engine mileage may > vary. > > > > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From doug at ewellic.org Tue Aug 18 11:23:53 2015 From: doug at ewellic.org (Doug Ewell) Date: Tue, 18 Aug 2015 09:23:53 -0700 Subject: APL Under-bar Characters Message-ID: <20150818092353.665a7a7059d7ee80bb4d670165c8327d.bc667e26d2.wbe@email03.secureserver.net> Tom Gewecke wrote: > I guess the question is whether having a named sequence would somehow > make it easier for the gnu apl folks to add something to their system > so that their string length function sees such a sequence as having a > length of "1"? I don't see why that would, or should, be the determining factor. A more robust approach for their purposes might be to teach ⍴ to exclude combining characters (gc=Mn) when counting the "size" of a string. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From doug at ewellic.org Tue Aug 18 11:29:13 2015 From: doug at ewellic.org (Doug Ewell) Date: Tue, 18 Aug 2015 09:29:13 -0700 Subject: APL Under-bar Characters Message-ID: <20150818092913.665a7a7059d7ee80bb4d670165c8327d.e2ba804e00.wbe@email03.secureserver.net> Ken Whistler wrote: > Returning to a historical note on the glyphic forms and the question > of combining low lines or combining macrons below... admittedly a > side note on this thread, the *original* identification of these APL > uppercase Latin letters, at least in their IBM implementations, was > clearly as uppercase (italicized) Latin letters with *underscores* -- > not with macrons below. > ... I absolutely stand corrected on this. U+0332 it is. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? 
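Doug's suggested approach -- counting the "size" of a string while skipping combining characters with General_Category Mn -- can be sketched in a few lines of Python. This is purely illustrative (it is not APL, and `visual_length` is a name invented here, not anyone's actual API):

```python
import unicodedata

def visual_length(s):
    """Count the "characters" of s, excluding nonspacing marks (gc=Mn),
    along the lines of the suggestion above.  Illustrative only."""
    return sum(1 for c in s if unicodedata.category(c) != "Mn")

# An APL-style under-bar name: each letter followed by
# U+0332 COMBINING LOW LINE.
name = "".join(c + "\u0332" for c in "ABC")
print(len(name))            # 6 code points
print(visual_length(name))  # 3 visible letters
```

A real APL length primitive, by contrast, counts array elements (code units), which is exactly why the replies that follow argue such a change is unlikely to happen.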
From kenwhistler at att.net Tue Aug 18 11:35:58 2015 From: kenwhistler at att.net (Ken Whistler) Date: Tue, 18 Aug 2015 09:35:58 -0700 Subject: APL Under-bar Characters In-Reply-To: <20150818092353.665a7a7059d7ee80bb4d670165c8327d.bc667e26d2.wbe@email03.secureserver.net> References: <20150818092353.665a7a7059d7ee80bb4d670165c8327d.bc667e26d2.wbe@email03.secureserver.net> Message-ID: <55D35EEE.70205@att.net> On 8/18/2015 9:23 AM, Doug Ewell wrote: > Tom Gewecke wrote: > >> I guess the question is whether having a named sequence would somehow >> make it easier for the gnu apl folks to add something to their system >> so that their string length function sees such a sequence as having a >> length of "1"? > I don't see why that would, or should, be the determining factor. A more > robust approach for their purposes might be to teach ⍴ to exclude > combining characters (gc=Mn) when counting the "size" of a string. > > And it seems to me that that is *very* unlikely to happen, precisely because ⍴ is so deeply embedded in the array and vector logic of APL. That is counting the data size of arrays of "characters" (i.e., code units). If somebody tried to somehow teach ⍴ to do something different about characters, changing the concept of array of code units into something more akin to what we think of as Unicode strings, that would end up being a *different* language -- not APL! --Ken From doug at ewellic.org Tue Aug 18 11:45:17 2015 From: doug at ewellic.org (Doug Ewell) Date: Tue, 18 Aug 2015 09:45:17 -0700 Subject: APL Under-bar Characters Message-ID: <20150818094517.665a7a7059d7ee80bb4d670165c8327d.2f37adb81c.wbe@email03.secureserver.net> Ken Whistler wrote: >> A more >> robust approach for their purposes might be to teach ⍴ to exclude >> combining characters (gc=Mn) when counting the "size" of a string. > > And it seems to me that that is *very* unlikely to happen, precisely > because ⍴ is so deeply embedded in the array and vector logic of APL. 
> > That is counting the data size of arrays of "characters" (i.e., code > units). If somebody tried to somehow teach ⍴ to do something different > about characters, changing the concept of array of code units into > something more akin to what we think of as Unicode strings, that > would end up being a *different* language -- not APL! Then we're back to the central point that Alex Weiner originally expressed, in arguing for the encoding of precomposed letters with underbar: > The string length functionality would view an 'A' code point combined > with an '_' code point as an item that has two elements, while > something that looks like 'A' Should be atomic, and return a length > of one. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From kenwhistler at att.net Tue Aug 18 12:13:15 2015 From: kenwhistler at att.net (Ken Whistler) Date: Tue, 18 Aug 2015 10:13:15 -0700 Subject: APL Under-bar Characters In-Reply-To: <20150818094517.665a7a7059d7ee80bb4d670165c8327d.2f37adb81c.wbe@email03.secureserver.net> References: <20150818094517.665a7a7059d7ee80bb4d670165c8327d.2f37adb81c.wbe@email03.secureserver.net> Message-ID: <55D367AB.40809@att.net> On 8/18/2015 9:45 AM, Doug Ewell wrote: > Ken Whistler wrote: > > Then we're back to the central point that Alex Weiner originally > expressed, in arguing for the encoding of precomposed letters with > underbar: >> The string length functionality would view an 'A' code point combined >> with an '_' code point as an item that has two elements, while >> something that looks like 'A' Should be atomic, and return a length >> of one. > Precisely. And instead of pushing for the impossible, the correct solution here involves dividing and conquering: 1. 
If the issue is just the *presentation* of legacy APL materials showing the traditional IBM uppercase italic letters with underscores, then fix some fonts, use the combining character sequences (or styling, makes no matter), and edit away with existing characters, and with no implications for APL implementations. 2. If the issue is *augmentation* of APL implementations to have an additional A-Z set of character symbols, beyond the upper- and lowercase ones apparently supported by most APL fonts and implementations, then pick one of the existing, encoded, mathematical alphabets and have done with it. There are 13 to choose from! The sans-serif italic set might make a nice choice. And for the cherry on top, in the APL fonts, draw a non-connecting underline beneath your 26 new letters to please traditionalists. The reason to do #2 is that the implementations of APL, because of the very nature of the language, need their "characters" to have a fixed size, so that each element of a data array of "characters" is exactly one "character". The oopsie for #2, of course, is that if your APL implementation is actually using 16-bit code *units* for your characters, it is still stuck in a UCS-2 world, and can't handle UTF-16, because that once again breaks the ironclad rule that 1 "character" equals one data element in the array. The fix for the oopsie is to upgrade the APL implementations to UTF-32. At that point, the supplementary character problem goes away, and APL could freely augment its sets of A-Z symbols with the mathematical alphanumeric symbols without further ado. What people should *not* be doing is insisting on being stuck in 1970, as if everybody were still doing APL with IBM Selectric typewriter terminals hooked up to IBM/360 mainframes using an EBCDIC APL character set, and that everything in the APL program text has to look precisely the way it did in 1970. 
--Ken From miszhan3ys at gmail.com Tue Aug 18 18:20:01 2015 From: miszhan3ys at gmail.com (Emma Haneys) Date: Wed, 19 Aug 2015 07:20:01 +0800 Subject: a suggestion new emoji . Message-ID: hello dear unicode , i just wondering if i can suggest a new emoji . hoppefully you can respone to me . i suggest one and only for fruit category . it is a durian . thanx From mark at kli.org Tue Aug 18 19:53:22 2015 From: mark at kli.org (Mark E. Shoulson) Date: Tue, 18 Aug 2015 20:53:22 -0400 Subject: a suggestion new emoji . In-Reply-To: References: Message-ID: <55D3D382.60501@kli.org> On 08/18/2015 07:20 PM, Emma Haneys wrote: > hello dear unicode , i just wondering if i can suggest a new emoji . > hoppefully you can respone to me . i suggest one and only for fruit > category . it is a durian . thanx Ah, durians. Kind of a cross between food and weaponry. ~mark From Shawn.Steele at microsoft.com Tue Aug 18 20:13:44 2015 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Wed, 19 Aug 2015 01:13:44 +0000 Subject: a suggestion new emoji . In-Reply-To: <55D3D382.60501@kli.org> References: <55D3D382.60501@kli.org> Message-ID: I'm sure Klingons love them! -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Mark E. Shoulson Sent: Tuesday, August 18, 2015 5:53 PM To: unicode at unicode.org Subject: Re: a suggestion new emoji . On 08/18/2015 07:20 PM, Emma Haneys wrote: > hello dear unicode , i just wondering if i can suggest a new emoji . > hoppefully you can respone to me . i suggest one and only for fruit > category . it is a durian . thanx Ah, durians. Kind of a cross between food and weaponry. ~mark From nikiselken at gmail.com Tue Aug 18 20:26:30 2015 From: nikiselken at gmail.com (Niki Selken) Date: Tue, 18 Aug 2015 18:26:30 -0700 Subject: a suggestion new emoji . In-Reply-To: References: <55D3D382.60501@kli.org> Message-ID: <9BA50979-8626-42C4-99FE-84720E257FB2@gmail.com> Touché! ?? 
Thanks, Niki Excuse my spelling, this is sent from my iPhone > On Aug 18, 2015, at 6:13 PM, Shawn Steele wrote: > > I'm sure Klingons love them! > > -----Original Message----- > From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Mark E. Shoulson > Sent: Tuesday, August 18, 2015 5:53 PM > To: unicode at unicode.org > Subject: Re: a suggestion new emoji . > >> On 08/18/2015 07:20 PM, Emma Haneys wrote: >> hello dear unicode , i just wondering if i can suggest a new emoji . >> hoppefully you can respone to me . i suggest one and only for fruit >> category . it is a durian . thanx > Ah, durians. Kind of a cross between food and weaponry. > > ~mark > From otto.stolz at uni-konstanz.de Wed Aug 19 06:36:46 2015 From: otto.stolz at uni-konstanz.de (Otto Stolz) Date: Wed, 19 Aug 2015 13:36:46 +0200 Subject: a suggestion new emoji . In-Reply-To: References: Message-ID: <55D46A4E.7090501@uni-konstanz.de> Hello Emma Haneys, On 19.08.2015 at 01:20, Emma Haneys wrote: > i just wondering if i can suggest a new emoji . > hoppefully you can respone to me . So far, you have only received derisive responses from the Unicode discussion list. This is because you have not understood how suggestions for Unicode characters work. Please read "http://www.unicode.org/faq/", in particular "http://www.unicode.org/faq/char_proposal.html". > i suggest one and only for fruit > category . it is a durian . You cannot suggest a new character just because it would be "nice to have". Rather, you have to supply evidence that an additional character really needs to be encoded, e. g. because it is already widely used in print and cannot be represented in Unicode. Best wishes, Otto Stolz From andrewcwest at gmail.com Wed Aug 19 07:23:22 2015 From: andrewcwest at gmail.com (Andrew West) Date: Wed, 19 Aug 2015 13:23:22 +0100 Subject: a suggestion new emoji . 
In-Reply-To: <55D46A4E.7090501@uni-konstanz.de> References: <55D46A4E.7090501@uni-konstanz.de> Message-ID: On 19 August 2015 at 12:36, Otto Stolz wrote: > > You cannot suggest a new character just because it would > be "nice to have". Rather, you have to supply evidence that > an additional character really needs to be encoded, e. g. > because it is already widely used in print and cannot be > represented in Unicode. Well that may once have been the case, but certainly isn't any longer with respect to emoji, especially emoji representing food and drink. I suggest Emma reads Unicode Technical Report 51 http://unicode.org/reports/tr51/ especially section 1.2 Encoding Considerations and Annex C Selection Factors, then start a petition to the Unicode Consortium on www.change.org, and when she has 10,000 signatures make a formal request to the UTC. Petitions don't guarantee acceptance, but widely-petitioned emoji such as taco, cheese wedge, paella and whisky tumbler have been successful. Andrew From mark at macchiato.com Wed Aug 19 09:19:27 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Wed, 19 Aug 2015 16:19:27 +0200 Subject: a suggestion new emoji . In-Reply-To: References: <55D46A4E.7090501@uni-konstanz.de> Message-ID: I'd agree about reading and following http://unicode.org/reports/tr51/#Selection_Factors. As far as petitions go, we take them with a sizable grain of salt. See http://unicode.org/reports/tr51/#Selection_Factors_Requested. In the particular cases you cite, we had sufficient evidence about prospective usage independent of petitions (which usually started after we had settled on the character anyway). Paella was a bit of an exception; I think the work that the petitioners did upfront helped to convince the subcommittee that there would be sufficient usage, and the main issues were around distinctiveness and generality. Mark *— Il meglio è 
l'inimico del bene —* On Wed, Aug 19, 2015 at 2:23 PM, Andrew West wrote: > On 19 August 2015 at 12:36, Otto Stolz wrote: > > > > You cannot suggest a new character just because it would > > be "nice to have". Rather, you have to supply evidence that > > an additional character really needs to be encoded, e. g. > > because it is already widely used in print and cannot be > > represented in Unicode. > > Well that may once have been the case, but certainly isn't any longer > with respect to emoji, especially emoji representing food and drink. > > I suggest Emma reads Unicode Technical Report 51 > http://unicode.org/reports/tr51/ especially section 1.2 Encoding > Considerations and Annex C Selection Factors, then start a petition to > the Unicode Consortium on www.change.org, and when she has 10,000 > signatures make a formal request to the UTC. Petitions don't > guarantee acceptance, but widely-petitioned emoji such as taco, cheese > wedge, paella and whisky tumbler have been successful. > > Andrew > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Wed Aug 19 07:55:23 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Wed, 19 Aug 2015 13:55:23 +0100 (BST) Subject: a suggestion new emoji . In-Reply-To: <55D46A4E.7090501@uni-konstanz.de> References: <55D46A4E.7090501@uni-konstanz.de> Message-ID: <3524460.27958.1439988923889.JavaMail.defaultUser@defaultHost> Otto Stolz wrote: > You cannot suggest a new character just because it would be "nice to have". Why not? The UNICORN FACE character has been encoded into regular Unicode following such a suggestion. I sent in a suggestion for an OKAPI to be encoded and was informed that my suggestion would be added to "the pile" of suggestions. It would be interesting to read what is listed in "the pile". It might be an interesting social history document of present times. 
Not just a list of items each suggested by many people, but a list of everything that has been suggested. Suppose that someone suggests encoding LOCOMOTIVE MALLARD WITH TENDER LIVERIED AS PRESERVED SEEN IN A DIRECT SIDEWAYS VIEW FROM HER LEFT with a glyph five times as wide as high. I remember that there was a trains font. Suppose that a mobile telephone manufacturer included in a mobile telephone a collection of characters one each for a number of specific steam locomotives. Returning to Emma's suggestion. I suggest to Emma the contacting of Unicode Inc. using the following form. http://www.unicode.org/reporting.html Hopefully Emma will receive an official reply and her suggestion will become added to "the pile". What being added to "the pile" presently means, or what it may become to mean, I do not know. Yet I suggest that making the suggestion on that form would be a potentially useful thing to do. I hope that Emma's suggestion is successful. William Overington 19 August 2015 From fantasai.lists at inkedblade.net Wed Aug 19 11:21:36 2015 From: fantasai.lists at inkedblade.net (fantasai) Date: Wed, 19 Aug 2015 09:21:36 -0700 Subject: Why doesn't Ideographic (ID) in UAX#14 have half-width katakana? In-Reply-To: <4A7F11C4-2F5B-4156-844C-C21E543FCFC7@apple.com> References: <553EEE6D.2020004@ga2.so-net.ne.jp> <553EFB2E.3010808@hiroshima-u.ac.jp> <55403267.9060202@att.net> <55438AFB.6020000@att.net> <55439AE4.4020109@hiroshima-u.ac.jp> <5543AE4B.5020904@att.net> <5543D03B.80603@ix.netcom.com> <55467CAF.4080401@ix.netcom.com> <4A7F11C4-2F5B-4156-844C-C21E543FCFC7@apple.com> Message-ID: <55D4AD10.1080104@inkedblade.net> On 05/04/2015 02:19 PM, Peter Edberg wrote: > I have been checking with various groups at Apple. The consensus here is that we would like to see the linebreak value for > halfwidth katakana changed to ID. Do we want all halfwidth kana changed to ID, or should there be some exception for the voicing marks (U+FF9E, U+FF9F) to forbid breaks before? 
~fantasai From kenwhistler at att.net Wed Aug 19 11:45:55 2015 From: kenwhistler at att.net (Ken Whistler) Date: Wed, 19 Aug 2015 09:45:55 -0700 Subject: Why doesn't Ideographic (ID) in UAX#14 have half-width katakana? In-Reply-To: <55D4AD10.1080104@inkedblade.net> References: <553EEE6D.2020004@ga2.so-net.ne.jp> <553EFB2E.3010808@hiroshima-u.ac.jp> <55403267.9060202@att.net> <55438AFB.6020000@att.net> <55439AE4.4020109@hiroshima-u.ac.jp> <5543AE4B.5020904@att.net> <5543D03B.80603@ix.netcom.com> <55467CAF.4080401@ix.netcom.com> <4A7F11C4-2F5B-4156-844C-C21E543FCFC7@apple.com> <55D4AD10.1080104@inkedblade.net> Message-ID: <55D4B2C3.2010000@att.net> I don't think that is the issue. U+FF9E/F are already lb=NS, which prevents line breaks before. The issue is instead loosening up the lb class for the halfwidth katakana syllables (from lb=AL to lb=ID), so that they *can* break the way the regular katakana syllables do. --Ken On 8/19/2015 9:21 AM, fantasai wrote: > On 05/04/2015 02:19 PM, Peter Edberg wrote: >> I have been checking with various groups at Apple. The consensus here >> is that we would like to see the linebreak value for >> halfwidth katakana changed to ID. > > Do we want all halfwidth kana changed to ID, or should there > be some exception for the voicing marks (U+FF9E, U+FF9F) to > forbid breaks before? > > ~fantasai > From wjgo_10009 at btinternet.com Wed Aug 19 11:38:56 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Wed, 19 Aug 2015 17:38:56 +0100 (BST) Subject: a suggestion new emoji . In-Reply-To: References: <55D46A4E.7090501@uni-konstanz.de> Message-ID: <4356528.46093.1440002336411.JavaMail.defaultUser@defaultHost> Andrew West wrote: > ..., and when she has 10,000 signatures make a formal request to the UTC. Where does the figure of 10,000 for the number of signatures come from please? That is a lot of people! What exactly are the rules? Does anybody really know these days! 
By comparison it needs ten signatures to stand in a United Kingdom Parliamentary Election. William Overington 19 August 2015 From wjgo_10009 at btinternet.com Wed Aug 19 12:10:40 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Wed, 19 Aug 2015 18:10:40 +0100 (BST) Subject: a suggestion new emoji . In-Reply-To: References: <55D46A4E.7090501@uni-konstanz.de> Message-ID: <8509498.48370.1440004240920.JavaMail.defaultUser@defaultHost> Mark Davis wrote: > As far as petitions go, we take them with a sizable grain of salt. Who, exactly, precisely, is "we" please? William Overington 19 August 2015 -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Wed Aug 19 13:22:59 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Wed, 19 Aug 2015 20:22:59 +0200 (CEST) Subject: a suggestion new emoji . In-Reply-To: References: Message-ID: <1743506480.18484.1440008579130.JavaMail.www@wwinf1j04> On 19 Aug 2015 at 01:45, Emma Haneys wrote: > i suggest one and only for fruit category . it is a durian . Emma, at small sizes, and especially in monochrome rendering, the glyph of a durian emoji would resemble closely to the glyph of an eventual lychee emoji. You should read also the detailed explanations from Dr.?Freytag on this mailing list, in the previous most recent emoji thread: http://www.unicode.org/mail-arch/unicode-ml/y2015-m08/0014.html All the best, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Wed Aug 19 13:40:12 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Wed, 19 Aug 2015 20:40:12 +0200 (CEST) Subject: a suggestion new emoji . 
In-Reply-To: <3524460.27958.1439988923889.JavaMail.defaultUser@defaultHost> References: <55D46A4E.7090501@uni-konstanz.de> <3524460.27958.1439988923889.JavaMail.defaultUser@defaultHost> Message-ID: <345852892.18984.1440009612606.JavaMail.www@wwinf1j04> On 19 Aug 2015 at 17:18, William_J_G Overington wrote: > Suppose that someone suggests encoding LOCOMOTIVE MALLARD WITH TENDER LIVERIED AS PRESERVED SEEN IN A DIRECT SIDEWAYS VIEW FROM HER LEFT with a glyph five times as wide as high. Do not forget that to be encoded in Unicode, an emoji must be highly iconic, Dr.?Freytag explained on 03?Aug?2015 at?12:38: http://www.unicode.org/mail-arch/unicode-ml/y2015-m08/0014.html So for *all* steam locomotives, we cannot have more emojis than the one we have at U+1F682 STEAM LOCOMOTIVE. This emoji's signification is polysemic and stays at least for all steam locomotives of any period and manufacturer, as well as for touristic oldtimer railway lines and stations. > I remember that there was a trains font. To implement the trains font in Unicode (or to implement Unicode in the trains font, I don't know well which way it goes round), the best would be to use the Private Use Area, as Mr. Kolehmainen recommended lastly for another purpose: http://www.unicode.org/mail-arch/unicode-ml/y2015-m08/0152.html Using the Contact form is a very good advice. This would have saved Emma from the sarcastic comments that came first in thread. All the best, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From andrewcwest at gmail.com Wed Aug 19 13:45:53 2015 From: andrewcwest at gmail.com (Andrew West) Date: Wed, 19 Aug 2015 19:45:53 +0100 Subject: a suggestion new emoji . 
In-Reply-To: <1743506480.18484.1440008579130.JavaMail.www@wwinf1j04> References: <1743506480.18484.1440008579130.JavaMail.www@wwinf1j04> Message-ID: On 19 August 2015 at 19:22, Marcel Schneider wrote: > > On 19 Aug 2015 at 01:45, Emma Haneys wrote: > > > i suggest one and only for fruit category . it is a durian . > > Emma, at small sizes, and especially in monochrome rendering, the glyph of a durian emoji would resemble closely to the glyph of an eventual lychee emoji. I don't know, I think durian emoji would be quite distinctive, as shown in the examples on this page (I am rather taken with the sad durian which gets no hugs). Andrew From andrewcwest at gmail.com Wed Aug 19 13:48:17 2015 From: andrewcwest at gmail.com (Andrew West) Date: Wed, 19 Aug 2015 19:48:17 +0100 Subject: a suggestion new emoji . In-Reply-To: References: <1743506480.18484.1440008579130.JavaMail.www@wwinf1j04> Message-ID: On 19 August 2015 at 19:45, Andrew West wrote: > On 19 August 2015 at 19:22, Marcel Schneider wrote: >> >> On 19 Aug 2015 at 01:45, Emma Haneys wrote: >> >> > i suggest one and only for fruit category . it is a durian . >> >> Emma, at small sizes, and especially in monochrome rendering, the glyph of a durian emoji would resemble closely to the glyph of an eventual lychee emoji. > > I don't know, I think durian emoji would be quite distinctive, as > shown in the examples on this page (I am rather taken with the sad > durian which gets no hugs). Sorry, this page: http://www.cafepress.co.uk/+durian+stickers Andrew From charupdate at orange.fr Wed Aug 19 13:49:39 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Wed, 19 Aug 2015 20:49:39 +0200 (CEST) Subject: a suggestion new emoji . 
In-Reply-To: References: <1743506480.18484.1440008579130.JavaMail.www@wwinf1j04> Message-ID: <62460733.18847.1440010179831.JavaMail.www@wwinf1e33> On 19 Aug 2015 at 20:45, Andrew West wrote: > I don't know, I think durian emoji would be quite distinctive, as > shown in the examples on this page (I am rather taken with the sad > durian which gets no hugs). What page do you refer to, the hyperlink has got lost, please. Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Wed Aug 19 13:59:03 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Wed, 19 Aug 2015 20:59:03 +0200 (CEST) Subject: a suggestion new emoji . Message-ID: <398300494.19072.1440010743779.JavaMail.www@wwinf1e33> On 19 Aug 2015 at 17:18, William_J_G Overington wrote: > I suggest to Emma the contacting of Unicode Inc. using the following form. > > http://www.unicode.org/reporting.html William is right. I strongly recommend you to first use the Contact form, as I did from the beginning on, long before e-mailing to the List. Using the Contact form you will always get a good answer (not always a *positive* response, but always a *good* answer). Just ignore the arrogant bullying that has come first in thread! Best wishes, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Wed Aug 19 14:06:58 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Wed, 19 Aug 2015 21:06:58 +0200 (CEST) Subject: a suggestion new emoji . In-Reply-To: References: <1743506480.18484.1440008579130.JavaMail.www@wwinf1j04> Message-ID: <1360399199.19324.1440011218382.JavaMail.www@wwinf1e33> On 19 Aug 2015 at 20:48, Andrew West wrote: > On 19 August 2015 at 19:45, Andrew West wrote: > > On 19 August 2015 at 19:22, Marcel Schneider wrote: > >> > >> On 19 Aug 2015 at 01:45, Emma Haneys wrote: > >> > >> > i suggest one and only for fruit category . it is a durian . 
> >> > >> Emma, at small sizes, and especially in monochrome rendering, the glyph of a durian emoji would resemble closely to the glyph of an eventual lychee emoji. > > > > I don't know, I think durian emoji would be quite distinctive, as > > shown in the examples on this page (I am rather taken with the sad > > durian which gets no hugs). > > Sorry, this page: > > http://www.cafepress.co.uk/+durian+stickers That's nice. I see, durians have sharp tips and are represented in a corresponding way, whereas lychees have round tips. Well seen! Marcel? -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Wed Aug 19 14:48:38 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Wed, 19 Aug 2015 21:48:38 +0200 (CEST) Subject: a suggestion new emoji . In-Reply-To: <1360399199.19324.1440011218382.JavaMail.www@wwinf1e33> References: <1743506480.18484.1440008579130.JavaMail.www@wwinf1j04> <1360399199.19324.1440011218382.JavaMail.www@wwinf1e33> Message-ID: <1456355655.15227.1440013718592.JavaMail.www@wwinf1d04> On 19 Aug 2015 at 20:48, Andrew West wrote: > On 19 August 2015 at 19:45, Andrew West wrote: > > On 19 August 2015 at 19:22, Marcel Schneider wrote: > >> > >> On 19 Aug 2015 at 01:45, Emma Haneys wrote: > >> > >> > i suggest one and only for fruit category . it is a durian . > >> > >> Emma, at small sizes, and especially in monochrome rendering, the glyph of a durian emoji would resemble closely to the glyph of an eventual lychee emoji. > > > > I don't know, I think durian emoji would be quite distinctive, as > > shown in the examples on this page (I am rather taken with the sad > > durian which gets no hugs). > > Sorry, this page: > > http://www.cafepress.co.uk/+durian+stickers To console the sad durian, Unicode should encode the durian! This could free the sad one from death thoughts (I've noticed the little skull). I hope that Emma's petition will become successful! Marcel? 
From richard.wordingham at ntlworld.com Wed Aug 19 20:19:39 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 20 Aug 2015 02:19:39 +0100 Subject: Custom keyboard source samples (was: Re: Windows keyboard restrictions) In-Reply-To: <749936537.8372.1439812286240.JavaMail.www@wwinf1j14> References: <749936537.8372.1439812286240.JavaMail.www@wwinf1j14> Message-ID: <20150820021939.64532cc7@JRWUBU2> On Mon, 17 Aug 2015 13:51:26 +0200 (CEST) Marcel Schneider wrote: > Since yesterday I know a very simple way to get the source code (in > C) of any MSKLC layout. Is this legal? To me it smacks of reverse engineering, which is prohibited under the MSKLC licence. Richard. From mark at kli.org Wed Aug 19 21:18:37 2015 From: mark at kli.org (Mark E. Shoulson) Date: Wed, 19 Aug 2015 22:18:37 -0400 Subject: a suggestion new emoji . In-Reply-To: <8509498.48370.1440004240920.JavaMail.defaultUser@defaultHost> References: <55D46A4E.7090501@uni-konstanz.de> <8509498.48370.1440004240920.JavaMail.defaultUser@defaultHost> Message-ID: <55D538FD.8010103@kli.org> And is there an emoji for GRAIN OF SALT? (Actually, that could almost be useful... or even just a geometric CUBE...) ~mark On 08/19/2015 01:10 PM, William_J_G Overington wrote: > > Mark Davis wrote: > > > As far as petitions go, we take them with a sizable grain of salt. > > Who, exactly, precisely, is "we" please? > > William Overington > > 19 August 2015 > > > From miszhan3ys at gmail.com Wed Aug 19 21:18:45 2015 From: miszhan3ys at gmail.com (Emma Haneys) Date: Thu, 20 Aug 2015 10:18:45 +0800 Subject: a suggestion new emoji . In-Reply-To: <3524460.27958.1439988923889.JavaMail.defaultUser@defaultHost> References: <55D46A4E.7090501@uni-konstanz.de> <3524460.27958.1439988923889.JavaMail.defaultUser@defaultHost> Message-ID: thanx all for responding . and i 'ved sent the suggestion to the right place . thanx for info ?? 
On Aug 19, 2015 8:55 PM, "William_J_G Overington" wrote: > Otto Stolz wrote: > > > You cannot suggest a new character just because it would be ?nice to > have?. > > Why not? > > The UNICORN FACE character has been encoded into regular Unicode following > such a suggestion. > > I sent in a suggestion for an OKAPI to be encoded and was informed that my > suggestion would be added to "the pile" of suggestions. > > It would be interesting to read what is listed in "the pile". > > It might be an interesting social history document of present times. > > Not just a list of items each suggested by many people, but a list of > everything that has been suggested. > > Suppose that someone suggests encoding LOCOMOTIVE MALLARD WITH TENDER > LIVERIED AS PRESERVED SEEN IN A DIRECT SIDEWAYS VIEW FROM HER LEFT with a > glyph five times as wide as high. > > I remember that there was a trains font. > > Suppose that a mobile telephone manufacturer included in a mobile > telephone a collection of characters one each for a number of specific > steam locomotives. > > Returning to Emma's suggestion. > > I suggest to Emma the contacting of Unicode Inc. using the following form. > > http://www.unicode.org/reporting.html > > Hopefully Emma will receive an official reply and her suggestion will > become added to "the pile". > > What being added to "the pile" presently means, or what it may become to > mean, I do not know. > > Yet I suggest that making the suggestion on that form would be a > potentially useful thing to do. > > I hope that Emma's suggestion is successful. > > William Overington > > 19 August 2015 > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kojiishi at gmail.com Thu Aug 20 01:18:06 2015 From: kojiishi at gmail.com (Koji Ishii) Date: Thu, 20 Aug 2015 15:18:06 +0900 Subject: Why doesn't Ideographic (ID) in UAX#14 have half-width katakana? 
In-Reply-To: <55D4B2C3.2010000@att.net> References: <553EEE6D.2020004@ga2.so-net.ne.jp> <553EFB2E.3010808@hiroshima-u.ac.jp> <55403267.9060202@att.net> <55438AFB.6020000@att.net> <55439AE4.4020109@hiroshima-u.ac.jp> <5543AE4B.5020904@att.net> <5543D03B.80603@ix.netcom.com> <55467CAF.4080401@ix.netcom.com> <4A7F11C4-2F5B-4156-844C-C21E543FCFC7@apple.com> <55D4AD10.1080104@inkedblade.net> <55D4B2C3.2010000@att.net> Message-ID: Right, this should be applied to only where currently AL. The basic idea is that, full width is a concept to use a character in an "imported" manner and thus different characteristics are applied, while half width is a concept of saving screen real estate and/or for legacy cultural usages so the characteristics should be the same as its full width counterpart, except the width. Roozbeh, thank you for the date, I'll work by then. /koji On Thu, Aug 20, 2015 at 1:45 AM, Ken Whistler wrote: > I don't think that is the issue. U+FF9E/F are already lb=NS, which prevents > line breaks before. The issue is instead loosening up the lb class for > the halfwidth katakana syllables (from lb=AL to lb=ID), so that they *can* > break the way the regular katakana syllables do. > > --Ken > > > On 8/19/2015 9:21 AM, fantasai wrote: > >> On 05/04/2015 02:19 PM, Peter Edberg wrote: >> >>> I have been checking with various groups at Apple. The consensus here is >>> that we would like to see the linebreak value for >>> halfwidth katakana changed to ID. >>> >> >> Do we want all halfwidth kana changed to ID, or should there >> be some exception for the voicing marks (U+FF9E, U+FF9F) to >> forbid breaks before? >> >> ~fantasai >> >> > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From charupdate at orange.fr Thu Aug 20 10:30:38 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Thu, 20 Aug 2015 17:30:38 +0200 (CEST) Subject: Custom keyboard source samples (was: Re: Windows keyboard restrictions) Message-ID: <1408769184.13369.1440084638814.JavaMail.www@wwinf1d11> On 20 Aug 2015 at 03:19, Richard Wordingham wrote: > On Mon, 17 Aug 2015 13:51:26 +0200 (CEST) > Marcel Schneider wrote: >> Since yesterday I know a very simple way to get the source code (in >> C) of any MSKLC layout. > Is this legal? To me it smacks of reverse engineering, which is > prohibited under the MSKLC licence. When I saw your question, I felt somebody at Microsoft would be most qualified to answer it (all the more as I'm not an addressee). But the point here is that we can answer it by ourselves, because the keyboard drivers are not covered by the MSKLC licence. The licensed software is the MSKLC folder. So it *is* legal. Let's look at the details, however: "You may not: work around any technical limitations in the software; reverse engineer, decompile or disassemble the software, except and only to the extent that applicable law expressly permits, despite this limitation; make more copies of the software than specified in this agreement or allowed by applicable law, despite this limitation; publish the software for others to copy; rent, lease or lend the software; transfer the software or this agreement to any third party; use the software for commercial software hosting services; or use the software for the sole purpose of repackaging a Microsoft provided keyboard layout to offer as a stand-alone commercial product for which you charge a fee." Do I "work around any technical limitations in the software" by picking up the source code of the drivers it generates? This is my main concern about this practice. Are we allowed to use files generated by MSKLC that are not expressly provided to the user? 
Further, are we allowed to use installation packages generated by MSKLC to install other keyboard drivers than those generated by MSKLC? To install keyboard drivers that exceed the limitations of MSKLC? The questioning becomes even more troublesome when we remember that the WDK is mentioned in the MSKLC Help, and ask: When we accept the invitation to switch towards the WDK, must we package the drivers with the resources the driver kit comes along with (while not knowing how to write an INF file!), or may we use the MSI and setup from MSKLC? BTW we may wonder why and how MSKLC compiles a Windows-On-Windows driver, while except for a few sparse mentions, nothing seems to be provided for WOW in the WDK. Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Thu Aug 20 10:33:30 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Thu, 20 Aug 2015 17:33:30 +0200 (CEST) Subject: Custom keyboard source samples (was: Re: Windows keyboard restrictions) Message-ID: <133672282.13416.1440084810705.JavaMail.www@wwinf1d11> On 18 Aug 2015 at 10:09, Philippe Verdy wrote: > i don't know why these c source files need to be deleted so fast when they could just remain in the same folder as the saved .klc file. I missed the point when I replied on 18 Aug 2015. In fact, there's no short answer (which BTW would be "There's no use of 'em"). The point is: Why (the heck) are the C source folders stored in the hidden AppData Temp directory instead of appearing in the most straightforward place? (Which is, as Philippe notes, the same folder as the saved .klc file.) We could even extend and ask: Why is there no option "Keep the C sources" / "Delete the C sources"? Why are there no menu items "Generate C source" and "Build from C source", or an option "Build from KLC source" / "Build from C source"? That's what I wished to find in MSKLC when I learned about it. 
Just imagine: before, I had not even imagined that such sources could ever exist. No, we must not disturb the author of MSKLC; we can answer for ourselves. And then we'll probably fall back on what I wrote the day before yesterday. Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Sat Aug 22 07:21:20 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Sat, 22 Aug 2015 14:21:20 +0200 (CEST) Subject: a suggestion new emoji . Message-ID: <323520393.11425.1440246080996.JavaMail.www@wwinf1f34> On 19 Aug 2015 at 20:59, I wrote: > On 19 Aug 2015 at 17:18, William_J_G Overington wrote: >> I suggest to Emma the contacting of Unicode Inc. using the following form. >> http://www.unicode.org/reporting.html > William is right. I strongly recommend you to first use the Contact form, as I did from the beginning on, long before e-mailing to the List. > Using the Contact form you will always get a good answer (not always a *positive* response, but always a *good* answer). I forgot to add that of course you are always welcome on the Mailing List, where you equally get good answers. But you need to be patient, as the best answers come naturally last in a thread, as occurred just six hours before you posted. On 19 Aug 2015 at 20:48, Andrew West wrote: > > > > I don't know, I think durian emoji would be quite distinctive, as > > shown in the examples on this page (I am rather taken with the sad > > durian which gets no hugs). > http://www.cafepress.co.uk/+durian+stickers For a fruit, a vegetable, a cereal, a plant, an animal, having its emoji encoded in Unicode is like a big hug! So we thank Mrs Haneys for having suggested the DURIAN emoji! Marcel -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From richard.wordingham at ntlworld.com Sat Aug 22 08:35:30 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 22 Aug 2015 14:35:30 +0100 Subject: Thai Word Breaking Message-ID: <20150822143530.29f1e883@JRWUBU2> I'm trying to work out the meaning of TUS 8.0 Section 23.2. To do Thai word breaking properly, one needs to do a semantic analysis of the text to do the equivalent of resolving the equivalent of 'humanevents' into 'human events' rather than 'humane vents'. One also needs to cope with unknown and misspelt words. (A lot of effort has been devoted to avoid going to the extreme of doing semantic analysis.) However, it is possible to read Section 23.2 as prohibiting the use of certain information, and I would like to check whether this is the intended meaning. The opening paragraph seems clear enough on first reading: "The effect of layout controls is specific to particular text processes. As much as possible, lay-out controls are transparent to those text processes for which they were not intended. In other words, their effects are mutually orthogonal." However, my first question is, "Are paragraph boundaries directly admissible as evidence for or against word boundaries not adjacent to them?". For example, most Thai word breakers would not regard a paragraph boundary as any more significant than a phrase-delimiting space. However, a paragraph boundary often indicates a change of topic. My second question is, "Are line breaks admissible as evidence for or against word boundaries not adjacent to them?" For example, if a phrase makes heavy use of U+200B ZERO WIDTH SPACE (ZWSP), one may deduce that it is likely that all word boundaries within it are marked explicitly. This example is more useful for Khmer than to Thai, for whereas Cambodians were once taught to mark word boundaries, Thais rarely use ZWSP to mark word boundaries. 
My third question is, "Is the absence of a line break opportunity admissible as evidence for or against a word boundary?". Here I see conflicting signals. There is a character U+2060 WORD JOINER (WJ) which *used* to be regarded as the counterpart of ZWSP. The understanding was that ZWSP marked a word boundary and provided a line-break opportunity, while WJ denied both. This, however, is no longer the case. To quote the TUS section about WJ: P1: (Ignored) P2S1: The word joiner must not be confused with the zero width joiner or the combining grapheme joiner, which have very different functions. P2S2: In particular, inserting a word joiner between two characters has no effect on their ligating and cursive joining behavior. P2S3: The word joiner should be ignored in contexts other than line breaking. P2S4: Note in particular that the word joiner is ignored for word segmentation. P2S5: (See Unicode Standard Annex #29, "Unicode Text Segmentation.") Paragraph 2 Sentence 3 (P2S3) appears to rule out its use in word-breaking, but perhaps it does not if line-breaking is being used as evidence for word boundaries. P2S4 has three very different interpretations: (i) This is an assertion of fact, and may therefore be incorrect. (ii) The word 'is' is sloppy wording for 'should be'. Section 23.2 contains much sloppier wording, as I have already advised members of the UTC (4 July 2015). (iii) This is a deduction from other parts of the specification. Now, if P2S4 said 'is normally ignored for word segmentation', that would have made sense, for that applies to the default word boundary specification in UAX#29. However, just before Section 4.1, UAX#29 explains that it does not specify what happens for word boundary determination in Thai! (It does constrain what happens, though.) At the end of UAX#29 Section 6.2, there is the provision, "The Ignore rules should not be overridden by tailorings, with the possible exception of remapping some of the Format characters to other classes." 
To accord with the user perceptions of Unicode-aware people who work with SE Asian scripts, I am tempted to ask for CLDR to tailor the word-breaking algorithms for the corresponding languages so that the word-breaking classes of WJ (and ZWNBSP) are changed from Format to MidLetter. That would match the widespread old *perception* that there should be no word break in a sequence <letter, WJ, letter>. However, there are several objections: (a) Perhaps P2S3 and P2S4 prohibit this. (b) If the word-break property of Thai letters falls back to Other, there would still be a word break between them. (c) If the word-break property of Thai letters fell back to ALetter, an old suggestion, WJ would have no effect on the presence of a word break. (d) If Thai word breaking assigns word-break classes to each letter (gc=Lo), then word boundaries can be suppressed by choosing the classes appropriately. Non-spacing Thai vowels are very relevant to Thai word-breaking, but formally are 'ignored'. WJ could be 'ignored' in exactly the same way. Richard. From nigel at nigelsmall.com Sat Aug 22 11:08:48 2015 From: nigel at nigelsmall.com (Nigel Small) Date: Sat, 22 Aug 2015 17:08:48 +0100 Subject: Square Brackets with Tick Message-ID: Hi all I am looking for clarification on an aspect of Unicode bracket pairing, specifically in relation to the following four characters: 298D; 2990; o # LEFT SQUARE BRACKET WITH TICK IN TOP CORNER 298E; 298F; c # RIGHT SQUARE BRACKET WITH TICK IN BOTTOM CORNER 298F; 298E; o # LEFT SQUARE BRACKET WITH TICK IN BOTTOM CORNER 2990; 298D; c # RIGHT SQUARE BRACKET WITH TICK IN TOP CORNER These stand out from all other brackets listed in *BidiBrackets.txt* due to an inconsistency in pairing. I have looked for references online on where these brackets are used in the wild as mathematical symbols but have been unable to find anything useful. All other bracket pairs are listed as opener followed by closer, sometimes with several code points in between. 
According to the code point pairs in the first and second columns of this file, these particular brackets should be paired as the *first and fourth* and the *third and second*. Intuitively however, these would actually be *first and second* and *third and fourth* if one is to expect consistency. My guess is that there are three possibilities here: 1. The current pairing information is correct and the sequence is irregular for some historical reason 2. The pairing information is wrong and the sequence is consistent with other brackets 3. Pairing can be mixed with either left bracket used as a valid opener and either right bracket used as a valid closer; in this case, the pairing information is incomplete I'd be very grateful if anyone could clarify the situation here or if anyone knows of a resource that describes where such brackets are used in practice. Many thanks Nigel Small -------------- next part -------------- An HTML attachment was scrubbed... URL: From eliz at gnu.org Sat Aug 22 11:26:25 2015 From: eliz at gnu.org (Eli Zaretskii) Date: Sat, 22 Aug 2015 19:26:25 +0300 Subject: Square Brackets with Tick In-Reply-To: References: Message-ID: <83bndzi91q.fsf@gnu.org> > From: Nigel Small > Date: Sat, 22 Aug 2015 17:08:48 +0100 > > I am looking for clarification on an aspect of Unicode bracket pairing, > specifically in relation to the following four characters: > > 298D; 2990; o # LEFT SQUARE BRACKET WITH TICK IN TOP CORNER > 298E; 298F; c # RIGHT SQUARE BRACKET WITH TICK IN BOTTOM CORNER > 298F; 298E; o # LEFT SQUARE BRACKET WITH TICK IN BOTTOM CORNER > 2990; 298D; c # RIGHT SQUARE BRACKET WITH TICK IN TOP CORNER > > These stand out from all other brackets listed in BidiBrackets.txt due to an > inconsistency in pairing. I have looked for references online on where these > brackets are used in the wild as mathematical symbols but have been unable to > find anything useful. 
> > All other bracket pairs are listed as opener followed by closer, sometimes with > several code points in between. I think the order in the file is by the codepoint in the leftmost column. All the rest is just a coincidence. But I don't speak for the Unicode Consortium, so please wait for a definitive reply. From jcb+unicode at inf.ed.ac.uk Sat Aug 22 11:35:19 2015 From: jcb+unicode at inf.ed.ac.uk (Julian Bradfield) Date: Sat, 22 Aug 2015 17:35:19 +0100 (BST) Subject: Square Brackets with Tick References: Message-ID: On 2015-08-22, Nigel Small wrote: > 298D; 2990; o # LEFT SQUARE BRACKET WITH TICK IN TOP CORNER > 298E; 298F; c # RIGHT SQUARE BRACKET WITH TICK IN BOTTOM CORNER > 298F; 298E; o # LEFT SQUARE BRACKET WITH TICK IN BOTTOM CORNER > 2990; 298D; c # RIGHT SQUARE BRACKET WITH TICK IN TOP CORNER > with several code points in between. According to the code point pairs in > the first and second columns of this file, these particular brackets should > be paired as the *first and fourth* and the *third and second*. Intuitively > however, these would actually be *first and second* and *third and fourth* > if one is to expect consistency. That's a strange intuition! Mathematical brackets are expected to pair with left-right symmetry, not rotational symmetry. As in, for example, floor and ceiling brackets. The pairing in the file is the natural one. > 1. The current pairing information is correct and the sequence is irregular > for some historical reason That will be the explanation. There is no inherent meaning to the order of codepoints, it's just convenience. One of the experts here can probably tell us why these four brackets happen to be coded in this order. -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. 
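Nigel's observation can be checked mechanically. The sketch below is a hypothetical minimal reader for the four lines he quotes (in the BidiBrackets.txt field layout), not the real Unicode data file; it tests two things: whether the pairing data is self-consistent, and whether each opener pairs with the immediately following code point.

```python
# The four entries Nigel quotes, in BidiBrackets.txt field layout:
# <code point>; <paired code point>; <o(pen)|c(lose)> # name
ENTRIES = """\
298D; 2990; o # LEFT SQUARE BRACKET WITH TICK IN TOP CORNER
298E; 298F; c # RIGHT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
298F; 298E; o # LEFT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
2990; 298D; c # RIGHT SQUARE BRACKET WITH TICK IN TOP CORNER
"""

pairs = {}
for line in ENTRIES.splitlines():
    cp, paired, kind = (f.strip() for f in line.split("#")[0].split(";"))
    pairs[int(cp, 16)] = (int(paired, 16), kind)

# The data is at least self-consistent: A pairs with B iff B pairs with A,
# and every opener pairs with a closer.
for cp, (paired, kind) in pairs.items():
    assert pairs[paired][0] == cp
    assert pairs[paired][1] != kind

# But no opener pairs with the very next code point, which is the
# "first and second / third and fourth" intuition Nigel describes.
adjacent = {hex(cp): paired == cp + 1
            for cp, (paired, kind) in pairs.items() if kind == "o"}
print(adjacent)  # {'0x298d': False, '0x298f': False}
```

So the quoted data pairs the first with the fourth and the third with the second, exactly as Nigel says, while remaining internally symmetric.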
From asmus-inc at ix.netcom.com Sat Aug 22 12:32:45 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Sat, 22 Aug 2015 10:32:45 -0700 Subject: Square Brackets with Tick In-Reply-To: References: Message-ID: <55D8B23D.6070405@ix.netcom.com> An HTML attachment was scrubbed... URL: From public at khwilliamson.com Sat Aug 22 15:08:14 2015 From: public at khwilliamson.com (Karl Williamson) Date: Sat, 22 Aug 2015 14:08:14 -0600 Subject: \b{wb} Message-ID: <55D8D6AE.8080008@khwilliamson.com> The concept of \b in a regular expression meaning to match the boundary between a word and non-word was invented by Larry Wall, for the Perl programming language. This was before Unicode, and a word was defined as alphanumerics plus the underscore, which fit well with how identifiers in that computer language (and many others) were defined. Essentially \b is defined to break between runs of word characters versus runs of non-word characters. The latest version of Perl 5 (recently released) has added \b{w} based on Unicode's definition. The typical expectation of its programmers is that it would be a drop-in replacement for the old \b, with much better results in parsing natural languages. But it isn't such a replacement, creating some consternation, and the main reason is that, unlike \b, it treats the boundary between white space characters as a breaking opportunity, so that it doesn't create runs of them. Thus if you have two spaces after a full stop, it treats each as an individual word. My question is "Was this intentional, and if so, Why?" TR18 says \b{w} is a "Zero-width match at a Unicode word boundary. Note that this is different than \b alone, which corresponds to \w and \W." And UAX29 says "adjacent spaces are collapsed to a single space" in intelligent cut and paste using the WB property. 
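Karl's point about runs can be seen with classic \b alone. A minimal sketch using Python's stdlib `re` module (which implements the old alphanumerics-plus-underscore definition, not the UAX#29 one; zero-width splitting requires Python 3.7+):

```python
import re

text = "End.  Start"  # two spaces after the full stop

# Classic \b matches only at \w/\W transitions, so splitting on it keeps
# the ".  " between the two words together as a single non-word run.
runs = re.split(r"\b", text)
print(runs)  # ['', 'End', '.  ', 'Start', '']

# A UAX#29-style word segmenter (Perl's new boundary assertion) would also
# report a boundary between the two spaces, so ".", " " and " " come back
# as three separate segments rather than one run.
```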
From richard.wordingham at ntlworld.com Sat Aug 22 16:46:08 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 22 Aug 2015 22:46:08 +0100 Subject: \b{wb} In-Reply-To: <55D8D6AE.8080008@khwilliamson.com> References: <55D8D6AE.8080008@khwilliamson.com> Message-ID: <20150822224608.384dacaf@JRWUBU2> On Sat, 22 Aug 2015 14:08:14 -0600 Karl Williamson wrote: > But it isn't such a replacement, creating some consternation, and the > main reason is that, unlike \b, it treats the boundary between white > space characters as a breaking opportunity, so that it doesn't create > runs of them. Thus if you have two spaces after a full stop, it > treats each as an individual word. > > My question is "Was this intentional, and if so, Why?" See below. > TR18 says \b{w} is a"Zero-width match at a Unicode word boundary. > Note that this is different than \b alone, which corresponds to \w > and \W." Unless I'm being stupid, \b and \b{w} are indeed very different. Consider a sequence of a space, a flag (a regional indicator pair) and "Ab". That has two internal word boundaries, splitting it into a space, a flag, and the word "Ab". Is this what you want? Worse, consider a short Thai sentence ???????????????????????. That gets split by ICU into |??|?????|???????????|???|??| - 5 words and 4 internal word boundaries. Note that there's a word or two between each boundary. Is this what you want? > My question is "Was this intentional, and if so, Why?" Take a look at the rules in UAX#29 Section 4.1.1. Apart from the first two and the last, they all identify where word boundaries aren't. This is tidy - the algorithm concentrates on working out where a word continues. In principle, you could, I believe, extend the rules so that characters outside words and regional indicator runs were not divided, but it would make for a more complicated algorithm with plenty of opportunities for error. I think the thought was that word-free runs did not need to be assembled into runs of non-word material. 
The short answer, of course, is that the regular expression engine could do this final step of post-processing itself. This may get tricky with customised word-breaking. Richard. From richard.wordingham at ntlworld.com Sat Aug 22 16:47:06 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 22 Aug 2015 22:47:06 +0100 Subject: Square Brackets with Tick In-Reply-To: <55D8B23D.6070405@ix.netcom.com> References: <55D8B23D.6070405@ix.netcom.com> Message-ID: <20150822224706.5680b7d3@JRWUBU2> On Sat, 22 Aug 2015 10:32:45 -0700 "Asmus Freytag (t)" wrote: > On 8/22/2015 9:35 AM, Julian Bradfield wrote: >> There is no inherent meaning to the >> order of codepoints, it's just convenience. > And for that reason, we have property files to explicitly give the > properties rather than asking the user to "glean" them from code > point order. But codepoints are normally orderly until they enter the ISO approval process. Thereafter, disorder creeps in, and becomes ever more likely as blocks fill up. The concern here is that the opening-closing pairing information, which used not to be a property, has been deduced wrongly. The code chart is prima facie evidence that whoever drew the order up conceived of U+298D and U+298E as a pair. I've traced the character as far back as http://www.unicode.org/L2/L1999/99159.pdf . Unfortunately, its meaning therein is implicitly described as unknown! It looks as though someone somewhere fashioned type for it - or perhaps another of the set of four - but no-one remembers what it was used for! Now, *if* no-one is using it, it doesn't really matter if the pair is wrong. Richard. 
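The post-processing step Richard mentions can be sketched directly: given UAX#29-style boundary offsets, drop every boundary that falls strictly inside a run of non-word characters, recovering \b-like runs. This is an illustrative sketch under simplifying assumptions, not ICU's or Perl's actual implementation; "word character" is approximated here by the regex class \w.

```python
import re

def collapse_nonword_runs(text, boundaries):
    """Keep only boundaries where at least one neighbour is a word
    character (plus the two ends), merging non-word segments into runs."""
    kept = []
    for b in boundaries:
        if b == 0 or b == len(text):
            kept.append(b)
        elif re.match(r"\w", text[b - 1]) or re.match(r"\w", text[b]):
            kept.append(b)
    return kept

text = "End.  Start"
# UAX#29-style offsets: ".", " " and " " are three separate segments.
uax29 = [0, 3, 4, 5, 6, 11]
print(collapse_nonword_runs(text, uax29))  # [0, 3, 6, 11], classic \b behaviour
```

Customised word-breaking complicates this, as Richard notes, because the post-processor's notion of "word character" then has to agree with whatever the tailored segmenter used.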
From asmusf at ix.netcom.com Sat Aug 22 19:53:14 2015 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sat, 22 Aug 2015 17:53:14 -0700 Subject: Square Brackets with Tick In-Reply-To: <20150822224706.5680b7d3@JRWUBU2> References: <55D8B23D.6070405@ix.netcom.com> <20150822224706.5680b7d3@JRWUBU2> Message-ID: <55D9197A.1040700@ix.netcom.com> An HTML attachment was scrubbed... URL: From nigel at nigelsmall.com Sun Aug 23 04:50:57 2015 From: nigel at nigelsmall.com (Nigel Small) Date: Sun, 23 Aug 2015 10:50:57 +0100 Subject: Square Brackets with Tick In-Reply-To: <55D9197A.1040700@ix.netcom.com> References: <55D8B23D.6070405@ix.netcom.com> <20150822224706.5680b7d3@JRWUBU2> <55D9197A.1040700@ix.netcom.com> Message-ID: Thanks to everyone for your responses so far. In terms of my comment on which brackets make intuitive pairs, I should perhaps have explained my thought process more clearly. If one is to consider the possible origins of these symbols, one likely idea is that they could be used to symbolise a bracketed expression that has been "slashed through". In that context, pairing the top left tick with the bottom right tick makes sense, as does the pairing of the other two. Then, the original code point order remains consistent (though I understand this need not have any relevance). This appears to mirror Asmus' observation. On 23 Aug 2015 1:58 am, "Asmus Freytag" wrote: > On 8/22/2015 2:47 PM, Richard Wordingham wrote: > > But codepoints are normally orderly until they enter the ISO approval > process. Thereafter, disorder creeps in, and becomes ever more likely > as blocks fill up > > > Haha, good one. > > . The concern here is that the opening-closing > pairing information, which used not to be a property, has been deduced > wrongly. The code chart is prima facie evidence that whoever drew the > order up conceived of U+298D and U+298E as a pair. > > > Not necessarily. Code charts are sometimes ordered in mysterious ways. > However, read on. 
> > > > I've traced the character as far back ashttp://www.unicode.org/L2/L1999/99159.pdf . Unfortunately, its meaning > therein is implicitly described as unknown! It looks as though someone > somewhere fashioned type for it - or perhaps another of the set of four > - but no-one remembers what it was used for! > > > This document doesn't tell you what the pairing is supposed to be, only > that which > ones are opening and closing (so we know that they are intended to be > arranged [ ] > and not ] [ (ticks omitted), but we don't know which of the two [[ go with > which of > the two ]], other than the - natural - assumptions that pairs are listed > adjacently). > > For the first document that gives the pairing information, see: > > http://www.unicode.org/L2/L2012/12173r-bidi-paren.pdf > > There is no note or other indication in this document that shows that any > thought > was put into the different ordering. > > However, it is notable that all other bracket pairings follow the bidi > mirroring glyph > relation, so I would put my money on that that file was used to create the > pairs using > a script, rather than manual editing. > > This is corroborated in section 3.2 of that document. > > Nigel was the first to notice that these were not encoded as left-right > glyph pairs, > but with the diagonal "tick" (originally called a solidus) having the same > orientation > in a pair (as if intended to bracket something in either diagonal or > anti-diagonal > direction). > > Given that L2/12-173 states that the property was derived via algorithm > that is based > on left-right mirroring and not via matching open/close pairs based on > other factors, > (including adjacency in the charts) I'm happy to join the growing chorus > that declares > this to be a bug. > > Luckily there seems to be no stability policy that would prevent fixing > this one. > > A./ > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From richard.wordingham at ntlworld.com Sun Aug 23 09:15:39 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 23 Aug 2015 15:15:39 +0100 Subject: UAX#29 Word-Breaking Interface for Complex Context Message-ID: <20150823151539.63a73b9a@JRWUBU2> The word-breaking algorithm defines an apparently innocuous interface for word breaking of 'complex context' scripts such as Thai, Lao and Myanmar. The complex context part, whose internals are deliberately and reasonably not defined by Unicode, assigns word break property values to the characters. Are there any implementations that work that way? Negative answers such as 'xxx does not work that way' would also be useful. For example, ICU does not work this way. Instead, the complex context parts deliver word boundaries rather than character properties to the part of the algorithm working in accordance with a tailoring of the algorithm in UAX#29. It seems that in general the assignments may be a little complicated. For example, in the usual case of interest, Thai script word characters delimited by white space, it seems to me that the characters of alternate words should be assigned to 'ALetter' and 'Katakana'. Have I missed a trick here? 'RI' is a new alternative to 'ALetter' and 'Katakana', but that seems even more bizarre, and I'd worry about its stability. I'm finding some interesting constraints arising from the interface. For example, *within* x?y (that's a Thai letter flanked by two English letters), there are either no or two word boundaries. By contrast, there may be no, one or two linebreak opportunities *within* the string. Richard. From jknappen at web.de Mon Aug 24 04:39:49 2015 From: jknappen at web.de ("Jörg Knappen") Date: Mon, 24 Aug 2015 11:39:49 +0200 Subject: Aw: Re: Square Brackets with Tick In-Reply-To: References: , Message-ID: An HTML attachment was scrubbed... 
URL: From wjgo_10009 at btinternet.com Mon Aug 24 05:00:32 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Mon, 24 Aug 2015 11:00:32 +0100 (BST) Subject: Square Brackets with Tick In-Reply-To: References: <55D8B23D.6070405@ix.netcom.com> <20150822224706.5680b7d3@JRWUBU2> <55D9197A.1040700@ix.netcom.com> Message-ID: <20720148.17379.1440410432467.JavaMail.defaultUser@defaultHost> Looking at the document http://www.unicode.org/L2/L1999/99159.pdf that has been mentioned, the four bracket characters are therein described as follows. 4X1F O LEFT BRACKET, REVERSE SOLIDUS TOP CORNER 4X20 C RIGHT BRACKET, REVERSE SOLIDUS BOTTOM CORNER 4X21 O LEFT BRACKET, SOLIDUS BOTTOM CORNER 4X22 C RIGHT BRACKET, SOLIDUS TOP CORNER So it looks like the pairings in Unicode today are as originally intended. May I suggest a (possibly new) use for the brackets please? When a person is transcribing a document into a computer, perhaps a historical document, the first pair could be used to indicate a transcriber's note that the text between the brackets was crossed out in the original document, and the second pair could be used to indicate a transcriber's note that the text between the brackets was crossed out in the original document yet had been reinstated in the original document, either by the word stet being placed next to the crossed-out text or otherwise. William Overington 24 August 2015 -------------- next part -------------- An HTML attachment was scrubbed... URL: From shawnlandden at tuta.io Mon Aug 24 14:03:54 2015 From: shawnlandden at tuta.io (Shawn Landden) Date: Mon, 24 Aug 2015 19:03:54 +0000 (UTC) Subject: Arabic ligatures Message-ID: From github. https://github.com/golang/go/issues/12298 Arabic ligatures have been deprecated[1], despite a need for both ligature and non-ligature versions of the same glyphs. Amiri uses contextual alternatives for ????.? 
These ligatures are used in religious documents[2] via pictures, which seems to be what the current Unicode standard recommends. Unlike the presentation forms, there is a case for these phrases and formulas to be available both in ligature and non-ligature form. These ligatures should be non-deprecated and subject to canonical decomposition, rather than compatibility decomposition. http://www.unicode.org/reports/tr15 [1] https://en.wikipedia.org/wiki/Arabic_script_in_Unicode#Word_ligatures [2] http://www.mujahideenryder.net/pdf/WhoAretheDisbelievers.pdf -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Mon Aug 24 14:35:14 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 24 Aug 2015 20:35:14 +0100 Subject: Square Brackets with Tick In-Reply-To: <20720148.17379.1440410432467.JavaMail.defaultUser@defaultHost> References: <55D8B23D.6070405@ix.netcom.com> <20150822224706.5680b7d3@JRWUBU2> <55D9197A.1040700@ix.netcom.com> <20720148.17379.1440410432467.JavaMail.defaultUser@defaultHost> Message-ID: <20150824203514.58ceb74d@JRWUBU2> On Mon, 24 Aug 2015 11:00:32 +0100 (BST) William_J_G Overington wrote: > Looking at the document > http://www.unicode.org/L2/L1999/99159.pdf > that has been mentioned, the four bracket characters are therein > described as follows. > 4X1F O LEFT BRACKET, REVERSE SOLIDUS TOP CORNER > 4X20 C RIGHT BRACKET, REVERSE SOLIDUS BOTTOM CORNER > 4X21 O LEFT BRACKET, SOLIDUS BOTTOM CORNER > 4X22 C RIGHT BRACKET, SOLIDUS TOP CORNER > So it looks like the pairings in Unicode today are as originally > intended. How so? There are two relevant pairings in Unicode - the Bidi_Mirroring_Glyph and Bidi_Paired_Bracket. Both pair the 1st and the 4th together and the 2nd and the 3rd together. Now, Bidi_Mirroring_Glyph is based mainly on appearance (or have I missed a caveat?), and that seems to be correct. 
Bidi_Paired_Bracket is based on semantics, which are difficult to be sure of when we have no examples of use. Indeed, some quote marks are notoriously inconsistent from language to language. I am assuming that it is better to render reversed ⊄ U+2284 NOT A SUBSET OF (= U+2282 SUBSET OF + U+0338 COMBINING LONG SOLIDUS OVERLAY), rather than the unreversed glyph of ⊅ U+2285 NOT A SUPERSET OF (= U+2283 SUPERSET OF + U+0338), despite U+2284 and U+2285 being a bidi-mirroring pair. If one took the view that a combining solidus didn't mirror (as indeed, it doesn't according to the UCD), and that the ticks are unhidden parts of solidi, then the Bidi_Mirroring_Glyph properties would be wrong! Good taste is probably the only way through the bidi mirroring maze. Richard. From fantasai.lists at inkedblade.net Sun Aug 23 11:13:44 2015 From: fantasai.lists at inkedblade.net (fantasai) Date: Sun, 23 Aug 2015 18:13:44 +0200 Subject: Why doesn't Ideographic (ID) in UAX#14 have half-width katakana? In-Reply-To: References: <553EEE6D.2020004@ga2.so-net.ne.jp> <553EFB2E.3010808@hiroshima-u.ac.jp> <55403267.9060202@att.net> <55438AFB.6020000@att.net> <55439AE4.4020109@hiroshima-u.ac.jp> <5543AE4B.5020904@att.net> <5543D03B.80603@ix.netcom.com> <55467CAF.4080401@ix.netcom.com> <4A7F11C4-2F5B-4156-844C-C21E543FCFC7@apple.com> <55D4AD10.1080104@inkedblade.net> <55D4B2C3.2010000@att.net> Message-ID: <55D9F138.4090201@inkedblade.net> On 08/20/2015 08:18 AM, Koji Ishii wrote: > Right, this should be applied to only where currently AL. > > The basic idea is that, full width is a concept to use a character in an "imported" manner and thus different characteristics > are applied, while half width is a concept of saving screen real estate and/or for legacy cultural usages so the > characteristics should be the same as its full width counterpart, except the width. This sounds good to me. Let me know if the UTC wants an official CSSWG resolution of support on this change, I can arrange for that if necessary. 
~fantasai From wjgo_10009 at btinternet.com Tue Aug 25 03:54:29 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Tue, 25 Aug 2015 09:54:29 +0100 (BST) Subject: Square Brackets with Tick In-Reply-To: <20150824203514.58ceb74d@JRWUBU2> References: <55D8B23D.6070405@ix.netcom.com> <20150822224706.5680b7d3@JRWUBU2> <55D9197A.1040700@ix.netcom.com> <20720148.17379.1440410432467.JavaMail.defaultUser@defaultHost> <20150824203514.58ceb74d@JRWUBU2> Message-ID: <26422784.15492.1440492869448.JavaMail.defaultUser@defaultHost> Richard Wordingham wrote: > On Mon, 24 Aug 2015 11:00:32 +0100 (BST) William_J_G Overington wrote: >> Looking at the document >> http://www.unicode.org/L2/L1999/99159.pdf >> that has been mentioned, the four bracket characters are therein described as follows. >> 4X1F O LEFT BRACKET, REVERSE SOLIDUS TOP CORNER >> 4X20 C RIGHT BRACKET, REVERSE SOLIDUS BOTTOM CORNER >> 4X21 O LEFT BRACKET, SOLIDUS BOTTOM CORNER >> 4X22 C RIGHT BRACKET, SOLIDUS TOP CORNER >> So it looks like the pairings in Unicode today are as originally intended. > How so? I was simply observing that the original pairings had the first-listed pair of brackets listed using REVERSE SOLIDUS and had the second-listed pair of brackets listed using SOLIDUS contrasting that clear pairing of the brackets with the use, in the encoding into Unicode, of TICK in the listing for each of the four of the bracket characters that are being discussed in this thread. William Overington 25 August 2015 From doug at ewellic.org Tue Aug 25 10:31:09 2015 From: doug at ewellic.org (Doug Ewell) Date: Tue, 25 Aug 2015 08:31:09 -0700 Subject: Arabic ligatures Message-ID: <20150825083109.665a7a7059d7ee80bb4d670165c8327d.f0b345d455.wbe@email03.secureserver.net> Shawn Landden wrote: > Arabic ligitures have been deprecated[1], despite a need for both > ligitures and non-ligature versions of the same glyphs. 
The only Arabic character that is deprecated in the standard is U+0673 ARABIC LETTER ALEF WITH WAVY HAMZA BELOW. The Wikipedia article cited as "[1]" does not claim otherwise. > Amiri uses contextual alternatives for ????. These ligatures are > used in religious documents[2] via pictures, which seems to be what > the current Unicode standard recommends. What is your source for this? > Unlike the presentation forms, there is case for these phrases and > formulas to be available both in ligature and non-ligature form. All Arabic letters and combinations can be rendered in ligated or non-ligated forms as needed using some combination of ZWJ and ZWNJ. See TUS 8.0, Section 9.2. > These ligatures should be non-deprecated and subject to canonical > decomposition, rather than compatibility decomposition. Section 9.2 (page 386 ff.) explains the Arabic Presentation Forms-A block (U+FB50?U+FDFF) in greater detail. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From richard.wordingham at ntlworld.com Tue Aug 25 14:07:35 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 25 Aug 2015 20:07:35 +0100 Subject: Square Brackets with Tick In-Reply-To: <26422784.15492.1440492869448.JavaMail.defaultUser@defaultHost> References: <55D8B23D.6070405@ix.netcom.com> <20150822224706.5680b7d3@JRWUBU2> <55D9197A.1040700@ix.netcom.com> <20720148.17379.1440410432467.JavaMail.defaultUser@defaultHost> <20150824203514.58ceb74d@JRWUBU2> <26422784.15492.1440492869448.JavaMail.defaultUser@defaultHost> Message-ID: <20150825200735.6c88cda8@JRWUBU2> On Tue, 25 Aug 2015 09:54:29 +0100 (BST) William_J_G Overington wrote: > Richard Wordingham wrote: > > > On Mon, 24 Aug 2015 11:00:32 +0100 (BST) > William_J_G Overington wrote: > > >> Looking at the document > >> http://www.unicode.org/L2/L1999/99159.pdf > >> that has been mentioned, the four bracket characters are therein > >> described as follows. 
> > >> 4X1F O LEFT BRACKET, REVERSE SOLIDUS TOP CORNER > >> 4X20 C RIGHT BRACKET, REVERSE SOLIDUS BOTTOM CORNER > >> 4X21 O LEFT BRACKET, SOLIDUS BOTTOM CORNER > >> 4X22 C RIGHT BRACKET, SOLIDUS TOP CORNER > >> So it looks like the pairings in Unicode today are as originally > >> intended. > > How so? > I was simply observing that the original pairings had the > first-listed pair of brackets listed using REVERSE SOLIDUS and had > the second-listed pair of brackets listed using SOLIDUS contrasting > that clear pairing of the brackets with the use, in the encoding into > Unicode, of TICK in the listing for each of the four of the bracket > characters that are being discussed in this thread. You said the 'pairings in Unicode'. With the exception of decimal digits, the scalar values of assigned characters have no *formal* relationship to their interpretation. The scalar values are about as significant as the difference between canonically equivalent non-Greek, non-Korean sequences. At best the different sequences give a hint of what the author thinks about the character. For example U+00E9 LATIN SMALL LETTER E WITH ACUTE suggests it may be thought of as a character, while the decomposed sequence <U+0065, U+0301> suggests that it may be two characters - the diacritic could be a length mark or a tone. The distinction is not to be relied upon - normalisation would obliterate it. Richard. 
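The obliteration Richard describes is easy to demonstrate with any normalisation API; for example, with Python's standard unicodedata module:

```python
import unicodedata

precomposed = "\u00E9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 + U+0301 COMBINING ACUTE ACCENT

# The two spellings are canonically equivalent but distinct code point
# sequences; NFC and NFD each collapse them to a single form, erasing
# whatever hint the author's original choice may have carried.
assert precomposed != decomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```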
From asmus-inc at ix.netcom.com Tue Aug 25 17:26:44 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Tue, 25 Aug 2015 15:26:44 -0700 Subject: Square Brackets with Tick In-Reply-To: <20150825200735.6c88cda8@JRWUBU2> References: <55D8B23D.6070405@ix.netcom.com> <20150822224706.5680b7d3@JRWUBU2> <55D9197A.1040700@ix.netcom.com> <20720148.17379.1440410432467.JavaMail.defaultUser@defaultHost> <20150824203514.58ceb74d@JRWUBU2> <26422784.15492.1440492869448.JavaMail.defaultUser@defaultHost> <20150825200735.6c88cda8@JRWUBU2> Message-ID: <55DCEBA4.6020703@ix.netcom.com> On 8/25/2015 12:07 PM, Richard Wordingham wrote: > On Tue, 25 Aug 2015 09:54:29 +0100 (BST) > William_J_G Overington wrote: > >> Richard Wordingham wrote: >> >>> On Mon, 24 Aug 2015 11:00:32 +0100 (BST) >> William_J_G Overington wrote: >> >>>> Looking at the document >>>> http://www.unicode.org/L2/L1999/99159.pdf >>>> that has been mentioned, the four bracket characters are therein >>>> described as follows. >>>> 4X1F O LEFT BRACKET, REVERSE SOLIDUS TOP CORNER >>>> 4X20 C RIGHT BRACKET, REVERSE SOLIDUS BOTTOM CORNER >>>> 4X21 O LEFT BRACKET, SOLIDUS BOTTOM CORNER >>>> 4X22 C RIGHT BRACKET, SOLIDUS TOP CORNER >>>> So it looks like the pairings in Unicode today are as originally >>>> intended. >>> How so? >> I was simply observing that the original pairings had the >> first-listed pair of brackets listed using REVERSE SOLIDUS and had >> the second-listed pair of brackets listed using SOLIDUS contrasting >> that clear pairing of the brackets with the use, in the encoding into >> Unicode, of TICK in the listing for each of the four of the bracket >> characters that are being discussed in this thread. > You said the 'pairings in Unicode'. With the exception of decimal > digits, the scalar values of assigned characters have no *formal* > relationship to their interpretation. 
The scalar values are about as > significant as the difference between canonically equivalent > non-Greek, non-Korean sequences. At best the different sequences give a > hint of what the author thinks about the character. For example U+00E9 > LATIN SMALL LETTER E WITH ACUTE suggests it may be thought of as a > character, while the decomposed sequence <U+0065, U+0301> suggests that it may be two > characters - the diacritic could be a length mark or a tone. The > distinction is not to be relied upon - normalisation would obliterate > it. > I think William makes a reasonable point that conceiving of the "ticks" as angled lines and then naming their direction in pairs potentially reinforces the notion that the sets with matching naming were intended as pairs. While this is being bandied about here on the list, an offline effort is underway to see whether it's possible to find out more about the origin and potential use of these marks - prior to their encoding in Unicode. We know they came from some SGML entity sets, but how and why they got into those is still a bit of a mystery; locked away in the head of the original creator of these sets. We may never get a more definite answer, unless someone here is conversant with whatever field of mathematics uses these brackets, or knows someone who is. A./ From duerst at it.aoyama.ac.jp Thu Aug 27 02:39:31 2015 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Thu, 27 Aug 2015 16:39:31 +0900 Subject: Why doesn't Ideographic (ID) in UAX#14 have half-width katakana? In-Reply-To: References: <553EEE6D.2020004@ga2.so-net.ne.jp> <553EFB2E.3010808@hiroshima-u.ac.jp> Message-ID: <55DEBEB3.5090201@it.aoyama.ac.jp> Sorry to be late. Just some background information. On 2015/04/28 14:57, Makoto Kato wrote: > Although I read JIS X 4051, it doesn't define that half-width katakana > and full-width katakana are differently. I was on the committee that updated JIS X 4051 (mostly liaison/observer role). The chair of that committee was Prof. Shibano (?? 
??), who was also chair of the committee responsible for the Japanese character standards as well as chair of ISO/IEC JTC1 SC2. I very well remember that he explained at one point that as far as the standards were concerned, full-width and half-width versions were considered one and the same character. In modern terms, the standards' view was that single-byte and double-byte encodings of these characters were just different "encoding forms" of one and the same abstract character. This view is confirmed e.g. by the character names used in the 1997 version (confirmed 2002) of JIS X 0201, which are just "KATAKANA LETTER A",... Anybody interested can dig deeper, JIS X 0201 was just what was most easily accessible to me now. The justification behind this is that they are linguistically not different at all, and that they were intended just as a fallback due to technology (memory, display resolution) limitations. In practice, technical restrictions in early limitations (one byte == one (half-width) character cell) led to a typographic distinction. The fact that half-width Kana used less space was exploited in fixed-pitch screen design. That led to a desire to keep the distinction when round-tripping via Unicode, and thus to different character names. Regards, Martin. From duerst at it.aoyama.ac.jp Thu Aug 27 03:04:16 2015 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Thu, 27 Aug 2015 17:04:16 +0900 Subject: Why doesn't Ideographic (ID) in UAX#14 have half-width katakana? In-Reply-To: <55DEBEB3.5090201@it.aoyama.ac.jp> References: <553EEE6D.2020004@ga2.so-net.ne.jp> <553EFB2E.3010808@hiroshima-u.ac.jp> <55DEBEB3.5090201@it.aoyama.ac.jp> Message-ID: <55DEC480.20705@it.aoyama.ac.jp> Sorry, one correction: On 2015/08/27 16:39, Martin J. Dürst wrote: > In practice, technical restrictions in early limitations (one byte == > one (half-width) character cell) led to a typographic distinction. 
The > fact that half-width Kana used less space was exploited in fixed-pitch > screen design. That lead to a desire to keep the distinction when > round-tripping via Unicode, and thus to different character names. "early limitations" -> "early technologies. Regards, Martin. From charupdate at orange.fr Thu Aug 27 14:49:45 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Thu, 27 Aug 2015 21:49:45 +0200 (CEST) Subject: Thai Word Breaking In-Reply-To: <20150822143530.29f1e883@JRWUBU2> References: <20150822143530.29f1e883@JRWUBU2> Message-ID: <2062952758.26825.1440704985138.JavaMail.www@wwinf1g16> On 22 Aug 2015 at 15:47, Richard Wordingham wrote: > I'm trying to work out the meaning of TUS 8.0 Section 23.2. > > To do Thai word breaking properly, one needs to do a semantic analysis > of the text to do the equivalent of resolving the equivalent of > 'humanevents' into 'human events' rather than 'humane vents'. One also > needs to cope with unknown and misspelt words. (A lot of effort has > been devoted to avoid going to the extreme of doing semantic analysis.) > However, it is possible to read Section 23.2 as prohibiting the use of > certain information, and I would like to check whether this is the > intended meaning. > > The opening paragraph seems clear enough on first reading: > > "The effect of layout controls is specific to particular text processes. > As much as possible, lay-out controls are transparent to those text > processes for which they were not intended. In other words, their > effects are mutually orthogonal." > > However, my first question is, "Are paragraph boundaries > directly admissible as evidence for or against word boundaries not > adjacent to them?". For example, most Thai word breakers would not > regard a paragraph boundary as any more significant than a > phrase-delimiting space. However, a paragraph boundary often indicates > a change of topic. 
> > My second question is, "Are line breaks admissible as evidence for > or against word boundaries not adjacent to them?" For example, if a > phrase makes heavy use of U+200B ZERO WIDTH SPACE (ZWSP), one may deduce > that it is likely that all word boundaries within it are marked > explicitly. This example is more useful for Khmer than for Thai, for > whereas Cambodians were once taught to mark word boundaries, Thais > rarely use ZWSP to mark word boundaries. > > My third question is, "Is the absence of a line break opportunity > admissible as evidence for or against a word boundary?". Here I > see conflicting signals. > > There is a character U+2060 WORD JOINER (WJ) which *used* to be regarded > as the counterpart of ZWSP. The understanding was that ZWSP marked a > word boundary and provided a line-break opportunity, while WJ denied > both. This, however, is no longer the case. To quote the TUS section > about WJ: > > P1: (Ignored) > > P2S1: The word joiner must not be confused with the zero width joiner > or the combining grapheme joiner, which have very different functions. > > P2S2: In particular, inserting a word joiner between two characters has > no effect on their ligating and cursive joining behavior. > > P2S3: The word joiner should be ignored in contexts other than line > breaking. > > P2S4: Note in particular that the word joiner is ignored for word > segmentation. > > P2S5: (See Unicode Standard Annex #29, "Unicode Text Segmentation.") > > Paragraph 2 Sentence 3 (P2S3) appears to rule out its use in > word-breaking, but perhaps it does not if line-breaking is being used > as evidence for word boundaries. > > P2S4 has three very different interpretations: > > (i) This is an assertion of fact, and may therefore be incorrect. > > (ii) The word 'is' is sloppy wording for 'should be'. Section 23.2 > contains much sloppier wording, as I have already advised members of > the UTC (4 July 2015). 
> > (iii) This is a deduction from other parts of the specification. Now, > if P2S4 said 'is normally ignored for word segmentation', that would > have made sense, for that applies to the default word boundary > specification in UAX#29. However, just before Section 4.1, UAX#29 > explains that it does not specify what happens for word boundary > determination in Thai! (It does constrain what happens, though.) > > At the end of UAX#29 Section 6.2, there is the provision, "The Ignore > rules should not be overridden by tailorings, with the possible > exception of remapping some of the Format characters to other > classes." To accord with the user perceptions of Unicode-aware > people who work with SE Asian scripts, I am tempted to ask for CLDR > to tailor the word-breaking algorithms for the corresponding languages > so that the word-breaking classes of WJ (and ZWNBSP) are changed from > Format to MidLetter. That would match the widespread old *perception* > that there should be no word break in a sequence <Thai letter, (non-spacing > mark,)* WJ, Thai letter>. However, there are several objections: > > (a) Perhaps P2S3 and P2S4 prohibit this. > > (b) If the word-break property of Thai letters falls back to Other, > there would still be a word break between them. > > (c) If the word-break property of Thai letters fell back to ALetter, > an old suggestion, WJ would have no effect on the presence of a word > break. > > (d) If Thai word breaking assigns word-break classes to each letter > (gc=Lo), then word boundaries can be suppressed by choosing the classes > appropriately. Non-spacing Thai vowels are very relevant to Thai > word-breaking, but formally are 'ignored'. WJ could be 'ignored' in > exactly the same way. Still nobody answered the questions Richard Wordingham raised five days ago. I'm very busy and can hardly channel off any time for concerns not related so far, except when I believe there's some need, as this is a discussion list. 
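One practical footnote to the quoted questions: neither WJ nor ZWNBSP is a White_Space character, so naive whitespace tokenisers already keep both word-internal; a quick Python check (far short of a real UAX #29 segmenter):

```python
# U+2060 WORD JOINER and U+FEFF ZWNBSP are format (Cf) characters,
# not White_Space, so whitespace-based splitting leaves them inside
# a "word" rather than breaking on them.
text = "fix\u2060ture and fix\uFEFFture"
tokens = text.split()
print(tokens)  # -> ['fix\u2060ture', 'and', 'fix\ufeffture']
```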
However the Word Joiner topic made me launch a thread too, which has been thankfully answered. Now I feel that even if the WJ is apparently tailored to delimit words in mainstream word processors, the Standard denies this property, and Richard agrees if I've well understood. The criticism he works out should IMHO be fed into the 9.0 workflow. Any comments from Thai users, implementers, and scientists? Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Thu Aug 27 15:59:14 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Thu, 27 Aug 2015 22:59:14 +0200 (CEST) Subject: Thai Word Breaking Message-ID: <628722595.28314.1440709154153.JavaMail.www@wwinf1h13> On 27 Aug 2015 at 21:49, I wrote: > However the Word Joiner topic made me launch a thread too, which has been thankfully answered. Please read: [...] which has been answered and I'm thankful. Apart, I've an off-topic: Keyboard source files can be converted and compiled with the Kbdutool.exe of MSKLC even when they have not entirely been generated by the software. In other words, we are invited to add chained dead keys directly in the .klc file, because they are supported by Kbdutool, and run this tool thanks to its command line UI. Among the switches, we find also one to get the C sources only. I would have e-mailed this the day the whole process is working (to date, I can just include my custom header, via an #include at the end of the kbd.h in the \inc\ directory), as there is no such switch to get Kbdutool.exe compile from the C sources. 
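A side note on entering supplementary-plane characters in such source files: .klc tables and the generated C sources work in UTF-16 code units, so anything beyond the BMP must be written as a surrogate pair. The standard split, sketched in Python (the helper name is illustrative, not an MSKLC API):

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary-plane code point into UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

# U+1D11E MUSICAL SYMBOL G CLEF -> D834 DD1E
print([f"{u:04X}" for u in to_surrogate_pair(0x1D11E)])
```

The right shift by 10 is indeed just an integer division by 1024, which is why the same computation can be written as plain arithmetic in a spreadsheet formula.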
IMHO what we must not do, is to insist to have graphic UIs for the whole keyboard layout creation, because experience shows that keyboard editing, especially dead key repertories, as well as the allocation table and ligature table, are best done in spreadsheets (where we can also have the diagrams), with the whole NamesList (or the part containing identifiers and heads/subheads), and the surrogate pairs beside in two formula-generated columns (using little hex conversion tables because Excel can AFAIK not handle the >> and << operators (this is >>, << in the case it disappears). Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Thu Aug 27 18:09:52 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 28 Aug 2015 00:09:52 +0100 Subject: Thai Word Breaking In-Reply-To: <2062952758.26825.1440704985138.JavaMail.www@wwinf1g16> References: <20150822143530.29f1e883@JRWUBU2> <2062952758.26825.1440704985138.JavaMail.www@wwinf1g16> Message-ID: <20150828000952.14f6ca50@JRWUBU2> On Thu, 27 Aug 2015 21:49:45 +0200 (CEST) Marcel Schneider wrote: > On 22 Aug 2015 at 15:47, Richard Wordingham wrote: > Still nobody answered the questions Richard Wordingham raised five > days ago. There are not many people who are in a position to say what unclear sections of TUS are intended to mean. I may have scared them into silence by noting that people changing code because of one particular *new* sentence in Section 23.2, namely: > > P2S4: Note in particular that the word joiner is ignored for word > > segmentation. are at risk (but see below) of putting themselves in breach of the UK's 'Equality Act 2010'; more generally, they may be in breach of transpositions of the EU Racial Equality Directive (2000/43/EC). You don't need to have racialist intentions to be in breach. > > (ii) The word 'is' is sloppy wording for 'should be'. 
Section 23.2 > > contains much sloppier wording, as I have already advised members of > > the UTC (4 July 2015). This comment applies to the part of Section 23.2 referring to U+FEFF ZERO WIDTH NO-BREAK SPACE (ZWNBSP). UTC members were advised that to be consistent, it should have changes corresponding to those made for WJ. Such changes weren't made to the section on ZWNBSP, and so I can read Section 23.2 as saying that ZWNBSP can be used to mark word boundaries whereas WJ cannot. Reading the standard this way would probably protect the writers of text editors (including word processors) from the European legislation against indirect discrimination. It's still a shame about the degradation of old text that uses WJ instead of ZWNBSP, but it should still render fine if one switches spell-checking off. Word counts will change, though. Richard. From charupdate at orange.fr Sat Aug 29 15:33:57 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Sat, 29 Aug 2015 22:33:57 +0200 (CEST) Subject: Thai Word Breaking In-Reply-To: <20150828000952.14f6ca50@JRWUBU2> References: <20150822143530.29f1e883@JRWUBU2> <2062952758.26825.1440704985138.JavaMail.www@wwinf1g16> <20150828000952.14f6ca50@JRWUBU2> Message-ID: <1275427180.13085.1440880437214.JavaMail.www@wwinf1n31> On 28 Aug 2015 at 01:19, Richard Wordingham wrote: > I may have scared them into > silence by noting that people changing code because of one particular > *new* sentence in Section 23.2, namely: > > > > P2S4: Note in particular that the word joiner is ignored for word > > > segmentation. > > are at risk (but see below) of putting themselves in breach of the UK's > 'Equality Act 2010'; more generally, they may be in breach of > transpositions of the EU Racial Equality Directive (2000/43/EC). You > don't need to have racialist intentions to be in breach. [?] > so I can read > Section 23.2 as saying that ZWNBSP can be used to mark word boundaries > whereas WJ cannot. 
Reading the standard this way would probably protect > the writers of text editors (including word processors) from the > European legislation against indirect discrimination. That?s awesome! So when I have the ordinal indicators both on *one* key because I?need the A and O for German precomposed, and have the ? in the Base shift state and the ? in the Shift shift state (because the primary locale is French, which does use ? but not ?, and BTW the ?? is on N, too), may I be accused of discrimination? If so, I must remove the ordinal indicators from the [I] key and have them in Compose only (Compose, a, _; Compose, o, _). Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Sat Aug 29 15:45:50 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Sat, 29 Aug 2015 22:45:50 +0200 (CEST) Subject: Streamline keyboard programming (was: Re: Thai Word Breaking) In-Reply-To: <628722595.28314.1440709154153.JavaMail.www@wwinf1h13> References: <628722595.28314.1440709154153.JavaMail.www@wwinf1h13> Message-ID: <1323552145.13187.1440881150607.JavaMail.www@wwinf1n31> On 27 Aug 2015 at 23:09, I wrote: > Keyboard source files can be converted and compiled with the Kbdutool.exe of MSKLC even when they have not entirely been generated by the software. In other words, we are invited to add chained dead keys directly in the .klc file, because they are supported by Kbdutool, and run this tool thanks to its command line UI. Among the switches, we find also one to get the C sources only. > I would have e-mailed this the day the whole process is working (to date, I can just include my custom header, via an #include at the end of the kbd.h in the \inc\ directory), as there is no such switch to get Kbdtool.exe compile from the C sources. 
> IMHO what we must not do, is to insist to have graphic UIs for the whole keyboard layout creation, because experience shows that keyboard editing, especially dead key repertories, as well as the allocation table and ligature table, are best done in spreadsheets (where we can also have the diagrams), with the whole NamesList (or the part containing identifiers and heads/subheads), and the surrogate pairs beside in two formula-generated columns (using little hex conversion tables because Excel can AFAIK not handle the >> and << operators (this is >>, << in the case it disappears). Philippe Verdy kindly made me aware that the binary right shift is an integer division. Yep, I didn't notice, and clumsily removed two hex digits and figured out how to get the next one... Thanks to Philippe's advice, the C formulas from the Unicode Frequently Asked Question stand now as short Excel formulas in my NamesList spreadsheet. Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Sat Aug 29 15:55:31 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Sat, 29 Aug 2015 22:55:31 +0200 (CEST) Subject: Thai Word Breaking Message-ID: <424376774.13294.1440881731423.JavaMail.www@wwinf1n31> On 29 Aug 2015 at 22:33 (twenty minutes ago), I wrote: > So when I have the ordinal indicators both on *one* key because I?need the A and O for German precomposed, and have the ? in the Base shift state and the ? in the Shift shift state (because the primary locale is French, which does use ? but not ?, and BTW the ?? is on N, too), may I be accused of discrimination? If so, I must remove the ordinal indicators from the [I] key and have them in Compose only (Compose, a, _; Compose, o, _). Arghhh. On the [I] key, I've ??? on Kana, and ??? on Shift+Kana. Sorry. ? I've some other news but they must wait till tomorrow. Seemingly I'm too tired this evening. ? 
Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Sat Aug 29 18:09:00 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 30 Aug 2015 00:09:00 +0100 Subject: Thai Word Breaking In-Reply-To: <1275427180.13085.1440880437214.JavaMail.www@wwinf1n31> References: <20150822143530.29f1e883@JRWUBU2> <2062952758.26825.1440704985138.JavaMail.www@wwinf1g16> <20150828000952.14f6ca50@JRWUBU2> <1275427180.13085.1440880437214.JavaMail.www@wwinf1n31> Message-ID: <20150830000900.4e2339ad@JRWUBU2> On Sat, 29 Aug 2015 22:33:57 +0200 (CEST) Marcel Schneider wrote: > So when I have the ordinal indicators both on *one* key because > I?need the A and O for German precomposed, and have the ? in the Base > shift state and the ? in the Shift shift state (because the primary > locale is French, which does use ? but not ?, and BTW the ?? is on N, > too), may I be accused of discrimination? Your defence would be that that "practice is objectively justified by a legitimate aim and the means of achieving that aim are appropriate and necessary" - 2000/43/EC Article 1 Paragraph 2(b). Mock not. In the UK, needlessly requiring that a job applicant have a driving licence is unlawful discrimination against women. Not making provision for the hard of hearing at a query desk can be unlawful discrimination - I don't remember whether it was by disability or simply on the basis of age. I'm not sure to what extent these are common EU law and to what extent these are just British law. I've got some web pages where colour-coding is used. It looks as though I've now supposed to find a way of switching the colours to help those with impaired colour vision. Perhaps I'll just have to withdraw the pages. Richard. From tgwizard at gmail.com Sat Aug 29 17:47:12 2015 From: tgwizard at gmail.com (Adam Renberg) Date: Sat, 29 Aug 2015 22:47:12 +0000 Subject: Wrong character code for HELM SYMBOL in TR 51 Unicode Emoji? 
Message-ID: Hi, I've just read through the Unicode Technical Report #51 Unicode Emoji [1], and I have a question. In section 3.3 Methodology [2], third paragraph, it says: "This document takes a functional view regarding the identification of emoji: pictographs are categorized as emoji when it is reasonable to give them an emoji presentation, and where they are sufficiently distinct from other emoji characters. Symbols with a graphical form that people may treat as pictographs, such as U+2615 HELM SYMBOL (introduced in Unicode 3.0) may be included." However, when I look up the HELM SYMBOL, it seems to have code U+2388 [3][4][5]. The character with code U+2615 is HOT BEVERAGE [6][7]. Is this a mistake in the technical report? Best regards, Adam Renberg [1]: http://www.unicode.org/reports/tr51/index.html [2]: http://www.unicode.org/reports/tr51/index.html#Methodology [3]: http://www.unicode.org/charts/PDF/U2300.pdf [4]: http://unicode-table.com/en/2388/ [5]: http://www.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.txt [6]: http://www.unicode.org/charts/PDF/U2600.pdf [7]: http://unicode-table.com/en/2615/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From gwalla at gmail.com Sat Aug 29 19:33:48 2015 From: gwalla at gmail.com (Garth Wallace) Date: Sat, 29 Aug 2015 17:33:48 -0700 Subject: Wrong character code for HELM SYMBOL in TR 51 Unicode Emoji? In-Reply-To: References: Message-ID: It certainly looks that way. In just the next paragraph it mentions "U+2615 HOT BEVERAGE (introduced in Unicode 4.0)" On Sat, Aug 29, 2015 at 3:47 PM, Adam Renberg wrote: > Hi, > > I've just read through the Unicode Technical Report #51 Unicode Emoji [1], > and I have a question. 
In section 3.3 Methodology [2], third paragraph, it > says: > > "This document takes a functional view regarding the identification of > emoji: pictographs are categorized as emoji when it is reasonable to give > them an emoji presentation, and where they are sufficiently distinct from > other emoji characters. Symbols with a graphical form that people may treat > as pictographs, such as U+2615 HELM SYMBOL (introduced in Unicode 3.0) may > be included." > > However, when I look up the HELM SYMBOL, it seems to have code U+2388 > [3][4][5]. The character with code U+2615 is HOT BEVERAGE [6][7]. > > Is this a mistake in the technical report? > > Best regards, > Adam Renberg > > [1]: http://www.unicode.org/reports/tr51/index.html > [2]: http://www.unicode.org/reports/tr51/index.html#Methodology > [3]: http://www.unicode.org/charts/PDF/U2300.pdf > [4]: http://unicode-table.com/en/2388/ > [5]: http://www.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.txt > [6]: http://www.unicode.org/charts/PDF/U2600.pdf > [7]: http://unicode-table.com/en/2615/ From mark at macchiato.com Sun Aug 30 05:21:14 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Sun, 30 Aug 2015 12:21:14 +0200 Subject: Wrong character code for HELM SYMBOL in TR 51 Unicode Emoji? In-Reply-To: References: Message-ID: Thanks, that's a mis-edit. The following text should have been removed: ". Symbols with a graphical form that people may treat as pictographs, ... are categorized as emoji" Mark *? Il meglio ? l?inimico del bene ?* On Sun, Aug 30, 2015 at 2:33 AM, Garth Wallace wrote: > It certainly looks that way. In just the next paragraph it mentions > "U+2615 HOT BEVERAGE (introduced in Unicode 4.0)" > > On Sat, Aug 29, 2015 at 3:47 PM, Adam Renberg wrote: > > Hi, > > > > I've just read through the Unicode Technical Report #51 Unicode Emoji > [1], > > and I have a question. 
In section 3.3 Methodology [2], third paragraph, > it > > says: > > > > "This document takes a functional view regarding the identification of > > emoji: pictographs are categorized as emoji when it is reasonable to give > > them an emoji presentation, and where they are sufficiently distinct from > > other emoji characters. Symbols with a graphical form that people may > treat > > as pictographs, such as U+2615 HELM SYMBOL (introduced in Unicode 3.0) > may > > be included." > > > > However, when I look up the HELM SYMBOL, it seems to have code U+2388 > > [3][4][5]. The character with code U+2615 is HOT BEVERAGE [6][7]. > > > > Is this a mistake in the technical report? > > > > Best regards, > > Adam Renberg > > > > [1]: http://www.unicode.org/reports/tr51/index.html > > [2]: http://www.unicode.org/reports/tr51/index.html#Methodology > > [3]: http://www.unicode.org/charts/PDF/U2300.pdf > > [4]: http://unicode-table.com/en/2388/ > > [5]: http://www.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.txt > > [6]: http://www.unicode.org/charts/PDF/U2600.pdf > > [7]: http://unicode-table.com/en/2615/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Mon Aug 31 08:27:17 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 31 Aug 2015 15:27:17 +0200 (CEST) Subject: Thai Word Breaking In-Reply-To: <20150830000900.4e2339ad@JRWUBU2> References: <20150822143530.29f1e883@JRWUBU2> <2062952758.26825.1440704985138.JavaMail.www@wwinf1g16> <20150828000952.14f6ca50@JRWUBU2> <1275427180.13085.1440880437214.JavaMail.www@wwinf1n31> <20150830000900.4e2339ad@JRWUBU2> Message-ID: <1216480868.9953.1441027637361.JavaMail.www@wwinf1f13> On 30 Aug 2015 at 01:17, Richard Wordingham wrote: > On Sat, 29 Aug 2015 22:33:57 +0200 (CEST) > Marcel Schneider wrote: > > > So when I have the ordinal indicators both on *one* key because > > I?need the A and O for German precomposed, and have the ? in the Base > > shift state and the ? 
in the Shift shift state [sorry: º in Kana=i, ª in Shift+Kana+i] > > (because the primary > > locale is French, which does use º but not ª, and BTW the ?? is on N, > > too), may I be accused of discrimination? > > Your defence would be that that "practice is objectively justified by a > legitimate aim and the means of achieving that aim are appropriate and > necessary" - 2000/43/EC Article 1 Paragraph 2(b). Mock not. This is IMHO a very good defence. It exactly matches the situation on the quoted keyboard layout. However, if there is any risk that anybody could take offence at finding the feminine ordinal indicator less easily accessed than the masculine one, be it in an unlucky moment of fatigue or personal disappointment, it will be wiser to take them away from that spot. That's what I've done in the wake of this discussion, and I thank you for having made us aware of the existence of these problems. (To lessen the damage, I've added a couple of additional Compose sequences: Compose, i, a → ª; Compose, i, o → º.) > In the UK, > needlessly requiring that a job applicant have a driving licence is > unlawful discrimination against women. Not making provision for the > hard of hearing at a query desk can be unlawful discrimination - I don't > remember whether it was by disability or simply on the basis of age. These legal provisions have considerable merit. IMHO one could even sum them up with respect to technology: nobody may either require needless technological skill from others, or neglect to provide the technological devices needed to relieve those suffering from age and/or disability. > I'm not sure to what extent these are common EU law and to what extent > these are just British law. I hope they are common EU law; otherwise they'll have to be implemented.
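The two Compose sequences mentioned above could be written, for illustration only, in the XCompose notation used on X11 systems (this is an assumption for the sketch: MSKLC-built Windows drivers implement such sequences through their own dead-key/chained-key tables, not through an XCompose file):

```
# Hypothetical XCompose entries for the two sequences described in the mail:
<Multi_key> <i> <a> : "ª"   # U+00AA FEMININE ORDINAL INDICATOR
<Multi_key> <i> <o> : "º"   # U+00BA MASCULINE ORDINAL INDICATOR
```

The point of a Compose fallback is exactly the one argued in the thread: a character demoted from a prime key position stays reachable through a mnemonic sequence.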
In the last food allergen emoji thread, William Overington already reported some British legal provisions that I found to be superior to those applicable in other G8 countries: http://www.unicode.org/mail-arch/unicode-ml/y2015-m07/0227.html > > I've got some web pages where colour-coding is used. It looks as > though I'm now supposed to find a way of switching the colours to help > those with impaired colour vision. Perhaps I'll just have to withdraw > the pages. Yet another point I must monitor, as I too use colour-coding in the layout overview, where some formatting styles are defined in Excel: one for CapsLock-sensitive key positions, one for KanaLock-sensitive key positions, one for dead keys, and so on. It's hard for me to work out which colours I can combine with which others to accommodate impaired colour vision. Perhaps there should be several colour schemes, along with one in black and white with grey tones. For PDF, this should be feasible in Excel by setting the styles' background and foreground colours. (The layout not being finished, it's still offline.) I hope you will find a solution that lets you keep your pages up. (BTW I'm quite curious, but that doesn't matter.) Marcel From charupdate at orange.fr Mon Aug 31 09:12:08 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 31 Aug 2015 16:12:08 +0200 (CEST) Subject: Custom keyboard source samples In-Reply-To: <1408769184.13369.1440084638814.JavaMail.www@wwinf1d11> References: <1408769184.13369.1440084638814.JavaMail.www@wwinf1d11> Message-ID: <1409194975.10963.1441030328840.JavaMail.www@wwinf1f13> > On 20 Aug 2015 at 03:19, Richard Wordingham wrote: > On Mon, 17 Aug 2015 13:51:26 +0200 (CEST) > Marcel Schneider wrote: >> Since yesterday I know a very simple way to get the source code (in >> C) of any MSKLC layout. > Is this legal? To me it smacks of reverse engineering, which is > prohibited under the MSKLC licence.
On 20 Aug 2015 at 17:41 I'd already discarded the licence as not applying, but found another prohibited practice, the unlawful use of the software, which may be used "only in certain ways", with no workarounds allowed: > Do I "work around any technical limitations in the software" by picking up the source code of the drivers it generates? This is my main concern about this practice. Are we allowed to use files generated by MSKLC that are not expressly provided to the user? > Further, are we allowed to use installation packages generated by MSKLC to install keyboard drivers other than those generated by MSKLC? To install keyboard drivers that exceed the limitations of MSKLC? > The questioning becomes even more troublesome when we remember that the WDK is mentioned in the MSKLC Help, and ask: > When we accept the invitation to switch to the WDK, must we package the drivers with the resources the driver kit comes with (while not knowing how to write an INF file!), or may we use the MSI and setup from MSKLC? > BTW we may wonder why and how MSKLC compiles a Windows-on-Windows driver, while, except for a few sparse mentions, nothing seems to be provided for WOW in the WDK. These questions are now about to be answered: http://www.siao2.com/2011/04/09/10151666.aspx "I believe it is even technically a EULA violation, though I can't imagine it ever being enforced in this case, I mean unless someone tried to sue Microsoft for negative consequences of using a keyboard created by this means. In which case the use of the hack would be a pretty reasonable indemnification of Microsoft here, since someone delved into the land of the specifically unsupported..." We note that Michael Kaplan does not give legal advice, as he *believes* things are so. But I'm quite inclined to take it for reality, as it makes Microsoft a more user-friendly company, one that cares about everybody being at ease with the computer, whatever the keyboard they prefer may look like.
Technically, there is a very straightforward way to get the source code of any keyboard from its .klc file. At http://www.siao2.com/2011/04/16/10154700.aspx we are shown how to edit the klc file and run Kbdutool on it to get the drivers, put these enhanced drivers into the package, and install. These include the WoW driver, which we can't compile in the WinDDK of the era. (Windows-on-Windows is the 32-bit subsystem running on 64-bit machines to support 32-bit applications. This requires a second keyboard driver, the one that is in the wow64 folder. If this DLL is missing when required, keyboard layout installation fails.) When we ask Kbdutool for its switches, it also shows us the -s switch for generating sources without compiling. So to get the C sources, we need to use the two switches -u and -s. Marcel From charupdate at orange.fr Mon Aug 31 13:52:53 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 31 Aug 2015 20:52:53 +0200 (CEST) Subject: Custom keyboard source samples Message-ID: <112795365.21952.1441047173312.JavaMail.www@wwinf1c24> > On 18 Aug 2015 at 10:09, Philippe Verdy wrote: > I don't know why these C source files need to be deleted so fast when they could just remain in the same folder as the saved .klc file. On 20 Aug 2015 at 17:42, I still didn't understand (my soothing answer of 18 Aug 2015 at 11:42 consisted of mere suppositions), and, angry at the idea of being outlawed when trying to fix some issues in keyboard customization practice, I even threw a bunch of subsidiary questions into the air: > Why is there no option "Keep the C sources / Delete the C sources"? > Why are there no menu items "Generate C source" and "Build from C source", or an option "Build from KLC source / Build from C source"? > That's what I wished to find in MSKLC when I learned about it.
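The Kbdutool steps described above can be sketched as a short command sequence. This is a sketch only: "Layout.klc" is a placeholder file name, and the exact list of emitted files is as reported in the cited blog posts rather than verified here.

```shell
# Sketch, assuming MSKLC's kbdutool.exe is on the PATH and Layout.klc
# is a layout saved from MSKLC ("Layout.klc" is a placeholder name).

# Running Kbdutool without arguments prints its usage and switches,
# among which the -s switch mentioned above:
kbdutool

# -u selects the Unicode (.klc) source format; -s generates the C
# sources without compiling them:
kbdutool -u -s Layout.klc

# The C sources (reportedly files such as Layout.C, Layout.H, Layout.RC
# and Layout.DEF) are then left in the working directory instead of
# being deleted after the build.
```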
> Figure that before, I had not even imagined that such sources could ever exist. Let's face it: Kbdutool stores the intermediate files, among which the much-coveted C sources, right in the working directory. You see them appearing and being deleted. And, as I wrote a short while ago, if we wish to take a glance, we're welcome: there's an extra switch for that. Only when Kbdutool is driven by MSKLC are all those files kept out of sight. To work out why this is so, the key is IMHO found in these two blog posts: http://www.siao2.com/2013/04/19/10409187.aspx http://www.siao2.com/2013/04/23/10413216.aspx It's as if everybody at Microsoft were traumatized by seeing one nation prefer a keyboard different from the usual one. Here we must instantly deliver the hidden part of the story, I mean, the part that is not taken into consideration: The Canadian Multilingual Standard keyboard is a genuine implementation of all parts of the ISO 9995 keyboard standard of the era. And say it loud: Nobody can be blamed for following an ISO standard. Today, the willingness of Canada to implement ISO 9995 has proved to be a very good idea. It's not only hard-wired logic: Apple Inc. offers its products with a physical Canadian Multilingual Standard keyboard. And they're gaining market share! All people using the Canadian Standard keyboard are full of praise, even if Canadians themselves find some details to improve. As for the cited utterances, they are an utter shame (not for the targeted standards body!). Never look for the Right Ctrl key elsewhere than where it's expected... (That's nothing about the layout, and all about ISO keyboard symbols.) Now back to topic. "People ask me all the time how they can add shift states like this in MSKLC, but I always refuse to answer. I don't want to encourage anyone else to author such layouts!" It's this "strange keyboard" trauma, following the kbd.h authors'
terminology, which for me explains the great fear of handing everybody the keys to keyboard programming. They don't miss the point: Once you get the C sources and have the means to compile them to drivers, you can do what you want, or nearly what you want. You have no limits other than Windows'. And these allow for pretty much anything. Consequently, the reason why so much is done to keep the C sources out of reach is care for the user. Users must be guided; they must be prevented from following their folly (please don't misunderstand: I'm talking of their *supposed* folly). Well, that's, to date, my *very long* answer. I don't quite know what is better to hope for this time: to be right, or to be wrong :-| Marcel From charupdate at orange.fr Mon Aug 31 14:00:31 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 31 Aug 2015 21:00:31 +0200 (CEST) Subject: Custom keyboard source samples Message-ID: <854246856.22207.1441047631099.JavaMail.www@wwinf1c24> [Note: I didn't see the announcement when I sent my e-mail. This cost me a lot of time and searching, which I accept, given the topic I'm working on and my wish to deliver a useful answer. I never wanted to interfere with ongoing threads or announcements.] Marcel From tgwizard at gmail.com Mon Aug 31 17:34:17 2015 From: tgwizard at gmail.com (Adam Renberg) Date: Mon, 31 Aug 2015 22:34:17 +0000 Subject: Wrong character code for HELM SYMBOL in TR 51 Unicode Emoji? In-Reply-To: References: Message-ID: Thank you for the clarification. Should the text be updated? On Sun, Aug 30, 2015 at 12:26 PM Mark Davis ?? wrote: > Thanks, that's a mis-edit. The following text should have been removed: > ". Symbols with a graphical form that people may treat as pictographs, > ... are categorized as emoji" > > > Mark > > *« Il meglio è l'inimico del bene »* > > On Sun, Aug 30, 2015 at 2:33 AM, Garth Wallace wrote: >> It certainly looks that way.
In just the next paragraph it mentions >> "U+2615 HOT BEVERAGE (introduced in Unicode 4.0)" >> >> On Sat, Aug 29, 2015 at 3:47 PM, Adam Renberg wrote: >> > Hi, >> > >> > I've just read through the Unicode Technical Report #51 Unicode Emoji >> [1], >> > and I have a question. In section 3.3 Methodology [2], third paragraph, >> it >> > says: >> > >> > "This document takes a functional view regarding the identification of >> > emoji: pictographs are categorized as emoji when it is reasonable to >> give >> > them an emoji presentation, and where they are sufficiently distinct >> from >> > other emoji characters. Symbols with a graphical form that people may >> treat >> > as pictographs, such as U+2615 HELM SYMBOL (introduced in Unicode 3.0) >> may >> > be included." >> > >> > However, when I look up the HELM SYMBOL, it seems to have code U+2388 >> > [3][4][5]. The character with code U+2615 is HOT BEVERAGE [6][7]. >> > >> > Is this a mistake in the technical report? >> > >> > Best regards, >> > Adam Renberg >> > >> > [1]: http://www.unicode.org/reports/tr51/index.html >> > [2]: http://www.unicode.org/reports/tr51/index.html#Methodology >> > [3]: http://www.unicode.org/charts/PDF/U2300.pdf >> > [4]: http://unicode-table.com/en/2388/ >> > [5]: http://www.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.txt >> > [6]: http://www.unicode.org/charts/PDF/U2600.pdf >> > [7]: http://unicode-table.com/en/2615/ >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL:
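The code point mix-up settled in this thread is easy to check directly against the Unicode Character Database, for instance with Python's unicodedata module (any UCD-backed tool would do; this is just an illustration):

```python
import unicodedata

# TR51's mis-edited sentence paired the name HELM SYMBOL with the code
# point U+2615; the UCD shows they belong to two different characters.
print(unicodedata.name("\u2615"))  # HOT BEVERAGE (Miscellaneous Symbols block)
print(unicodedata.name("\u2388"))  # HELM SYMBOL (Miscellaneous Technical block)
```

Running this confirms the correction made in the thread: U+2615 is HOT BEVERAGE, and HELM SYMBOL lives at U+2388.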