From charupdate at orange.fr Sun Aug 2 07:26:45 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Sun, 2 Aug 2015 14:26:45 +0200 (CEST)
Subject: Windows 10 release (was: Re: WORD JOINER vs ZWNBSP)
Message-ID: <1584793787.9398.1438518405314.JavaMail.www@wwinf1m23>

On 30 Jul 2015 at 20:56, Doug Ewell wrote:
> Marcel Schneider wrote:
>>> Unfortunately that doesn't work on at least one recent version of
>>> Windows. An unambiguous bug was due to the presence of 0x2060 in the
>>> Ligatures table. This cost me a whole workday to track down, fix,
>>> and verify.

The bug on Windows I encountered at the end of July has now been definitely identified and reproduced. After the ninety-five drivers I have compiled since the bug appeared, I can say this much: the problem is related to the length of the so-called ligatures. When the MSKLC was built, they were limited to four characters on Windows (see the glossary in the MSKLC Help). On my machine the maximal length is 16 characters. The problem is that this limit is not the same on all shift states, and perhaps not on all keys. Roughly, I can put five characters on modification number three, which is normally AltGr, but not on #4 (Shift+AltGr). Attributing the problem to the presence of 0x2060 was a misinterpretation.

[About why five characters: the ellipsis made of three times PERIOD often looks better, or seems to, *and* is part of all fonts, *and* doesn't break when a server enforces Latin-1 even on the UTF-8 pages it serves itself (see last month's thread «UTF-8 display»). The complete sequence is a bracketed ellipsis, for more usefulness in a context of quotation. I wanted the bracketed U+2026 on Ctrl+Alt+Period, and the bracketed three periods on Shift+Ctrl+Alt+Period. Now it's the other way round.]
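The length rule Marcel describes can be sketched abstractly. This is a toy model in Python, not the real kbd.h structures from the WDK; the names NONE, MAX_LIG, pad and lig_len are illustrative stand-ins:

```python
# A minimal model of a WDK-style "ligature" row: one key in one shift state
# expands to up to 16 UTF-16 code units, with unused slots filled by a NONE
# sentinel, as in the driver source quoted in this thread.
NONE = object()   # stand-in for the driver's NONE filler value
MAX_LIG = 16      # the per-row maximum Marcel measured on his machine

def lig_len(row):
    """Number of code units a row actually emits (slots before the first NONE)."""
    n = 0
    for unit in row:
        if unit is NONE:
            break
        n += 1
    return n

def pad(units):
    """Pad an output sequence to a full row with NONE sentinels."""
    return list(units) + [NONE] * (MAX_LIG - len(units))

# The "works" variant: bracketed three periods on modification #3 (5 units),
# bracketed U+2026 HORIZONTAL ELLIPSIS on modification #4 (3 units).
mod3 = pad(['[', '.', '.', '.', ']'])
mod4 = pad(['[', '\u2026', ']'])

print(lig_len(mod3), lig_len(mod4))  # 5 3
```

The observed bug then amounts to: `lig_len` values that are accepted on modification #3 are rejected on #4, even though both are well under the 16-slot row size.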
The following source lines show the sole difference between a driver that bugs and a driver that works fine:

Bugs:

{VK_OEM_PERIOD /*T33 B08*/ ,3 ,'[' ,0x2026 ,']' ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE }, //
{VK_OEM_PERIOD /*T33 B08*/ ,4 ,'[' ,'.' ,'.' ,'.' ,']' ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE }, //

Works:

{VK_OEM_PERIOD /*T33 B08*/ ,3 ,'[' ,'.' ,'.' ,'.' ,']' ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE }, //
{VK_OEM_PERIOD /*T33 B08*/ ,4 ,'[' ,0x2026 ,']' ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE }, //

I was about to catalogue the cases depending on shift states (and possibly key scan codes and so on), but I encountered so many keyboard bugs (a Windows key press registered while the key was not pressed; arrow keys disabled; Backspace disabled; and so on) that I decided not to waste more than the past week on that problem. BTW, I have become well aware that the so-called Windows 7 Starter is not Windows Seven but a sort of restyled Windows Vista. That's why its version number is 6.1. I think debugging Windows Vista isn't worthwhile, as today we have Windows 10, which as the ultimate Windows version must have fixed all those bugs. I'm hopeful, and I expect people will ♥ it.

>>> The effect of the bug was that Word, Excel, Firefox and Zotero were
>>> unstartable.

Only when the faulty driver is that of the default keyboard at Windows startup. Otherwise the apps aren't blocked, but the buggy keyboard layout is disabled; in Word, that is, not in the built-in Notepad.

>>> As a result, the WORD JOINER cannot be implemented on a driver based
>>> keyboard layout for general use on Windows. By contrast, the ZWNBSP
>>> can.

That's complete nonsense, sorry. Both can be implemented in the driver.

> and:
>> The so-called ligatures, by contrast, must not be constructed with
>> 0x2060.
This, however, was the case for three items:
>>
>> - A justifying no-break space emulation 0x2060 0x0020 0x2060, for use
>> in word processors where the NBSP is not justifying, unlike in
>> desktop publishing and the high-end editing software Philippe Verdy
>> referred to, where U+00A0 is justifying. Its not being justifying in word
>> processing is consistent with the need to use U+00A0 along with
>> punctuation in French, and with the lack of U+202F in many fonts.
>>
>> - A colon with such a justifying no-break space, for use in documents
>> that imitate the usage of at least a part, if not the mainstream, of old-
>> fashioned typography: 0x2060 0x0020 0x2060 0x003a.
>>
>> - A punctuation apostrophe emulation 0x2060 0x0027 0x2060, mapped to
>> Kana + I.

There is a mistake in my e-mail: the curly punctuation apostrophe is emulated using the letter apostrophe. The sequence runs: 0x2060 0x02bc 0x2060. I'm not sure, however, that this is useful, as such sequences are obtainable by autocorrect, where the word joiners are really useful, while in English the letter apostrophe is preferable (whereas other languages can use U+2019 unambiguously).

>>
>> I'm about to test on another Windows Edition. I wonder if there is a
>> real issue or not, as you are suggesting. Nevertheless I believe that
>> no such bugs should occur in any version and edition of Windows.

That remains true, as the versions we're talking about are known to be unstable. But nobody's perfect, and everybody's invited to improve, notably on keyboard layouts, which traditionally are neglected in favour of upper-level tools and high-end programs.

> I created, installed, and activated an MSKLC keyboard with the three WJ
> sequences described above, mapped for convenience to AltGr+Z, AltGr+X,
> and AltGr+C respectively

Thank you again.
Curiously, I hadn't even had the idea; perhaps the missing dead key chaining and some other limitations led me to rely rather on the WDK once I became aware of its existence (despite its mention in the MSKLC glossary) through an explanatory web page.

> (not the Kana key, which I don't have),

I'm using the standard keyboard on my netbook and wouldn't have any Kana key either, but the Windows Driver Kit allows adding it as a modifier and as a toggle. Using Kana as the main 3rd level helps limit the mixing-up of Ctrl+Alt with AltGr. I unmapped the latter and am using Ctrl+Alt in a few cases, like this one.

> and had no trouble opening or using any applications on Windows 7, including
> the four mentioned above (except Zotero, which I don't use). KLC source
> available on request.

Thank you for the proposal. Your test has even brought me to the idea of making a patch of the layout I'm working on, so I took a subset and made it from scratch in MSKLC. That's much safer and easier to install. KanaLock is emulated using SGCaps. Compose could be emulated using other apps. But for a number of non-English languages, which SGCaps is for, CapsLock and the easy input of multiply diacriticized letters are missing.

> I wouldn't have wasted the 15 minutes but for the continuing, tiresome
> rhetoric about Windows bugs.

I'm sorry. As stated above, Windows made me waste not only fifteen minutes, but about fifty hours. And I'm not even talking about all the other cases and my far more than one thousand noted desiderata.

Best regards,

Marcel Schneider
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From richard.wordingham at ntlworld.com Sun Aug 2 16:23:19 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 2 Aug 2015 22:23:19 +0100 Subject: Mongolian Joining Type Message-ID: <20150802222319.25a2dd7e@JRWUBU2> I've been trying to understand the joining type logic that categorises a Mongolian letter as isolated, initial, medial or final, and the consequent effect of free variation selectors. As far as I can tell, it is currently supposed to be controlled by the property joining_type. However, this property appears only to have been non-trivially assigned to the characters of the Mongolian script from Version 6.3.0. How was categorisation assigned before then? I am particularly interested in the intended effects of U+202F NARROW NO-BREAK SPACE and U+180E MONGOLIAN VOWEL SEPARATOR. They seem to presently have a Joining_Type value of Non_Joining, but some things would make more sense to me if they had the value Dual_Joining. I am wondering if their effective value has changed; e.g. previously the definitions for Mongolian characters worked as though they were dual joining, but when matters were formalised they accidentally became non-joining. Richard. 
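The derivation Richard asks about can be sketched in a few lines of Python. This is an assumption-laden sketch: the two data lines only imitate the `code; schematic name; joining type; joining group` layout of the UCD's ArabicShaping.txt rather than quoting any specific version, and the default rule (Transparent for Mn/Me/Cf characters, else Non_Joining) follows UAX #44:

```python
# Sketch of how Joining_Type is derived: characters listed in
# ArabicShaping.txt get R/L/D/C/T explicitly; unlisted characters default
# to T (Transparent) if their General_Category is Mn, Me or Cf, and to
# U (Non_Joining) otherwise.  The sample lines below are illustrative.
import unicodedata

SAMPLE_ARABIC_SHAPING = """
1820; MONGOLIAN A; D; No_Joining_Group
"""

def load_listed(text):
    """Parse 'code; name; type; group' lines into {codepoint: joining_type}."""
    listed = {}
    for line in text.splitlines():
        line = line.split('#', 1)[0].strip()
        if not line:
            continue
        code, _name, jtype, _group = [f.strip() for f in line.split(';')]
        listed[int(code, 16)] = jtype
    return listed

def joining_type(cp, listed):
    if cp in listed:
        return listed[cp]
    if unicodedata.category(chr(cp)) in ('Mn', 'Me', 'Cf'):
        return 'T'
    return 'U'

listed = load_listed(SAMPLE_ARABIC_SHAPING)
print(joining_type(0x1820, listed))  # D (dual joining, per the sample line)
print(joining_type(0x202F, listed))  # U (a space is not Mn/Me/Cf, so Non_Joining)
```

Under this default rule, U+202F NARROW NO-BREAK SPACE (General_Category Zs) comes out Non_Joining unless a UCD version lists it explicitly, which matches the behaviour Richard observes.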
From leob at mailcom.com Sun Aug 2 19:55:38 2015
From: leob at mailcom.com (Leo Broukhis)
Date: Sun, 2 Aug 2015 17:55:38 -0700
Subject: Emoji characters for food allergens
In-Reply-To: <1958555392.26939.1438376320812.JavaMail.www@wwinf1j14>
References: <20150728122416.665a7a7059d7ee80bb4d670165c8327d.e45e67032a.wbe@email03.secureserver.net> <27556497.34395.1438177379567.JavaMail.defaultUser@defaultHost> <55B9A182.2030504@kli.org> <32045577.9821.1438246295044.JavaMail.defaultUser@defaultHost> <1369192884.16904.1438276057084.JavaMail.www@wwinf1h34> <29774584.15201.1438335472835.JavaMail.defaultUser@defaultHost> <1958555392.26939.1438376320812.JavaMail.www@wwinf1j14>
Message-ID:

The discussion widens:
http://tech.slashdot.org/story/15/08/02/2248257/unicode-consortium-looks-at-symbols-for-allergies

From mark at macchiato.com Mon Aug 3 03:39:13 2015
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Mon, 3 Aug 2015 10:39:13 +0200
Subject: Emoji characters for food allergens
In-Reply-To:
References: <20150728122416.665a7a7059d7ee80bb4d670165c8327d.e45e67032a.wbe@email03.secureserver.net> <27556497.34395.1438177379567.JavaMail.defaultUser@defaultHost> <55B9A182.2030504@kli.org> <32045577.9821.1438246295044.JavaMail.defaultUser@defaultHost> <1369192884.16904.1438276057084.JavaMail.www@wwinf1h34> <29774584.15201.1438335472835.JavaMail.defaultUser@defaultHost> <1958555392.26939.1438376320812.JavaMail.www@wwinf1j14>
Message-ID:

BTW, the UTC declined to accept the allergen emoji set proposal. While some of the food items may be acceptable and the emoji subcommittee could re-propose them, there are principled problems with trying to deal with allergens as a set of emoji. So that is off the table.

Mark

*— Il meglio è
l'inimico del bene —*

On Mon, Aug 3, 2015 at 2:55 AM, Leo Broukhis wrote:
> The discussion widens:
>
> http://tech.slashdot.org/story/15/08/02/2248257/unicode-consortium-looks-at-symbols-for-allergies

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mailinglists at ngalt.com Mon Aug 3 03:49:40 2015
From: mailinglists at ngalt.com (Nathan Sharfi)
Date: Mon, 3 Aug 2015 01:49:40 -0700
Subject: Emoji characters for food allergens
In-Reply-To:
References: <20150728122416.665a7a7059d7ee80bb4d670165c8327d.e45e67032a.wbe@email03.secureserver.net> <27556497.34395.1438177379567.JavaMail.defaultUser@defaultHost>
Message-ID: <63A5860F-743C-4D50-8BD9-FF279842FF1E@ngalt.com>

> On Jul 29, 2015, at 7:27 AM, Andrew West wrote:
>
> On 29 July 2015 at 14:42, William_J_G Overington
> wrote:
>>
>> For example, one such character could be used to be placed before a list of
>> emoji characters for food allergens to indicate that a list of dietary
>> needs follows.
>>
>> For example,
>>
>> My dietary need is no gluten no dairy no egg
>>
>> There could be a way to indicate the following.
>>
>> My diet can include soya
>
> There already is, you can write "My diet can include soya".
>
> If you are likely to swell up and die if you eat a peanut (for
> example), you will not want to trust your life to an emoji picture of
> a peanut which could be mistaken for something else or rendered as a
> square box for the recipient. There may be a case to be made for
> encoding symbols for food allergens for labelling purposes, but there
> is no case for encoding such symbols as a form of symbolic language
> for communication of dietary requirements.
>
> Andrew

I've recently tried to closely follow the care tags on my clothes instead of dumping most of them in the cold/cold batch. When I look at the care tags, I squint at the hieroglyphs[1] for five seconds, give up, and then start looking for instructions written in English, that is, useful instructions.
I'd imagine a chef trying to 'read' dietary-needs symbols would find it similarly trying, only with dire consequences for getting it wrong.

I can see why someone might want to communicate their allergies in a language-agnostic manner while traveling abroad, but for that to work, everyone would need to memorize a bunch of pictographs on the off chance that a foreign traveller is incapable of conveying his or her allergies in a mutually understood spoken/written language. This seems like a worse strategy than carrying around a card that says "I can't have nuts or eggs".

[1] https://en.wikipedia.org/wiki/Laundry_symbol

From c933103 at gmail.com Mon Aug 3 06:38:27 2015
From: c933103 at gmail.com (gfb hjjhjh)
Date: Mon, 3 Aug 2015 19:38:27 +0800
Subject: Emoji characters for food allergens
In-Reply-To: <63A5860F-743C-4D50-8BD9-FF279842FF1E@ngalt.com>
References: <20150728122416.665a7a7059d7ee80bb4d670165c8327d.e45e67032a.wbe@email03.secureserver.net> <27556497.34395.1438177379567.JavaMail.defaultUser@defaultHost> <63A5860F-743C-4D50-8BD9-FF279842FF1E@ngalt.com>
Message-ID:

The general public only needs to understand the symbols that relate to themselves; people who prepare food can have a legend, written in their own language, for which icon means what. And I think it is actually better to establish a separate standard instead of basing it on Unicode, as Unicode cannot do the job of promoting people to use these symbols, unlike what a standards committee can do.

On 3 Aug 2015 at 16:53, "Nathan Sharfi" wrote:
>
> > On Jul 29, 2015, at 7:27 AM, Andrew West wrote:
> >
> > On 29 July 2015 at 14:42, William_J_G Overington
> > wrote:
> >>
> >> For example, one such character could be used to be placed before a
> list of
> >> emoji characters for food allergens to indicate that a list of
> dietary
> >> needs follows.
> >>
> >> For example,
> >>
> >> My dietary need is no gluten no dairy no egg
> >>
> >> There could be a way to indicate the following.
> >> > >> My diet can include soya > > > > There already is, you can write "My diet can include soya". > > > > If you are likely to swell up and die if you eat a peanut (for > > example), you will not want to trust your life to an emoji picture of > > a peanut which could be mistaken for something else or rendered as a > > square box for the recipient. There may be a case to be made for > > encoding symbols for food allergens for labelling purposes, but there > > is no case for encoding such symbols as a form of symbolic language > > for communication of dietary requirements. > > > > Andrew > > I've recently tried to closely follow the care tags on my clothes instead > of dumping most of them in the cold/cold batch. When I look at the care > tags, I squint at the hieroglyphs[1] for five seconds, give up, and then > start looking for instructions written in English ? that is, useful > instructions. > > I'd imagine a chef trying to 'read' dietary-needs symbols would be > similarly trying, only with dire consequences for getting it wrong. > > I can see why someone might want to communicate their allergies in a > language-agnostic manner while traveling abroad, but for that to work, > everyone would need to memorize a bunch of pictographs on the off chance > that a foreign traveller is incapable of conveying his or her allergies in > a mutually understood spoken/written language. This seems like a worse > strategy than carrying around a card that says "I can't have nuts or eggs". > > > [1] https://en.wikipedia.org/wiki/Laundry_symbol > -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Mon Aug 3 09:10:12 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 3 Aug 2015 16:10:12 +0200 (CEST) Subject: Emoji characters for food allergens Message-ID: <382276435.11376.1438611012864.JavaMail.www@wwinf1g28> On 03 Aug 2015, at 10:39:13, Mark Davis ?? 
wrote:
> BTW, the UTC declined to accept the allergen emoji set proposal. While some of the food items may be acceptable and the emoji subcommittee could re-propose them, there are principled problems with trying to deal with allergens as a set of emoji. So that is off the table.

Since food emoji encoding goes on nonetheless, there are a few points from last week to flash back on; sorry to be late.

On 26 Jul 2015, at 05:45, William_J_G Overington wrote:
> I suggest, in view of the importance of precision in conveying information about food allergens, that the emoji characters for food allergens should be separate characters from other emoji characters. That is, encoded in a separate, quite distinct block of code points, far away in the character map from other emoji characters, with no dual meanings for any of the characters: a character for a food allergen should be quite separate and distinct from a character for any other meaning.

I fear that discouraging the use of food pictographs, whether they represent allergen-containing food or not, in everyday messages and with any desirable meaning could prevent users from becoming familiar with them. Browsing the Charts, I cannot see another place for food-related symbols than the last such block, Supplemental Symbols and Pictographs at 1F900. This is already far away from the Emoticons block at U+1F600. The meaning, as often in Unicode, is context-determined. Finding an allergen pictograph on a food package with convenient markup would then have added an unambiguous sense. I'll try to explain a bit later why emojis may be useful.

> I opine that having two separate meanings for the same character, one meaning as an everyday jolly good fun meaning in a text message and one meaning as a specialist food allergen meaning, could be a source of confusion. Far better to encode a separate code block with separate characters right from the start than risk needless and perhaps medically dangerous confusion in the future.
Unicode would have encoded the new allergen pictographs under a Food Allergens subhead, thus ensuring the primary meaning. Furthermore, an everyday use will be less obvious in most cases, as food allergens are preferably depicted as ingredients, while typical everyday food emojis, like the fast food shown elsewhere in this thread, mostly show prepared food. For example, U+1F35A COOKED RICE, while gluten-free, will rather have a dish meaning, whereas U+1F33E EAR OF RICE may refer more precisely to the ingredient, and a future EAR OF WHEAT would symbolize gluten-containing cereals, as it already does in food labelling. BTW I find it urgent to encode all these ears (WHEAT, BUCKWHEAT, which is part of the proposal, and so on), because at present only *two* kinds of cereals have their ear in Unicode: U+1F33D EAR OF MAIZE and U+1F33E EAR OF RICE.

> I suggest that for each allergen there be two characters.
> The glyph for the first character of the pair goes from baseline to ascender.
> The glyph for the second character of the pair is a copy of the glyph for the first character of the pair, augmented with a thick red line from lower left descender to upper right a little above the baseline, the thick red line perhaps being at about thirty degrees from the horizontal. Thus the thick red line would go over the allergen part of the glyph yet just clip it a bit so that clarity is maintained.
> The glyphs are thus for the presence of the allergen and the absence of the allergen respectively.

Sorry, I don't believe that this would have been agreed, because package design is done in high-end software such as QuarkXPress, InDesign, PagePro, so it would be easy to add some expressive and unambiguous markup to a unique symbol. IMHO it might be nice to have something surrounding, like a circle for the presence and a (barred) square for the absence, or conversely.
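The character names cited in this thread are easy to verify programmatically; a quick sketch (the results reflect whatever UCD version your Python build ships, so a very old interpreter may lack these code points):

```python
# Look up the formal Unicode names of the rice emoji discussed above.
# Note U+1F35A: its Unicode name is COOKED RICE, although chart glyphs
# usually draw a bowl of rice.
import unicodedata

for cp in (0x1F35A, 0x1F33D, 0x1F33E):
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}")
```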
About colors: the absence of an allergen from a given food being good news for patients, we could opt for some green tone, while by contrast red conveys rather a warning and might thus be suitable for its presence. By analogy with road signs, a red circle could perhaps best express this case, as allergic consumers must avoid the product. I thought about a triangle, but this has an inner field too small for the symbol while taking up too much room (a triangle being bulkier than a square or circle). If the industry agrees, a triangle for presence may be adopted.

> It is typical in the United Kingdom to label food packets not only with an ingredients list but also with a list of allergens in the food and also with a list of allergens not in the food.
> For example, a particular food may contain soya yet not gluten.
> Thus I opine that two characters are needed for each allergen.

Correspondingly, French legislation requires that the allergens be marked up in bold font style in the ingredients list, and that this be followed by a list of allergens risking contamination due to their use in the workshop. The meaning of the bold markup must be explained (like «In bold: information intended for allergic persons»). The United Kingdom solution is more explicit. The problem is how to transpose this into a CJK context, and that's where the proposed pictographs would become useful.

> I have deliberately avoided a total strike-through at forty-five degrees, as I opine that that could lead to problems distinguishing clearly the glyph for the absence of one allergen from the glyph for the absence of another allergen.

About how to place the slash or backslash, I agree with William that it must not hide the symbol. To achieve this, the allergen pictograph might also be raised to the foreground, being thus fully viewable, while in this case the slash must be very thick and can be continued in outline where it crosses the pictograph.
Its orientation (upper left to lower right, vs lower left to upper right) may be a matter of personal preference, but judging from heraldry, from road signs and from Unicode (U+20E0), the backslash could be slightly more current.

[I have already answered some other points and will mention others in later replies.]

Thank you for having made the Mailing List aware of this proposal and for supporting it. I'm sad that it is now essentially off the table.

All the best,

Marcel Schneider

From charupdate at orange.fr Mon Aug 3 09:18:45 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Mon, 3 Aug 2015 16:18:45 +0200 (CEST)
Subject: Emoji characters for food allergens
Message-ID: <419788703.11590.1438611525033.JavaMail.www@wwinf1g28>

On 28 Jul 2015, at 15:00, Michael Everson wrote [I placed the quotation first]:
> On 26 Jul 2015, at 06:05, Garth Wallace wrote:
>
>> On Sat, Jul 25, 2015 at 9:43 AM, William_J_G Overington
>> wrote:
>>> Emoji characters for food allergens
>>>
>>> An interesting document entitled
>>>
>>> Preliminary proposal to add emoji characters for food allergens
>>>
>>> by Hiroyuki Komatsu
>>>
>>> was added into the UTC (Unicode Technical Committee) Document Register
>>> yesterday.
>>>
>>> http://www.unicode.org/L2/L2015/15197-emoji-food-allergens.pdf
>>>
>>> This is a welcome development.
>>
>> I'm skeptical. I understand the rationale, but several of the proposed
>> characters are essentially SMALL PILE OF BROWN DOTS and would be
>> difficult to distinguish at typical sizes.

[I've already answered on this point.]

> I do NOT understand the rationale.
>
> Emojis are not for labelling things. They're for the playful expression of emotions.
>
> Standardized symbols for allergens might be useful, if there were a textual use for them.
On 28 Jul 2015, at 20:26, Garth Wallace replied:
> Well, there are several emoji for various items encountered in daily
> life, and I think the reasoning is that allergens are important things
> to refer to because of their health effects. It's a bit of a leap to
> say that means there's a need for dedicated pictograms though.
> I agree, it does seem to be putting the cart before the horse.

I believe the issue should be placed back in its original context. All over the world, pictographs convey some vital information to tourists, but more especially in CJK countries they also avoid encumbering packages with lots of Latin, Cyrillic, and possibly Greek and other scripts. Well, personally I would suggest citing the allergens by their Latin scientific name (such as TRITICUM for wheat and, by extension, gluten), but before deprecating the proposal we should ask ourselves and the countries concerned whether such Latin labelling is acceptable. The fact that Mr Komatsu took the pains to work out this proposal tends to prove that it is *not*.

Best regards,

Marcel Schneider

From charupdate at orange.fr Mon Aug 3 09:21:52 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Mon, 3 Aug 2015 16:21:52 +0200 (CEST)
Subject: Emoji characters for food allergens
Message-ID: <1352082955.11634.1438611712981.JavaMail.www@wwinf1g28>

On 29 Jul 2015, at 10:21, William_J_G Overington wrote:
>> Alternately, scanning the EAN barcode on the package could give access to a database intended for food information. This requires the use of a smartphone or other compatible device.
> That is a good idea.
> In which case the emoji would not need to be encoded on the package, yet would be sent by the database facility.

Using the EAN barcode to query a database, with the results sent to the end user, would need a two-way communication link, and that could possibly mean queueing problems, as the database facility would possibly be answering requests from many people.
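As an aside on the EAN mechanism discussed above: the barcode itself carries only an article number plus a check digit, so any allergen data would indeed have to come from the database side. A small sketch of the standard EAN-13 check-digit rule (alternating weights 1 and 3 from the left):

```python
# EAN-13 check digit: weight the first 12 digits 1,3,1,3,... from the left,
# sum them, and take the amount needed to reach the next multiple of 10.
def ean13_check_digit(first12: str) -> int:
    if len(first12) != 12 or not first12.isdigit():
        raise ValueError("expected 12 digits")
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
    return (10 - total % 10) % 10

# 4006381333931 is a commonly cited valid EAN-13 example number.
print(ean13_check_digit("400638133393"))  # 1
```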
> Another possibility would be to encode the Unicode characters for the allergens contained in the food within a QR code (Quick Response Code) on the package.
> Decoding could then be local, in the device being used to scan the QR code.
> [...]

Somehow this device-reliant information system wouldn't make me really happy. IMHO the most straightforward communication relies on the packaging, and for this a standardized set of emojis would have been useful. For more clarity, a textual list may complete the labelling, probably using the Latin scientific names. Every allergic person should then be given by his or her practitioner or other health care provider a personal list of allergens, a kind of allergen profile, both in the local language and in Latin, plus the pictographs. We should perhaps take into consideration that allergen lists may be very long, and translating them into emojis will make them somewhat bulky, particularly on small packages. So the emojis would be used only if desired or required.

Best regards,

Marcel Schneider

From charupdate at orange.fr Mon Aug 3 09:30:11 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Mon, 3 Aug 2015 16:30:11 +0200 (CEST)
Subject: Emoji characters for food allergens
Message-ID: <1310487596.11828.1438612211991.JavaMail.www@wwinf1g28>

On 29 Jul 2015, at 15:42, William_J_G Overington wrote [On 28 Jul 2015, at 22:26, gfb hjjhjh wrote]:
>> As according to http://unicode.org/faq/emoji_dingbats.html , emoji characters do not have single semantics. Which I think is not what the original proposer wants? Or am I misunderstanding that?
> Garth Wallace has already indicated in his reply to your post that it was me, not the original proposer, who wanted single semantics.
> [...]
> The easiest thing appears to be to not call the items emoji.
> I opine that a new word is needed to mean the following.
> A character that looks like an emoji character yet has precise semantics.
> There is an issue here that is, in my opinion, quite fundamental to the future of encoding items that are currently all regarded as emoji: an issue that goes far beyond the matter of encoding emoji characters for food allergens.
> Communication through the language barrier is of huge importance and may become more so in the future.

IMHO we've already overcome the language barrier, as we all communicate in English, after the model of medieval Latin communication across Europe, communication across the ancient Roman Empire, and Koine Greek from Alexander's conquests on.

> Emoji seemed like a wonderful way to achieve communication through the language barrier.

We remember that Esperanto was also a hopeful way to unify language, raising much enthusiasm among its followers. IMHO a pictograph-based script can hardly perform well enough, unless it ends up becoming a kind of new Esperanto, except that it doesn't include speech.

> Yet if semantics are not defined, then there is a problem.

Not only emoji semantics: even natural-language semantics are often not precisely defined, but that doesn't hinder us from defining the semantics of a particular message by adding more words. Equally, an allergen emoji might be preceded or followed by a poison emoji (U+2620) to make the health threat unambiguous.

> Please consider the matter of text to speech in the draft Unicode Technical Report 51.
> I remember years ago I was asked in this mailing list what chat means.
> I think that discussing the meaning of chat is some classic Unicode cultural matter.
> In English it is an informal talk between two or more people, in French it is a cat.

As far as I can see, in today's French, «chat» has the meaning of its English homograph, except when the context makes the (original French) zoological meaning unambiguous. Having said that, I hasten to add that the English word «chat» has been Frenchified to «tchatche», but not very successfully.
> So the sequence of Unicode characters only has meaning in the context in which they are being used.

And Unicode even provides language tags to disambiguate.

> Now the big opportunity with emoji could be to assist communication through the language barrier.

That's right: emojis can assist communication, but they cannot entirely replace classical character-based communication.

> From reading about semantics in the linked document it appears that that opportunity might be disappearing or may have gone already.
> This, in my opinion, is unfortunate.
> The food allergen characters could, by being precisely defined with one and only one meaning, be either an exception to the general situation or the start of a trend.

We cannot define precisely and irrevocably the meaning of any grapheme, except in mathematics. We can only describe its use at a given time in history. I don't believe that Unicode has the power to forbid any semantics of any emoji, nor did it ever aim to. See the English apostrophe: Unicode's primary advice has been overrun by mainstream usage.

> A name other than emoji is needed for such characters that have one and only one meaning, that meaning precisely defined.

Creating a new script is not in Unicode's purpose, which is (please check if I'm right) to encode all *existing* scripts. I underscore *existing* with respect to the present context, but originally the stress is on *all*. Encoding *all* existing scripts used in present or past times is a great purpose, and Unicode is about to reach that goal. Subsequently, *if* a user community creates and uses a *new* script made of pictographs or of other signs, Unicode can be pleased to encode it. Sure.

> [...]
> For example, one such character could be used to be placed before a list of emoji characters for food allergens to indicate that a list of dietary needs follows.
> For example,
> My dietary need is no gluten no dairy no egg
> There could be a way to indicate the following.
> My diet can include soya

My nourishment too includes soya, in the form of much tonyu (whether fermented or not), and it excludes dairy, egg, meat, poultry, fish, honey; things that were very much included in the past. The problem as I see it is whether people are at ease with expressing this or not. Personally I don't hesitate to use much natural language to explain the facts, nor do other people I know of. The difference might be that in these cases the nourishment preferences and aversions result uniquely from awareness of the crimes committed against the animals, whereas dietary requirements basically result from recommendations made by practitioners or other health care providers. The two motivations may overlap.

As communicating dietary requirements results in constraints for other people, especially cooks, servers, attendants, hosts, friends, managers, housekeepers, this communication may often be very sensitive and may induce either self-humiliation or offence, partly also because natural language is never neutral and moreover leaves a margin for interpretation. The task may even turn out to be impossible when foreign languages are involved. Using standardized emojis could greatly ease matters. Had food allergen emojis become available, I would have suggested preparing two bullet lists, stacked or side by side. In the first list, every food emoji is preceded by U+2620 ☠ SKULL AND CROSSBONES. In the second list, every food emoji is preceded by U+2665 ♥ BLACK HEART SUIT. I say «bullet lists», but the array may also be referred to as lists of two-emoji sequences. I can imagine that this would be received with a smile and gladly followed.

> There is a situation that affects further discussion of some aspects of this matter, though not all aspects, as a totally symbolic representation could still be discussed.
> http://www.unicode.org/mail-arch/unicode-ml/y2015-m06/0208.html
> However, there is also the following.
> http://www.oxforddictionaries.com/definition/english/moratorium
> Please note the use of the word temporary in the definition.
> So maybe all is not lost and discussion of all aspects will become possible at some future time.

Alas, in this particular context, «moratorium» is a euphemism with the meaning of a prohibition. Please note that I use angle quotes to avoid giving the impression that I am scare-quoting the word. That's a good example of how useful it is to disambiguate quotation quotes and scare quotes. Well, I could use some supplemental words to express that, like: Alas, in this particular context, the word moratorium, as it is used, is a euphemism with the meaning of a prohibition. It's always the same issue: multiple semantics vs precise definition.

I hope that helps.

All the best,

Marcel Schneider

From charupdate at orange.fr Mon Aug 3 09:36:09 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Mon, 3 Aug 2015 16:36:09 +0200 (CEST)
Subject: Refining communication about poisons (was: Re: Emoji characters for food allergens)
Message-ID: <1993129798.11946.1438612569896.JavaMail.www@wwinf1g28>

On 29 Jul 2015, at 18:39, Doug Ewell wrote:

> Andrew West wrote:
>> There may be a case to be made for encoding symbols for food allergens
>> for labelling purposes, but there is no case for encoding such symbols
>> as a form of symbolic language for communication of dietary
>> requirements.
> For what little it is worth, I agree with Andrew on this.

Sorry, I disagree, as I explained in my previous e-mail:
http://www.unicode.org/mail-arch/unicode-ml/y2015-m08/0009.html

> Earlier I mentioned U+2620 SKULL AND CROSSBONES and U+2623 BIOHAZARD SIGN, two symbols which have been in Unicode since the dawn of time. [...]
> While communication about food allergens is undoubtedly important, it's hard to imagine that communication about poisons and biohazards is any less important.

Agreed.
It's even more important, as food allergies are triggered by slow poisoning through residual pesticides and food additives, and through the consumption of bad-quality cereals grown with an overuse of mineral fertilizers. This is why U+2620 SKULL AND CROSSBONES should be used in food labelling whenever the ingredients are *not* organically grown, or certain food additives are used. To complete, I therefore suggest encoding a panel of food hazard symbols, among which:

+ PESTICIDE RESIDUES SYMBOL
+ MINERAL FERTILIZER OVERUSE SYMBOL
+ ARTIFICIALLY COLOURED SYMBOL (use: certain synthetic food colours cause health issues)
+ STABILIZER SYMBOL
+ SALTY FOOD SYMBOL
+ VETERINARY DRUG RESIDUES SYMBOL (That's so big an issue that the FDA is validating a *new* drug residues analysis selection model for interstate milk shipping.)

and so on.

Equally, artificially impoverished food ingredients like white sugar and white flour act poison-like on the metabolic level (more explanations would be off-topic) and must thus be declared whenever they are not restored with bran, germ, and molasses. To achieve this, the following pictographs will be useful:

+ IMPOVERISHED FOOD WARNING SYMBOL
+ MISSING BRAN AND GERM SYMBOL
+ MISSING MOLASSES SYMBOL

Declaring the slightest and most probably nonexistent traces of food allergens, while concealing from consumers all these health-threatening poisons that are likewise purposely added to everyday food, or that basic carbs are transformed into, is a particularly insidious form of hypocrisy. This criticism must be taken as a motivation to encode these new pictographs. It does not target in any way the proposer of the allergen emojis, nor any other person here around. It refers to the economic background of food allergen labelling, and thus has its place in this thread.
Best regards,

Marcel Schneider

From charupdate at orange.fr Mon Aug 3 10:08:14 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Mon, 3 Aug 2015 17:08:14 +0200 (CEST)
Subject: Emoji characters for food allergens
In-Reply-To: <63A5860F-743C-4D50-8BD9-FF279842FF1E@ngalt.com>
References: <20150728122416.665a7a7059d7ee80bb4d670165c8327d.e45e67032a.wbe@email03.secureserver.net> <27556497.34395.1438177379567.JavaMail.defaultUser@defaultHost> <63A5860F-743C-4D50-8BD9-FF279842FF1E@ngalt.com>
Message-ID: <371935304.12799.1438614495033.JavaMail.www@wwinf1g28>

On 03 Aug 2015, at 10:57, Nathan Sharfi wrote:

> I've recently tried to closely follow the care tags on my clothes instead of dumping most of them in the cold/cold batch. When I look at the care tags, I squint at the hieroglyphs[https://en.wikipedia.org/wiki/Laundry_symbol] for five seconds, give up, and then start looking for instructions written in English - that is, useful instructions.

I'm sorry to really disagree with this scarcely understandable criticism of laundry symbols. The most frequently encountered care tags are self-explaining, such as the washing and ironing temperature limits or the symbols discouraging these treatments. The other symbols mainly concern dry cleaning and laundry professionals. Many clothes are shipped across the world, so English would not always be suitable as a language. Further, symbols stay readable longer, while text is often washed out.

> I'd imagine a chef trying to 'read' dietary-needs symbols would be similarly trying, only with dire consequences for getting it wrong.

That's another case. All chefs understand English, so presenting allergen lists in English is a working strategy. The concern added by William is about how to present such a list, as he wishes a symbol for "I'm allergic to" and a symbol for "My diet can include". For this I suggest the poison and heart symbols. To follow your advice, these may be used as bullets for lists written in English.
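As a rough illustration of the bullet idea above, here is a minimal C sketch that formats such entries as UTF-8 strings, with U+2620 marking foods to avoid and U+2665 marking tolerated foods. The helper function and the item names are invented for this example; nothing here comes from any standard or proposal.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* UTF-8 byte sequences for the two suggested bullet symbols. */
#define AVOID "\xE2\x98\xA0"   /* U+2620 SKULL AND CROSSBONES */
#define OKSYM "\xE2\x99\xA5"   /* U+2665 BLACK HEART SUIT */

/* Format one list entry as "<bullet> <item>" into buf. */
static void entry(char *buf, size_t n, const char *bullet, const char *item) {
    snprintf(buf, n, "%s %s", bullet, item);
}
```

A caller would loop over two arrays of item names and print one `entry` per line, producing the two stacked bullet lists described above.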
> > I can see why someone might want to communicate their allergies in a language-agnostic manner while traveling abroad, but for that to work, everyone would need to memorize a bunch of pictographs on the off chance that a foreign traveller is incapable of conveying his or her allergies in a mutually understood spoken/written language. This seems like a worse strategy than carrying around a card that says "I can't have nuts or eggs". I understand the issue. Best regards, Marcel Schneider From doug at ewellic.org Mon Aug 3 13:02:38 2015 From: doug at ewellic.org (Doug Ewell) Date: Mon, 03 Aug 2015 11:02:38 -0700 Subject: Windows keyboard restrictions (was: Re: Windows 10 release) Message-ID: <20150803110238.665a7a7059d7ee80bb4d670165c8327d.301dafd5ac.wbe@email03.secureserver.net> Marcel Schneider wrote: > The bug on Windows I encountered at the end of July has been > definitely identified and reconstructed. After ninety-five drivers > compiled since the bug appeared, I can tell so much as that the > problem is related to the length of the so-called ligatures. When the > MSKLC was built, they were limited to four characters on Windows (see > glossary in the MSKLC Help). On my machine the maximal length is 16 > characters. The problem is that this is not equal on all shift states > and perhaps keys. Roughly, I can put five characters on modification > number three, that is normally AltGr, but not on #4 (Shift+AltGr). 
As far as I can tell, the limit for a ligature on a Windows keyboard layout is four UTF-16 code points:

MSKLC help, under "Validation Reference":
"Ligatures cannot contain more than four UTF-16 code points"

Presentation from IUC 23 by Michael Kaplan (author of MSKLC) and Cathy Wissink:
http://tinyurl.com/o49r4bz

KbdEdit:
http://www.kbdedit.com/

MUFI:
http://folk.uib.no/hnooh/mufi/keyboards/WinXPkeyboard.html

I understand that there are some tools (such as Keyboard Layout Manager) that claim a higher limit, and it may even be possible in some cases to assign more than four, but the DOCUMENTED limit appears to be four. (If you claim that it is not, please provide a link to the relevant official documentation, and note that C++ code showing 16 fields is not documentation.)

It is not a bug for software to fail to perform BEYOND its documented limits.

Since you are so very eager to declare this a bug, or a collection of bugs, rather than a design limitation, I strongly recommend you get in touch with Microsoft Technical Support and express your concerns to them. Make sure to let them know just how certain you are that these are bugs. See if they'll send you a T-shirt.

--
Doug Ewell | http://ewellic.org | Thornton, CO

From asmus-inc at ix.netcom.com Mon Aug 3 14:01:09 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Mon, 3 Aug 2015 12:01:09 -0700
Subject: Emoji characters for food allergens
In-Reply-To: <371935304.12799.1438614495033.JavaMail.www@wwinf1g28>
References: <20150728122416.665a7a7059d7ee80bb4d670165c8327d.e45e67032a.wbe@email03.secureserver.net> <27556497.34395.1438177379567.JavaMail.defaultUser@defaultHost> <63A5860F-743C-4D50-8BD9-FF279842FF1E@ngalt.com> <371935304.12799.1438614495033.JavaMail.www@wwinf1g28>
Message-ID: <55BFBA75.8090500@ix.netcom.com>

An HTML attachment was scrubbed...
URL: From asmus-inc at ix.netcom.com Mon Aug 3 14:38:06 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Mon, 3 Aug 2015 12:38:06 -0700 Subject: Emoji characters for food allergens In-Reply-To: <371935304.12799.1438614495033.JavaMail.www@wwinf1g28> References: <20150728122416.665a7a7059d7ee80bb4d670165c8327d.e45e67032a.wbe@email03.secureserver.net> <27556497.34395.1438177379567.JavaMail.defaultUser@defaultHost> <63A5860F-743C-4D50-8BD9-FF279842FF1E@ngalt.com> <371935304.12799.1438614495033.JavaMail.www@wwinf1g28> Message-ID: <55BFC31E.2060607@ix.netcom.com> An HTML attachment was scrubbed... URL: From gwalla at gmail.com Mon Aug 3 15:22:01 2015 From: gwalla at gmail.com (Garth Wallace) Date: Mon, 3 Aug 2015 13:22:01 -0700 Subject: Olympic sports emoji In-Reply-To: References: <20150727151200.665a7a7059d7ee80bb4d670165c8327d.8d64c4d981.wbe@email03.secureserver.net> Message-ID: On Mon, Jul 27, 2015 at 10:55 PM, Garth Wallace wrote: > > On Mon, Jul 27, 2015 at 3:12 PM, Doug Ewell wrote: >> Leo Broukhis wrote: >> >>> Fonts vary and can be copyrighted, no doubt, but Unicode is not about >>> fonts. >> >> I was going to bust out the Apple logo as an analogy to the Olympic >> symbols, but apparently the Apple logo is trademarked and not merely >> copyrighted, so never mind. >> >> In any case, if this is just a character/glyph thing, then there >> shouldn't be a problem using either the existing emoji or the ones >> proposed in L2/15-196R for Olympic sports, since the glyphs can simply >> be styled as needed. > > Would this be considered within the normal range of glyphic variation? > Would an icon of two pugilists fighting be an acceptable rendering of > a BOXING GLOVE emoji? > > BTW, speaking as a martial artist myself, I have to say an empty dogi > is an odd representation for martial arts, even specifically Japanese > ones. 
> The proposal says that it could be used for judo, karate, and tae kwon do; it at least matches the first two (they are distinct, but not in a way that would , and practice uniforms for TKD are similar, but competitive TKD under WTF rules (including Olympic competition) uses several pieces of protective equipment (helmet, gloves, chest guard) with colored padding over the dobok.

Also, has anyone else noticed that the proposed WRESTLING emoji doesn't depict competitive wrestling? It's a pair of shirtless men in baggy pants standing straight up, with one apparently grabbing the other by the ponytail and hitting his face.

From petercon at microsoft.com Mon Aug 3 17:24:25 2015
From: petercon at microsoft.com (Peter Constable)
Date: Mon, 3 Aug 2015 22:24:25 +0000
Subject: Emoji characters for food allergens
In-Reply-To: <55BFBA75.8090500@ix.netcom.com>
References: <20150728122416.665a7a7059d7ee80bb4d670165c8327d.e45e67032a.wbe@email03.secureserver.net> <27556497.34395.1438177379567.JavaMail.defaultUser@defaultHost> <63A5860F-743C-4D50-8BD9-FF279842FF1E@ngalt.com> <371935304.12799.1438614495033.JavaMail.www@wwinf1g28> <55BFBA75.8090500@ix.netcom.com>
Message-ID: 

Once back when I was living in Thailand, I was riding in a taxi to the Bangkok airport on a recently-opened highway. There were road signs posted at intervals that had a two-digit number ("60" or something like that) enclosed in a circle. Having had enough experience with road signage in my home country and also other countries, I recognized this to be a speed limit.

But knowing common practices for how many Thais at the time would obtain their driver's license, and the education level of many Thais coming from rural areas to work as taxi drivers in Bangkok, I was curious enough to ask the driver what the sign meant. (He being monolingual, this was all in Thai.) He thought for a moment and then responded that it was the distance to the airport.
Anecdote aside, the assumption of these discussions is that symbols are iconic, which means that the symbol communicates a conventional semantic. And the point of this being _conventional_ is that the semantic is not self-evident from the appearance of the image, but rather is based on a shared agreement. For example, a photograph of a chair is not iconic since it is an ostensive rendition of an actual chair. But a symbol of an iron with a dot inside it intended to mean "can be ironed with low heat" is iconic because its meaning is conventional, and like any convention, must be learned.

Some conventions may be universally learned, but very few are. Most are limited to particular cultures, and even if used in many cultures, may be learned by only small portions of the given culture. Even something like a speed limit sign that a driver within a given culture sees every day and is expected to understand is not necessarily something that the driver has learned. Much less something like icons for handling of laundry, which have been used in several countries for a few decades now but that nobody has ever been required to learn, and that few people actually do learn to any great extent.

Peter

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Asmus Freytag (t)
Sent: Monday, August 3, 2015 12:01 PM
To: unicode at unicode.org
Subject: Re: Emoji characters for food allergens

I'm sorry to really disagree with this scarcely understandable criticism of laundry symbols. The most frequently encountered care tags are self-explaining, such as the washing and ironing temperature limits or the symbols discouraging these treatments. The other symbols mainly concern dry cleaning and laundry professionals.

The laundry symbols are like traffic signs. The ones you see daily aren't difficult to remember, but there are always some rare ones that are a bit baffling. What you apparently do not realize is that in significant parts of the world, these symbols are not common (or occur only as adjunct to text).
There's therefore no daily reinforcement at all. Where you live, the situation is reversed; no wonder you are baffled.

All chefs understand English,

I would regard that statement to have a very high probability of being wrong. Which would make any conclusions based on it invalid.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From asmus-inc at ix.netcom.com Mon Aug 3 18:14:57 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Mon, 3 Aug 2015 16:14:57 -0700
Subject: Emoji characters for food allergens
In-Reply-To: 
References: <20150728122416.665a7a7059d7ee80bb4d670165c8327d.e45e67032a.wbe@email03.secureserver.net> <27556497.34395.1438177379567.JavaMail.defaultUser@defaultHost> <63A5860F-743C-4D50-8BD9-FF279842FF1E@ngalt.com> <371935304.12799.1438614495033.JavaMail.www@wwinf1g28> <55BFBA75.8090500@ix.netcom.com>
Message-ID: <55BFF5F1.6000204@ix.netcom.com>

Nice anecdote.

I share the concerns you raise in your reflection on the limits of shared conventions. Unicode cannot be so constrained that it encodes only universally accepted icons, but it should be constrained to not encode characters on foot of possible conventions that are not actually demonstrated anywhere. There's currently no convention for denoting allergens by emoji (pictorial renditions), so that usage is something that is speculative at the moment.

Not as speculative is the suggestion that certain food items should be added - it seems to be an acceptable principle to encode "iconic" foods. That would argue for emoji for milk and bread (w/o cross aliasing it as gluten), but not for soy beans, for example.

A./

On 8/3/2015 3:24 PM, Peter Constable wrote:
>
> Once back when I was living in Thailand, I was riding in a taxi to the Bangkok airport on a recently-opened highway. There were road signs posted at intervals that had a two-digit number ("60" or something like that) enclosed in a circle.
> Having had enough experience with road signage in my home country and also other countries, I recognized this to be a speed limit.
>
> But knowing common practices for how many Thais at the time would obtain their driver's license, and the education level of many Thais coming from rural areas to work as taxi drivers in Bangkok, I was curious enough to ask the driver what the sign meant. (He being monolingual, this was all in Thai.) He thought for a moment and then responded that it was the distance to the airport.
>
> Anecdote aside, the assumption of these discussions is that symbols are iconic, which means that the symbol communicates a conventional semantic. And the point of this being _conventional_ is that the semantic is not self-evident from the appearance of the image, but rather is based on a shared agreement. For example, a photograph of a chair is not iconic since it is an ostensive rendition of an actual chair. But a symbol of an iron with a dot inside it intended to mean "can be ironed with low heat" is iconic because its meaning is conventional, and like any convention, must be learned.
>
> Some conventions may be universally learned, but very few are. Most are limited to particular cultures, and even if used in many cultures, may be learned by only small portions of the given culture. Even something like a speed limit sign that a driver within a given culture sees every day and is expected to understand is not necessarily something that the driver has learned. Much less something like icons for handling of laundry, which have been used in several countries for a few decades now but that nobody has ever been required to learn, and that few people actually do learn to any great extent.
> Peter
>
> From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Asmus Freytag (t)
> Sent: Monday, August 3, 2015 12:01 PM
> To: unicode at unicode.org
> Subject: Re: Emoji characters for food allergens
>
> I'm sorry to really disagree with this scarcely understandable criticism of laundry symbols. The most frequently encountered care tags are self-explaining, such as the washing and ironing temperature limits or the symbols discouraging these treatments. The other symbols mainly concern dry cleaning and laundry professionals.
>
> The laundry symbols are like traffic signs. The ones you see daily aren't difficult to remember, but there are always some rare ones that are a bit baffling. What you apparently do not realize is that in significant parts of the world, these symbols are not common (or occur only as adjunct to text). There's therefore no daily reinforcement at all.
>
> Where you live, the situation is reversed; no wonder you are baffled.
>
> All chefs understand English,
>
> I would regard that statement to have a very high probability of being wrong. Which would make any conclusions based on it invalid.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From mark at kli.org Mon Aug 3 18:24:07 2015
From: mark at kli.org (Mark E. Shoulson)
Date: Mon, 03 Aug 2015 19:24:07 -0400
Subject: Emoji characters for food allergens
In-Reply-To: <1310487596.11828.1438612211991.JavaMail.www@wwinf1g28>
References: <1310487596.11828.1438612211991.JavaMail.www@wwinf1g28>
Message-ID: <55BFF817.2070401@kli.org>

On 08/03/2015 10:30 AM, Marcel Schneider wrote:
> On 29 Jul 2015, at 15:42, William_J_G Overington wrote:
>
>> Emoji seemed like a wonderful way to achieve communication through the language barrier.
> We remember that Esperanto was also a hopeful way to unify the language, raising much enthusiasm among its followers.
> IMHO a pictograph-based script can hardly be performant enough, unless it ends up becoming a kind of new Esperanto, except that it doesn't include speech.

It's already noted that this is totally out of scope for Unicode, but if you're interested in this kind of pictographic pidgin, take a look at https://www.kwikpoint.com/ Someone already did some of it.

~mark

From richard.wordingham at ntlworld.com Wed Aug 5 14:32:02 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 5 Aug 2015 20:32:02 +0100
Subject: Mongolian Joining Type
In-Reply-To: <20150802222319.25a2dd7e@JRWUBU2>
References: <20150802222319.25a2dd7e@JRWUBU2>
Message-ID: <20150805203202.0d148761@JRWUBU2>

On Sun, 2 Aug 2015 22:23:19 +0100 Richard Wordingham wrote:

> As far as I can tell, it is currently supposed to be controlled by the property joining_type. However, this property appears only to have been non-trivially assigned to the characters of the Mongolian script from Version 6.3.0. How was categorisation assigned before then?

Is there anyone alive and here who remembers? Or knows where to find the information? (As opposed to merely knowing where the information ought to have been recorded.)

Richard.

From roozbeh at unicode.org Wed Aug 5 14:48:59 2015
From: roozbeh at unicode.org (Roozbeh Pournader)
Date: Wed, 5 Aug 2015 12:48:59 -0700
Subject: Mongolian Joining Type
In-Reply-To: <20150805203202.0d148761@JRWUBU2>
References: <20150802222319.25a2dd7e@JRWUBU2> <20150805203202.0d148761@JRWUBU2>
Message-ID: 

These were the original proposals:
http://www.unicode.org/L2/L2012/12202-shaping.txt
http://www.unicode.org/L2/L2012/12360-mong-shaping.txt
(with considerable UTC discussions).

A good trick is going through the posted UTC minutes and searching for the topic you are interested in. Or just do Google searches, restricting your search to site:unicode.org and adding "L2" to the search string.
On Wed, Aug 5, 2015 at 12:32 PM, Richard Wordingham <richard.wordingham at ntlworld.com> wrote:

> On Sun, 2 Aug 2015 22:23:19 +0100 Richard Wordingham wrote:
>
> > As far as I can tell, it is currently supposed to be controlled by the property joining_type. However, this property appears only to have been non-trivially assigned to the characters of the Mongolian script from Version 6.3.0. How was categorisation assigned before then?
>
> Is there anyone alive and here who remembers? Or knows where to find the information? (As opposed to merely knowing where the information ought to have been recorded.)
>
> Richard.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From richard.wordingham at ntlworld.com Wed Aug 5 17:49:52 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 5 Aug 2015 23:49:52 +0100
Subject: Mongolian Joining Type
In-Reply-To: 
References: <20150802222319.25a2dd7e@JRWUBU2> <20150805203202.0d148761@JRWUBU2>
Message-ID: <20150805234952.52a8fcb2@JRWUBU2>

On Wed, 5 Aug 2015 12:48:59 -0700 Roozbeh Pournader wrote:

> These were the original proposals:
> http://www.unicode.org/L2/L2012/12202-shaping.txt
> http://www.unicode.org/L2/L2012/12360-mong-shaping.txt
> (with considerable UTC discussions).
>
> A good trick is going through the posted UTC minutes and searching for the topic you are interested in. Or just do Google searches, restricting your search to site:unicode.org and adding "L2" to the search string.

So how did you obtain the joining types for U+180E MONGOLIAN VOWEL SEPARATOR (MVS) (specified as U) and U+202F NARROW NO-BREAK SPACE (NNBSP) (defaulting to U)? Did you study the Mongolian variation sequences? Did someone tell you how they behaved?

One problem with your source is that UTC discussions are rarely minuted. The decisions are recorded, but not the reasoning. The other is that I am interested in what the state of affairs was before the change.
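The contextual-shaping question being debated here can be made concrete with a small model. The sketch below is a simplified rendering of the Unicode cursive-joining rules (joining-type names follow ArabicShaping.txt; the function names and the reduction of the rules to two booleans are illustrative, not taken from any shipping engine): a letter's form depends on whether its effective neighbours, after skipping transparent (T) characters, are of a joining-causing type.

```c
#include <assert.h>

/* Joining types as in ArabicShaping.txt: U=non-joining, L=left-joining,
 * R=right-joining, D=dual-joining, C=join-causing, T=transparent. */
typedef enum { JT_U, JT_L, JT_R, JT_D, JT_C, JT_T } JoiningType;
typedef enum { ISOLATED, INITIAL, MEDIAL, FINAL } Form;

/* Effective neighbour of position i in direction dir (+1 or -1),
 * skipping transparent characters; a missing neighbour acts like U. */
static JoiningType neighbour(const JoiningType *s, int len, int i, int dir) {
    for (int j = i + dir; j >= 0 && j < len; j += dir)
        if (s[j] != JT_T)
            return s[j];
    return JT_U;
}

/* Contextual form of a dual-joining letter at position i. */
static Form form_at(const JoiningType *s, int len, int i) {
    JoiningType p = neighbour(s, len, i, -1);
    JoiningType n = neighbour(s, len, i, +1);
    int joins_prev = (p == JT_D || p == JT_L || p == JT_C);
    int joins_next = (n == JT_D || n == JT_R || n == JT_C);
    if (joins_prev && joins_next) return MEDIAL;
    if (joins_prev) return FINAL;
    if (joins_next) return INITIAL;
    return ISOLATED;
}
```

In this model, a vowel after MVS can only take a final form if MVS is skipped (T) or itself joins forward (D or L), and a consonant before MVS can only stay medial if MVS is T, D or R; the intersection, T or D, is exactly the conclusion Richard reaches later in the thread.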
I have a suspicion that no-one had defined the meaning of the joining forms because 'it was obvious'.

There are arguments going round that NNBSP acts as though joining to the following character (the one to the right in horizontal text), which would make it joining type L.

MVS seems a bit of an oddity. The standardized variants make most sense if it is of joining type T ('transparent') or D ('dual_joining'), but a further contextual substitution is still required if there is no variation selector.

Richard.

From richard.wordingham at ntlworld.com Wed Aug 5 21:00:14 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 6 Aug 2015 03:00:14 +0100
Subject: Mongolian Joining Type
In-Reply-To: 
References: <20150802222319.25a2dd7e@JRWUBU2> <20150805203202.0d148761@JRWUBU2> <20150805234952.52a8fcb2@JRWUBU2>
Message-ID: <20150806030014.7aa9906d@JRWUBU2>

On Wed, 5 Aug 2015 17:26:57 -0700 Roozbeh Pournader wrote:

> On Wed, Aug 5, 2015 at 3:49 PM, Richard Wordingham <richard.wordingham at ntlworld.com> wrote:
> > MVS seems a bit of an oddity. The standardized variants make most sense if it is of joining type T ('transparent') or D ('dual_joining'), but a further contextual substitution is still required if there is no variation selector.
> That seems contradictory with what the Core Specification says ...

Please tell me where. I couldn't find anything helpful when I looked.

> ... and how I understand the MVS. Would you please provide examples for it behaving like a T or D?

In isolation, the form of the vowel after MVS is produced by the following combinations:

1820 180B; second form; final # MONGOLIAN LETTER A
1821 180B; second form; final # MONGOLIAN LETTER E

(They're the same glyph.) For these to be a final, as opposed to an isolated form, MVS must be T, D or L.
In isolation, the form of the consonant before MVS is produced by the following combinations:

1828 180C; third form; medial # MONGOLIAN LETTER NA
182C 180D; fourth form; medial # MONGOLIAN LETTER QA
182D 180C; third form; medial # MONGOLIAN LETTER GA
1836 180C; third form; medial # MONGOLIAN LETTER YA

For these to be medial, MVS must be T, D or R. Consequently, MVS must be T or D!

If there are no variation selectors, it doesn't really matter what MVS is, provided the contextual changes triggering on MVS change all four forms (isolated, initial, medial and final).

The Mongolian Baiti font is in the process of abandoning support for the above variations in accordance with a deeply buried proposal to tinker with the encoding of Mongolian. (Unicode string encodings aren't stable until there's a large volume of use or a change would be too embarrassing.)

Richard.

From charupdate at orange.fr Thu Aug 6 02:43:54 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Thu, 6 Aug 2015 09:43:54 +0200 (CEST)
Subject: Windows keyboard restrictions
In-Reply-To: <20150803110238.665a7a7059d7ee80bb4d670165c8327d.301dafd5ac.wbe@email03.secureserver.net>
References: <20150803110238.665a7a7059d7ee80bb4d670165c8327d.301dafd5ac.wbe@email03.secureserver.net>
Message-ID: <32441471.4754.1438847034766.JavaMail.www@wwinf1e15>

A part of the documentation you request is available:

"Download Windows Driver Kit Version 7.1.0 from Official Microsoft Download Center." N. p., 1 Dec. 2014. Web. 1 Dec. 2014.
https://www.microsoft.com/en-us/download/details.aspx?id=11800

C:\WinDDK\7600.16385.1\inc\api\kbd.h
Line 469, and preceding.

---

Note: This file can be accessed on-line on third-party websites. You may wish to try a Bing* search for "kbd.h".

* My preferred internet search engine is Bing ☐ [This ballot box U+2610 is on my keyboard at Shift+Kana+L.]
---

To circumvent the issues arising from the word "bug", we may simply ban that word and focus on a few comments:

 * Ligature is an internal name for Wchar sequences that are
 * generated when a specified key is pressed.
 *
 * The ligatures are all in *one* table. [That is, unlike the key
 * mapping allocation tables, which *can* be more than one, and
 * often are so.]
 * The number of characters of the longest ligature
 * determines the number replacing the ‹n› in
 * static ALLOC_SECTION_LDATA LIGATURE‹n›
 * and in
 * ‹n›,
 * sizeof(aLigature[0]),
 * (PLIGATURE1)aLigature
 * below in the static ALLOC_SECTION_LDATA KBDTABLES.
 *
 * The maximum length of ligatures is 16 characters.
 * Characters from the 17th on are discarded.
 *
 * The ligatures table must be defined for ‹n› characters,
 * whether in kbd.h, or kbdfrfre.h, or here before,
 * using the following define:
 * TYPEDEF_LIGATURE(‹n›)
 * For clarification, a trailing comment is added:
 * // LIGATURE‹n›, *PLIGATURE‹n›
 * Tables for up to 5 characters length are already defined in
 * C:\WinDDK\7600.16385.1\inc\api\kbd.h.
 *
 * The remaining Wchar fields of each ligature that is shorter than
 * the maximum length defined may be filled up with 0xF000, or with
 * WCH_NONE as defined in kbd.h, or NONE if defined in the custom header.
 * These entries may be shortened, especially when the ligatures table
 * is not edited in a single spreadsheet table.

What's new for me is that "sometimes" [scare quotes] the ligature length must not exceed four characters. I already knew what's written in the MSKLC Help about this topic, and I explained in my previous e-mail that, when the MSKLC was built, Windows did not support more than four characters per ligature. (That's the only straightforward explanation of this point of the MSKLC.) As this proved to be insufficient, Microsoft must have decided to raise the limit to sixteen.
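To make the comment block above concrete, here is a user-mode sketch of a five-cell ligature table. The typedef, constants, and helper below are mock stand-ins for the WDK's kbd.h definitions (a real driver takes them from kbd.h and is not plain portable C); the two entries mirror the working AltGr/Shift+AltGr assignments discussed earlier in the thread, padded with WCH_NONE.

```c
#include <assert.h>
#include <stddef.h>

/* Mock stand-ins for kbd.h definitions (illustrative, not the WDK's). */
typedef unsigned short WCHAR16;      /* UTF-16 code unit, like WCHAR */
typedef unsigned char  VKEY;
#define WCH_NONE 0xF000u             /* filler for unused trailing cells */
#define NONE     WCH_NONE
#define VK_OEM_PERIOD 0xBEu

/* Roughly what TYPEDEF_LIGATURE(5) produces: key, shift state, 5 cells. */
typedef struct {
    VKEY    VirtualKey;
    VKEY    ModificationNumber;
    WCHAR16 wch[5];
} LIGATURE5;

/* Braced three periods on AltGr (#3), braced ellipsis on Shift+AltGr (#4),
 * i.e. the combination reported to work. */
static const LIGATURE5 aLigature[] = {
    { VK_OEM_PERIOD, 3, { '[', '.', '.', '.', ']' } },
    { VK_OEM_PERIOD, 4, { '[', 0x2026, ']', NONE, NONE } },
};

/* Count the cells actually used by one ligature entry. */
static size_t lig_len(const LIGATURE5 *l) {
    size_t n = 0;
    while (n < 5 && l->wch[n] != WCH_NONE)
        n++;
    return n;
}
```

Per the comment block, an entry shorter than the table width is padded with WCH_NONE; the thread's observation is that whether a five-cell entry is accepted appears to depend on the shift state, which is exactly what the Bugs/Works pair quoted earlier shows.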
Following some advice, I programmed a ligature of 35 characters on Sat Feb 28, 2015; thus I had the opportunity to see that 16 characters were inserted, while the overflow was discarded. Windows worked normally. The mapped virtual key was VK_OEM_COMMA, and the modification number was 8, that is, Shift+AltGr+Kana (which it still is today).

After the recent event, we may add the following:

 * CAUTION: THE INITIAL MAXIMUM LENGTH OF LIGATURES WAS 4 CHARACTERS.
 * DEPENDING ON THE SHIFT STATE, IT MAY HAPPEN THAT LONGER LIGATURES
 * CAUSE THE KEYBOARD DRIVER TO FAIL.
 * IF THE FAILING DRIVER IS THE ONE OF THE DEFAULT KEYBOARD LAYOUT,
 * THE SYSTEM MAY NOT WORK AS EXPECTED.

I'm hopeful that you will agree upon this formulation, and I hope that helps.

Best regards,

Marcel Schneider

On 03 Aug 2015, at 20:02, Doug Ewell wrote:

> Marcel Schneider wrote:
>
> > The bug on Windows I encountered at the end of July has been definitely identified and reconstructed. After ninety-five drivers compiled since the bug appeared, I can tell so much as that the problem is related to the length of the so-called ligatures. When the MSKLC was built, they were limited to four characters on Windows (see glossary in the MSKLC Help). On my machine the maximal length is 16 characters. The problem is that this is not equal on all shift states and perhaps keys. Roughly, I can put five characters on modification number three, that is normally AltGr, but not on #4 (Shift+AltGr).
>
> As far as I can tell, the limit for a ligature on a Windows keyboard layout is four UTF-16 code points:
>
> MSKLC help, under "Validation Reference":
> "Ligatures cannot contain more than four UTF-16 code points"
>
> Presentation from IUC 23 by Michael Kaplan (author of MSKLC) and Cathy Wissink:
> http://tinyurl.com/o49r4bz
>
> KbdEdit:
> http://www.kbdedit.com/
>
> MUFI:
> http://folk.uib.no/hnooh/mufi/keyboards/WinXPkeyboard.html
>
> I understand that there are some tools (such as Keyboard Layout Manager) that claim a higher limit, and it may even be possible in some cases to assign more than four, but the DOCUMENTED limit appears to be four. (If you claim that it is not, please provide a link to the relevant official documentation, and note that C++ code showing 16 fields is not documentation.)
>
> It is not a bug for software to fail to perform BEYOND its documented limits.
>
> Since you are so very eager to declare this a bug, or a collection of bugs, rather than a design limitation, I strongly recommend you get in touch with Microsoft Technical Support and express your concerns to them. Make sure to let them know just how certain you are that these are bugs. See if they'll send you a T-shirt.
>
> --
> Doug Ewell | http://ewellic.org | Thornton, CO

From charupdate at orange.fr Thu Aug 6 08:29:14 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Thu, 6 Aug 2015 15:29:14 +0200 (CEST)
Subject: Windows keyboard restrictions
Message-ID: <1232752979.9255.1438867754224.JavaMail.www@wwinf1h09>

I've got a bug in my mailbox. While taking care to send my e-mail in plain text, I got it converted somehow to HTML, with all "tags" disappearing. So I ended up replacing all < and > by single angle quotation marks. That seems safer than converting them to HTML codes. Perhaps I shouldn't call it a bug, but just say that I don't know how to use a mailbox. Sorry to send it twice. N.B.
I'll send two others in reply to the emoji thread, I just can't do it all at once. ___________________________________________________________________ A part of the documentation you request is available: "Download Windows Driver Kit Version 7.1.0 from Official Microsoft Download Center." N. p., 1 Dec. 2014. Web. 1 Dec. 2014. https://www.microsoft.com/en-us/download/details.aspx?id=11800 C:\WinDDK\7600.16385.1\inc\api\kbd.h Line 469, and preceding. --- Note: This file can be accessed on-line on third party websites. You may wish to try a Bing* search for "kbd.h". * My preferred internet search engine is Bing ☐ [This ballot box U+2610 is on my keyboard at Shift+Kana+L.] --- To circumvent the issues arising from the word "bug", we may simply ban that and focus on a few comments:
* Ligature is an internal name for Wchar sequences that are
* generated when a specified key is pressed.
*
* The ligatures are all in *one* table.
* The number of characters of the longest ligature
* determines the number replacing the "n" in
* static ALLOC_SECTION_LDATA LIGATURE"n"
* and in
* "n",
* sizeof(aLigature[0]),
* (PLIGATURE1)aLigature
* below in the static ALLOC_SECTION_LDATA KBDTABLES.
*
* The maximum length of ligatures is 16 characters.
* Characters from 17th on are discarded.
*
* The ligatures table must be defined for "n" characters,
* whether in kbd.h, or kbdfrfre.h, or here before,
* using the following define:
* TYPEDEF_LIGATURE("n")
* For clarification, a trailing comment is added:
* // LIGATURE"n", *PLIGATURE"n"
* Tables for up to 5 characters length are already defined in
* C:\WinDDK\7600.16385.1\inc\api\kbd.h.
*
* The remaining Wchar fields of each ligature that is shorter than
* the maximum length defined may be filled up with 0xF000, or with
* WCH_NONE as defined in kbd.h, or NONE if defined in the custom header.
* These entries may be shortened, especially when the ligatures table
* is not edited in a single spreadsheet table.
What's new for me is that "sometimes"
[scare quotes], the ligature length must not exceed four characters. I already knew what's written in the MSKLC Help about this topic, and I explained in my previous e-mail that, when the MSKLC was built, Windows did not support more than four characters per ligature. (That's the only straightforward explanation of this point of the MSKLC.) As this proved to be insufficient, Microsoft must have decided to raise the limit to sixteen. Following some advice I programmed a ligature of 35 characters on Sat Feb 28, 2015; thus I've got the opportunity of seeing that 16 characters were inserted, while the overflow was discarded. Windows worked normally. The mapped virtual key was VK_OEM_COMMA, and the modification number was 8, that was Shift+AltGr+Kana (which it is still today). After the recent event, we may add the following:
* CAUTION: THE INITIAL MAXIMUM LENGTH OF LIGATURES WAS 4 CHARACTERS.
* DEPENDING ON THE SHIFT STATE, IT MAY HAPPEN THAT LONGER LIGATURES
* CAUSE THE KEYBOARD DRIVER TO FAIL.
* IF THE FAILING DRIVER IS THE ONE OF THE DEFAULT KEYBOARD LAYOUT,
* THE SYSTEM MAY NOT WORK AS EXPECTED.
I'm hopeful that you will agree upon this formulation, and I hope that helps. Best regards, Marcel Schneider On 03 Aug 2015, at 20:02, Doug Ewell wrote: > Marcel Schneider wrote: > > > The bug on Windows I encountered at the end of July has been > > definitely identified and reconstructed. After ninety-five drivers > > compiled since the bug appeared, I can tell so much as that the > > problem is related to the length of the so-called ligatures. When the > > MSKLC was built, they were limited to four characters on Windows (see > > glossary in the MSKLC Help). On my machine the maximal length is 16 > > characters. The problem is that this is not equal on all shift states > > and perhaps keys. Roughly, I can put five characters on modification > > number three, that is normally AltGr, but not on #4 (Shift+AltGr).
> > As far as I can tell, the limit for a ligature on a Windows keyboard > layout is four UTF-16 code points: > > MSKLC help, under "Validation Reference": > "Ligatures cannot contain more than four UTF-16 code points" > > Presentation from IUC 23 by Michael Kaplan (author of MSKLC) and Cathy > Wissink: > http://tinyurl.com/o49r4bz > > KbdEdit: > http://www.kbdedit.com/ > > MUFI: > http://folk.uib.no/hnooh/mufi/keyboards/WinXPkeyboard.html > > I understand that there are some tools (such as Keyboard Layout Manager) > that claim a higher limit, and it may even be possible in some cases to > assign more than four, but the DOCUMENTED limit appears to be four. (If > you claim that it is not, please provide a link to the relevant official > documentation, and note that C++ code showing 16 fields is not > documentation.) > > It is not a bug for software to fail to perform BEYOND its documented > limits. > > Since you are so very eager to declare this a bug, or a collection of > bugs, rather than a design limitation, I strongly recommend you get in > touch with Microsoft Technical Support and express your concerns to > them. Make sure to let them know just how certain you are that these are > bugs. See if they'll send you a T-shirt. > > -- > Doug Ewell | http://ewellic.org | Thornton, CO > > > From doug at ewellic.org Thu Aug 6 11:00:21 2015 From: doug at ewellic.org (Doug Ewell) Date: Thu, 06 Aug 2015 09:00:21 -0700 Subject: Windows keyboard restrictions Message-ID: <20150806090021.665a7a7059d7ee80bb4d670165c8327d.dee6bc8c58.wbe@email03.secureserver.net> Marcel Schneider wrote: > A part of the documentation you request is available: > > "Download Windows Driver Kit Version 7.1.0 from Official Microsoft > Download Center." N. p., 1 Dec. 2014. Web. 1 Dec. 2014. > > https://www.microsoft.com/en-us/download/details.aspx?id=11800 > > C:\WinDDK\7600.16385.1\inc\api\kbd.h > > Line 469, and preceding.
This snippet of code -- again, code is not documentation -- is the only place in the entire DDK that gives any indication that anyone thought a ligature of more than 4 code points could be valid:
> #define TYPEDEF_LIGATURE(n) typedef struct _LIGATURE##n { \
> BYTE VirtualKey; \
> WORD ModificationNumber; \
> WCHAR wch[n]; \
> } LIGATURE##n, *KBD_LONG_POINTER PLIGATURE##n;
> [...]
> TYPEDEF_LIGATURE(5) // LIGATURE5, *PLIGATURE5;
No code within the DDK, including the samples, appears to use TYPEDEF_LIGATURE(5) or any larger value. So I don't see any evidence in the code that the DDK actually supports ligatures longer than 4 code points. > To circumvent the issues arising from the word "bug", we may simply > ban that and focus on a few comments:
> * Ligature is an internal name for Wchar sequences that are
> * generated when a specified key is pressed.
> [...]
> * The maximum length of ligatures is 16 characters.
> * Characters from 17th on are discarded.
I can't find this text anywhere within the DDK (not even the substrings "Wchar sequences" or "length of ligatures"), unless for some reason it's in UTF-16 encoded text. So I also don't see any documentation that the DDK supports ligatures longer than 4 code points. > What's new for me, is that "sometimes" [scare quotes], the ligature > length must not exceed four characters. I already knew what's > written in the MSKLC Help about this topic, and I explained in my > previous e-mail that, when the MSKLC was built, Windows did not > support more than four characters per ligature. (That's the only > straightforward explanation of this point of the MSKLC.) As this > proved to be insufficient, Microsoft must have decided to raise the > limit to sixteen. Speculation is also not documentation. Seriously, please take this to Microsoft or to one of the forums where the Driver Development Kit is discussed. This has nothing to do with Unicode. -- Doug Ewell | http://ewellic.org | Thornton, CO
From richard.wordingham at ntlworld.com Thu Aug 6 12:32:32 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 6 Aug 2015 18:32:32 +0100 Subject: Windows keyboard restrictions In-Reply-To: <20150806090021.665a7a7059d7ee80bb4d670165c8327d.dee6bc8c58.wbe@email03.secureserver.net> References: <20150806090021.665a7a7059d7ee80bb4d670165c8327d.dee6bc8c58.wbe@email03.secureserver.net> Message-ID: <20150806183232.6f2e4b5e@JRWUBU2> On Thu, 06 Aug 2015 09:00:21 -0700 "Doug Ewell" wrote: > Seriously, please take this to Microsoft or to one of the forums where > the Driver Development Kit is discussed. This has nothing to do with > Unicode. That depends on the availability of Tavultesoft Keyman. The UK has been discussing whether a certain user-perceived character should be encoded as a single character in a new script. Users ought to have this character on their keyboards, but there is a worry about technical problems if it is encoded as a sequence of three characters, i.e. six UTF-16 code units. If Windows easily supports a ligature of six UTF-16 code units, then one argument for encoding it is eliminated. Richard. From doug at ewellic.org Thu Aug 6 12:56:51 2015 From: doug at ewellic.org (Doug Ewell) Date: Thu, 06 Aug 2015 10:56:51 -0700 Subject: Windows keyboard restrictions Message-ID: <20150806105651.665a7a7059d7ee80bb4d670165c8327d.8ed83d4adc.wbe@email03.secureserver.net> Richard Wordingham wrote: > The UK has been discussing whether a certain user-perceived character > should be encoded as a single character in a new script. Users ought > to have this character on their keyboards, but there is a worry about > technical problems if it is encoded as a sequence of three characters, > i.e. six UTF-16 code units. What is this character? Is it currently encoded as three SMP characters? What are they? -- Doug Ewell | http://ewellic.org | Thornton, CO ???? 
From jcb+unicode at inf.ed.ac.uk Thu Aug 6 13:08:14 2015 From: jcb+unicode at inf.ed.ac.uk (Julian Bradfield) Date: Thu, 6 Aug 2015 19:08:14 +0100 (BST) Subject: Windows keyboard restrictions References: <20150806090021.665a7a7059d7ee80bb4d670165c8327d.dee6bc8c58.wbe@email03.secureserver.net> <20150806183232.6f2e4b5e@JRWUBU2> Message-ID: On 2015-08-06, Richard Wordingham wrote: > That depends on the availability of Tavultesoft Keyman. The UK has been > discussing whether a certain user-perceived character should be encoded > as a single character in a new script. Users ought to have this > character on their keyboards, but there is a worry about technical > problems if it is encoded as a sequence of three characters, i.e. six > UTF-16 code units. If Windows easily supports a ligature of six UTF-16 > code units, then one argument for encoding it is eliminated. Unicode is supposed to be for the (sadly probably rather short) life of human civilization, until we have no more need for text. Using an ephemeral property of an ephemeral operating system for ephemeral computers in an encoding argument makes no sense. -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From charupdate at orange.fr Thu Aug 6 15:09:32 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Thu, 6 Aug 2015 22:09:32 +0200 (CEST) Subject: Emoji characters for food allergens In-Reply-To: <55BFF817.2070401@kli.org> References: <1310487596.11828.1438612211991.JavaMail.www@wwinf1g28> <55BFF817.2070401@kli.org> Message-ID: <148825130.21047.1438891772329.JavaMail.www@wwinf1j18> On 04 Aug 2015, at 01:24, Mark E. Shoulson wrote: > if you're interested in this kind of pictographic pidgin, take a look at https://www.kwikpoint.com/ Someone already did some of it. Personally I can't do, nor support thoroughly, anything about pictographic language. I just answer some e-mails.
I'm too busy with implementing a relatively modest subset of Unicode at keyboard driver level. But it's indeed very interesting to learn about this already thriving publishing, and I'm glad that even human lives have been saved thanks to this new way to overcome the language barrier. So as this post refers to one of my replies, I assume the task of thanking Mr Shoulson for this information. I note that Kwikpoint Instructional Systems are used notably when there is no means of calling a translator, as a pragmatic approach like in the example on the home page, where body language is used to complete the item on which it is understandable. By a lucky coincidence, this example has been performed in the country of ancient Babel. Applying the point-to-picture method to food allergens, one could wish to point to skull-and-crossbones, then to an ear of wheat or a loaf of bread, then to an egg, then to a cheese wedge or a glass of milk, then to a lupin flower, then to some kinds of nuts, finishing with skull-and-crossbones again. Because as Mr Freytag points out, the allergen meaning of a food symbol cannot be induced safely enough. And as he explains, it's desirable that the needed symbols be at least highly iconic, and ideally regulated by other standards bodies than Unicode. I do wish that Kwikpoint be so successful that the symbols it creates for missing items become widely popular. Best regards, Marcel On 04 Aug 2015, at 01:24, Mark E. Shoulson wrote: > On 08/03/2015 10:30 AM, Marcel Schneider wrote: > > On 29 Jul 2015, at 15:42, William_J_G Overington wrote: > > > >> Emoji seemed like a wonderful way to achieve communication through the language barrier. > > We remember that Esperanto was also a hopeful way to unify the language, raising much enthusiasm among its followers. IMHO a pictograph based script can hardly be enough performing, unless it ends up to become a kind of new Esperanto except that it doesn't include speech.
> > It's already noted that this is totally out of scope for Unicode, but if > you're interested in this kind of pictographic pidgin, take a look at > https://www.kwikpoint.com/ Someone already did some of it. > > ~mark > > From charupdate at orange.fr Thu Aug 6 15:59:20 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Thu, 6 Aug 2015 22:59:20 +0200 (CEST) Subject: Emoji characters for food allergens In-Reply-To: <27139592.47254.1438882560402.JavaMail.defaultUser@defaultHost> References: <1310487596.11828.1438612211991.JavaMail.www@wwinf1g28> <55BFF817.2070401@kli.org> <27139592.47254.1438882560402.JavaMail.defaultUser@defaultHost> Message-ID: <640641349.21541.1438894760208.JavaMail.www@wwinf1j28> I believe that standardizing A numbers for allergens, as Mr Overington suggests commenting the interesting blog post he's sharing, is an excellent idea. I think so because, should these numbers be added to the names, this would be even better than the E numbers, behind which often the additives are hidden, some of which have long-term harmful effects, and the continuous consumption of most of which causes overall health issues. Consequently, there are good solutions on-going, so that the support Unicode cannot safely provide will be replaced. Best regards, Marcel > Message of 06/08/15 19:36 > From: "William_J_G Overington" > To: "Marcel Schneider" , mark at kli.org, komatsu at google.com, unicode at unicode.org, gwalla at gmail.com > Cc: > Subject: Re: Emoji characters for food allergens > > Please may I draw to your attention the following blog post. > > http://www.michellesblog.co.uk/emoji-ing-food-allergens/ > > The blog is by the same lady that runs the following website, a specialist website about food allergens and freefrom food. > > http://www.foodsmatter.com/ > > William Overington > > 6 August 2015 > > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From richard.wordingham at ntlworld.com Thu Aug 6 16:56:46 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 6 Aug 2015 22:56:46 +0100 Subject: Windows keyboard restrictions In-Reply-To: <20150806105651.665a7a7059d7ee80bb4d670165c8327d.8ed83d4adc.wbe@email03.secureserver.net> References: <20150806105651.665a7a7059d7ee80bb4d670165c8327d.8ed83d4adc.wbe@email03.secureserver.net> Message-ID: <20150806225646.46f94c6d@JRWUBU2> On Thu, 06 Aug 2015 10:56:51 -0700 "Doug Ewell" wrote: > Richard Wordingham wrote: > > > The UK has been discussing whether a certain user-perceived > > character should be encoded as a single character in a new script. > > Users ought to have this character on their keyboards, but there is > > a worry about technical problems if it is encoded as a sequence of > > three characters, i.e. six UTF-16 code units. > > What is this character? Is it currently encoded as three SMP > characters? What are they? It's part of an unencoded, living script. There is no suitable contiguous place for the script in the BMP. There is a set of characters within the script that appear to be sequences of three characters, and encoding these characters as single elements almost makes about as much sense as encoding English on the basis that it represents the sound [hw], not the sound [wh]. Several of the sequences of three characters occur in the region's language of high culture and religion, which apparently is also written in the script. The 'UK has been discussing' means there has been discussion of what position the UK should take over this set of characters in the ISO 10646 amendment process. Richard. From doug at ewellic.org Thu Aug 6 17:31:57 2015 From: doug at ewellic.org (Doug Ewell) Date: Thu, 06 Aug 2015 15:31:57 -0700 Subject: Windows keyboard restrictions Message-ID: <20150806153157.665a7a7059d7ee80bb4d670165c8327d.3746d61c4e.wbe@email03.secureserver.net> Richard Wordingham wrote: > It's part of an unencoded, living script. 
There is no suitable > contiguous place for the script in the BMP. There is a set of > characters within the script that appear to be sequences of three > characters, and encoding these characters as single elements almost > makes about as much sense as encoding English on the basis that > it represents the sound [hw], not the sound [wh]. Several of the > sequences of three characters occur in the region's language of high > culture and religion, which apparently is also written in the script. If this is about murmured consonants in Newa, the arguments presented in L2/14-281, both for and against, seem more relevant than whether a cluster of three SMP characters can fit on a single key in a Windows keyboard layout. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From wjgo_10009 at btinternet.com Thu Aug 6 12:36:00 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Thu, 6 Aug 2015 18:36:00 +0100 (BST) Subject: Emoji characters for food allergens In-Reply-To: <55BFF817.2070401@kli.org> References: <1310487596.11828.1438612211991.JavaMail.www@wwinf1g28> <55BFF817.2070401@kli.org> Message-ID: <27139592.47254.1438882560402.JavaMail.defaultUser@defaultHost> Please may I draw to your attention the following blog post. http://www.michellesblog.co.uk/emoji-ing-food-allergens/ The blog is by the same lady that runs the following website, a specialist website about food allergens and freefrom food.. http://www.foodsmatter.com/ William Overington 6 August 2015 From philip_chastney at yahoo.com Fri Aug 7 05:01:55 2015 From: philip_chastney at yahoo.com (philip chastney) Date: Fri, 7 Aug 2015 03:01:55 -0700 Subject: Windows keyboard restrictions Message-ID: <1438941715.78626.YahooMailBasic@web181504.mail.ne1.yahoo.com> -------------------------------------------- On Thu, 6/8/15, Julian Bradfield wrote: .> On 2015-08-06, Richard Wordingham > > wrote: > > That depends on the availability of Tavultesoft Keyman.? 
The UK has been > > discussing whether a certain user-perceived character should be encoded > > as a single character in a new script. Users ought to have this > > character on their keyboards, but there is a worry about technical > > problems if it is encoded as a sequence of three characters, i.e. six > > UTF-16 code units. If Windows easily supports a ligature of six UTF-16 > > code units, then one argument for encoding it is eliminated. > Unicode is supposed to be for the (sadly probably rather short) life > of human civilization, until we have no more need for text. Using an > ephemeral property of an ephemeral operating system for ephemeral > computers in an encoding argument makes no sense. requirements, too, can be ephemeral the Oxford English Dictionary aims to include every word in "general use" since Chaucer, where "general use" means it was continuously used in that sense for a minimum of 10 years (or something along those lines) when "ghettoblaster" was included, the story made it into the newspapers -- when did you last even see a ghettoblaster? but still, a definition may be useful for somebody in fifty years' time writing a survey of English novels from the 1980s, so the word's inclusion is justified I also remember last Christmas being surprised to see a dingbat in use -- will all those dingbats in Unicode be of use in a few years' time? will emoji?
/phil From doug at ewellic.org Fri Aug 7 11:26:56 2015 From: doug at ewellic.org (Doug Ewell) Date: Fri, 07 Aug 2015 09:26:56 -0700 Subject: Windows keyboard restrictions Message-ID: <20150807092656.665a7a7059d7ee80bb4d670165c8327d.154138f225.wbe@email03.secureserver.net> Michael Kaplan, author of MSKLC, reports that not only is the limit on UTF-16 code points in a Windows keyboard ligature still 4, it is likely to remain 4 for the foreseeable future: http://www.siao2.com/2015/08/07/8770668856267196989.aspx "People who want input methods capable of handling more than four UTF-16 code points really need to look into IMEs (Input Method Editors) which are all now run through TSF (the Text Services Framework), a completely different system of input that allows such things, admittedly at the price of a lot of complexity." This should settle the matter. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From richard.wordingham at ntlworld.com Fri Aug 7 13:54:15 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 7 Aug 2015 19:54:15 +0100 Subject: Windows keyboard restrictions In-Reply-To: <20150807092656.665a7a7059d7ee80bb4d670165c8327d.154138f225.wbe@email03.secureserver.net> References: <20150807092656.665a7a7059d7ee80bb4d670165c8327d.154138f225.wbe@email03.secureserver.net> Message-ID: <20150807195415.4c725da4@JRWUBU2> On Fri, 07 Aug 2015 09:26:56 -0700 "Doug Ewell" wrote: > Michael Kaplan, author of MSKLC, reports that not only is the limit on > UTF-16 code points in a Windows keyboard ligature still 4, it is > likely to remain 4 for the foreseeable future: > > http://www.siao2.com/2015/08/07/8770668856267196989.aspx It's good to see he's still with us. 
> "People who want input methods capable of handling more than four > UTF-16 code points really need to look into IMEs (Input Method > Editors) which are all now run through TSF (the Text Services > Framework), a completely different system of input that allows such > things, admittedly at the price of a lot of complexity." What we're waiting for is a guide we can follow, or some code we can ape. Such should be, or should have been, available in a Tavultesoft Keyman rip-off. In the mean time, I notice Micha Kaplan's comment: "even if there were, such a keyboard layout would not be compatible with any prior version of Windows;" I think that is exactly what Marcel Schneider encountered. Note further that Micha implied that he got the specification by reading a header file, exactly the sort of documentation you disallowed. The data structure (field cbLgEntry) allows for arbitrary lengths; its precise semantics may have been established by experiment. It is possible that it may have been broken for arbitrary sizes and has now been fixed. > This should settle the matter. MSKLC doesn't seem to be liked by Microsoft. Quite possibly they would like to get rid of the interface its keyboards generate. Supporting such user-defined keyboards may just be an overhead for them. Any comment from the Microsoft employees? Richard. 
From charupdate at orange.fr Fri Aug 7 15:40:55 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Fri, 7 Aug 2015 22:40:55 +0200 (CEST) Subject: Windows keyboard restrictions In-Reply-To: <20150807092656.665a7a7059d7ee80bb4d670165c8327d.154138f225.wbe@email03.secureserver.net> References: <20150807092656.665a7a7059d7ee80bb4d670165c8327d.154138f225.wbe@email03.secureserver.net> Message-ID: <1609214653.16412.1438980055712.JavaMail.www@wwinf1h15> On 07 Aug 2015, at 18:38, Doug Ewell wrote: > Michael Kaplan, author of MSKLC, reports that not only is the limit on > UTF-16 code points in a Windows keyboard ligature still 4, it is likely > to remain 4 for the foreseeable future: > > http://www.siao2.com/2015/08/07/8770668856267196989.aspx > > "People who want input methods capable of handling more than four UTF-16 > code points really need to look into IMEs (Input Method Editors) which > are all now run through TSF (the Text Services Framework), a completely > different system of input that allows such things, admittedly at the > price of a lot of complexity." > > This should settle the matter. I wouldn't have made a "battle" of that. Please note that I'm quoting somebody else; these quotes cannot be mistaken for scare quotes (which BTW would probably have been more appropriate, and thus more expected). And I wouldn't have answered any more. I just don't want to let the Mailing List believe that I agreed to being classified as "fighting the [bad] fight", if not even as a "bad boy" (which isn't quoted from here). So unfortunately I can't help replying again. For all "documentation", this somewhat vulgar blog post that is being shared cites internal references (other blog posts from the same author on the same web site). The header file it refers to remains unquoted and unlinked. Thus, this blog post is biased with the authority bias.
I'm not quite sure whether people are conscious that by contesting the accuracy of the original actual Windows keyboard driver header file (kbd.h), they are insulting the developer(s) who wrote it, as well as the company that stands behind him/them. For not wanting to make anybody lose face, I didn't mention that a copy of the cited and quoted header file is included in the MSKLC. Version 1.4 of the latter dates from Thu, Jan 25, 2007, 23:14:22, whereas the included kbd.h shows "10-Jan-1991 GregoryW" in the file history. Therefore, my supposition (I hadn't looked that up!) that "when the MSKLC was built, Windows did not support more than four characters per ligature. (That's the only straightforward explanation of this point of the MSKLC.)" turns out to be completely wrong (except the parenthesized disclaimer). I could become more explicit, but I just stand away in order not to heat up the discussion with ad personam conclusions. At this unexpected point of the thread, I'm extremely sickened. At the same time, the shared blog post helps me to understand a bit better some asperities of the overall (most of the time) rather sympathetic MSKLC. I often wondered why the description page [https://msdn.microsoft.com/fr-fr/goglobal/bb964665.aspx] and, still less, the download page [https://www.microsoft.com/en-us/download/details.aspx?id=22339] have not been updated (no mention of Windows 8 on the former, no mention even of Windows 7 in the system requirements on the download page), and why there's no 2.0 version of the MSKLC. Most times I answered to myself that the little interest on users' side discouraged Microsoft from investing in such an update. That's now to be revised. I'd never imagined that a limitation in the MSKLC (not the only, but the most striking one) could be justified and defended the way it is. IMO it would have been wise to limit this thread to "Ligature length on Windows".
Now that it extended to all "Windows keyboard limitations", let's extend a bit more to prevent further disruptions. I'm not here to criticize Microsoft. I ask everybody to be honest and to answer for himself one single question: How on earth could I prefer Bing if I were battling against Microsoft? Does anybody really believe that I enjoy finding more bugs? So please remember that over time, the Redmond company got the unlucky reputation of not listening to its users. I've got the strong hope that this tendency has been reversed, but I still believe that as far as Unicode implementation is concerned, the Unicode Mailing List is one of the best places to send the topic. I still believe it today, as this thread has taught me a lot. Hopeful that this will end in a constructive way, Marcel Schneider -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Fri Aug 7 15:59:16 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Fri, 7 Aug 2015 22:59:16 +0200 (CEST) Subject: Windows keyboard restrictions In-Reply-To: <20150807195415.4c725da4@JRWUBU2> References: <20150807092656.665a7a7059d7ee80bb4d670165c8327d.154138f225.wbe@email03.secureserver.net> <20150807195415.4c725da4@JRWUBU2> Message-ID: <1166060353.16668.1438981156446.JavaMail.www@wwinf1h15> On 07 Aug 2015, at 21:04, Richard Wordingham wrote: > On Fri, 07 Aug 2015 09:26:56 -0700 > "Doug Ewell" wrote: > > > Michael Kaplan, author of MSKLC, reports that not only is the limit on > > UTF-16 code points in a Windows keyboard ligature still 4, it is > > likely to remain 4 for the foreseeable future: > > > > http://www.siao2.com/2015/08/07/8770668856267196989.aspx > > It's good to see he's still with us. I had a high opinion of the authors of the MSKLC. And I still had when I learned here that MSKLC is the work of a single author. I won't say more.
> > > "People who want input methods capable of handling more than four > > UTF-16 code points really need to look into IMEs (Input Method > > Editors) which are all now run through TSF (the Text Services > > Framework), a completely different system of input that allows such > > things, admittedly at the price of a lot of complexity." Referring people to complex IMEs while a simple solution is (or could be) available at little expense is symptomatic of user-unfriendly software management. I brought the good news that SIXTEEN UNICODE CODE POINTS can be generated by a single key stroke on Windows six dot one. The only bad news, because of which I e-mailed the List, is that it wasn't working in one single circumstance. It was obvious that the main thing to do is to inform about this fact, so that other people needn't search for a bug in the driver if it's only that. > > What we're waiting for is a guide we can follow, or some code we can > ape. Such should be, or should have been, available in a Tavultesoft > Keyman rip-off. > > In the mean time, I notice Micha Kaplan's comment: > > "even if there were, such a keyboard layout would not be compatible with > any prior version of Windows;" > > I think that is exactly what Marcel Schneider encountered. Not really. We are talking about a ligature feature that was programmed in 1991. So it may be that the same event is likely to occur on Windows Seven and later. But Mr Kaplan is addressing, as "prior", Windows up to Eight (dot one). > Note further that Micha implied that he got the specification by reading a > header file, exactly the sort of documentation you disallowed. > > The data structure (field cbLgEntry) allows for arbitrary lengths; its > precise semantics may have been established by experiment.
Without any false modesty I can tell that I established a limit as far as my machine is concerned, and that this limit is 16 characters per ligature; now I stated some exception but that doesn't invalidate the principle. To say it all, I have actually one ligature with 16 characters, one with 15, about one with 7 and so on. > It is possible that it may have been broken for arbitrary sizes and has now > been fixed. > > > This should settle the matter. > > MSKLC doesn't seem to be liked by Microsoft. Quite possibly they would > like to get rid of the interface its keyboards generate. Supporting > such user-defined keyboards may just be an overhead for them. Any > comment from the Microsoft employees? I'm impatient to read this comment, and I'm joining my expectations to Mr Wordingham's. Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Fri Aug 7 16:34:47 2015 From: doug at ewellic.org (Doug Ewell) Date: Fri, 07 Aug 2015 14:34:47 -0700 Subject: Windows keyboard restrictions Message-ID: <20150807143447.665a7a7059d7ee80bb4d670165c8327d.2c6723ffdc.wbe@email03.secureserver.net> Richard Wordingham wrote: > It's good to see he's still with us. Still out there, just not on this list. > What we're waiting for is a guide we can follow, or some code we can > ape. Such should be, or should have been, available in a Tavultesoft > Keyman rip-off. I guess you mean that such a guide would expose the Windows infrastructure to help explain the inner workings of the third-party software. That would be informative only to the extent it was accurate. > In the mean time, I notice Micha Kaplan's comment: > > "even if there were, such a keyboard layout would not be compatible > with any prior version of Windows;" > > I think that is exactly what Marcel Schneider encountered. Sort of. 
Michael was saying that IF the Windows limit had been increased from 4 to something higher, and someone implemented a keyboard taking advantage of that higher limit, it would not work on older versions. Marcel implemented a keyboard that took advantage of a higher limit which never existed on Windows, so it doesn't work on ANY version.

But wait! Didn't he say it worked some of the time, with some shift states but not others? Sure it did, for the same reason that a buffer overrun in C doesn't always cause a program crash or a security hole. Sometimes, if you're lucky, the memory being overwritten doesn't contain critical data at the time of the overwrite. Sometimes you're not so lucky.

> Note further that Michael implied that he got the specification by
> reading a header file, exactly the sort of documentation you
> disallowed.

I wasn't looking for documentation that the well-known limit of 4 existed in the first place, or had not been changed. I was looking for documentation that it HAD been changed. That's where the burden of proof lies. Michael probably has more extensive expert knowledge of the Windows keyboard subsystem than anyone else, which is why I asked him.

> The data structure (field cbLgEntry) allows for arbitrary lengths; its
> precise semantics may have been established by experiment. It is
> possible that it may have been broken for arbitrary sizes and has now
> been fixed.

I don't know what "has now been fixed" means. I haven't seen any evidence that anything about this has changed since the '90s.

> MSKLC doesn't seem to be liked by Microsoft. Quite possibly they would
> like to get rid of the interface its keyboards generate. Supporting
> such user-defined keyboards may just be an overhead for them.

I doubt they ever have to provide support for user-defined keyboards. I see that MSKLC itself "is distributed 'as is', with no obligations or technical support from Microsoft Corporation."
If we're speculating on Microsoft's intent, my guess is that the move to TSF is some sort of attempt to consolidate desktop, tablet, and phone keyboard behavior into a single framework. I confess I don't know much about TSF.

> Any comment from the Microsoft employees?

--
Doug Ewell | http://ewellic.org | Thornton, CO ????

From doug at ewellic.org Fri Aug 7 17:01:54 2015
From: doug at ewellic.org (Doug Ewell)
Date: Fri, 07 Aug 2015 15:01:54 -0700
Subject: Windows keyboard restrictions
Message-ID: <20150807150154.665a7a7059d7ee80bb4d670165c8327d.43590305ac.wbe@email03.secureserver.net>

Marcel Schneider wrote:

> I just don't want to let the Mailing List believe that I agreed to being
> classified as "fighting the [bad] fight"

And I don't think Michael implied that. I just want to get the technical facts, so that hopefully they can take the place of speculation and presumptions of buggy behavior.

Please note that "bug" is not necessarily a bad word -- we developers create real bugs all the time, and hopefully own up to them -- but it rankles when someone applies the word to software that works as designed, when that person either doesn't understand or doesn't agree with the intended behavior.

> Thus, this blog post is biased with the authority bias.

It's from someone at Microsoft with expert knowledge of the Windows keyboard subsystem, if that's what you mean.

> I'm not quite sure whether people are conscious that by contesting the
> accuracy of the original actual Windows keyboard driver header file
> (kbd.h), they are insulting the developer(s) who wrote it, as well as
> the company that stands behind him/them.

kbd.h contains exactly zero examples of keyboards with ligatures of more than 4 code points. I downloaded and installed the whole DDK just to find this out, not realizing I already had a copy in my MSKLC folder.

> For not wanting to make anybody lose face, I didn't mention that a
> copy of the cited and quoted header file is included in the MSKLC.
Yep, I could have saved a lot of time if I'd noticed that.

> The version 1.4 of which dates from Thu, Jan 25, 2007, 23:14:22,
> whereas the included kbd.h shows "10-Jan-1991 GregoryW" in the file
> history.

My copy says:

@@BEGIN_DDKSPLIT
 * 10-Jan-1991 GregoryW
 * 23-Apr-1991 IanJa VSC_TO_VK _* macros from oemtab.c
@@END_DDKSPLIT

Looks like things have been pretty stable since 1991.

> Therefore, my supposition (I hadn't looked that up!) that "when the
> MSKLC was built, Windows did not support more than four characters per
> ligature. (That's the only straightforward explanation of this point
> of the MSKLC.)" turns out to be completely wrong (except the
> parenthesized disclaimer). I could become more explicit, but I just
> stand back in order not to heat up the discussion with ad personam
> conclusions.

Good idea.

What led you to the conclusion that this limit had been increased, anyway? ("On my machine the maximal length is 16 characters.") I'm still curious about that.

> I often wondered why the description page
> [https://msdn.microsoft.com/fr-fr/goglobal/bb964665.aspx] and even
> more the download page
> [https://www.microsoft.com/en-us/download/details.aspx?id=22339] have
> not been updated (no mention of Windows 8 on the former, no mention
> even of Windows 7 in the system requirements on the download page),
> and why there's no 2.0 version of the MSKLC.

Microsoft simply hasn't dedicated any resources (Michael or anyone else) to updating MSKLC. Michael has blogged about this many, many times in the past few years. Big companies make the decisions that they make, for the reasons they have.

> I'm not here to criticize Microsoft. I ask everybody to be honest and
> to answer for himself one single question: How on earth could I prefer
> Bing if I were battling against Microsoft? Does anybody really believe
> that I'm annoying myself to find more bugs?

I apologize for my tone in this thread.
See my explanation above of when "bug" is an appropriate conclusion to draw, and when it isn't. That got me started.

> Hopeful that this will end in a constructive way,

Agreed.

--
Doug Ewell | http://ewellic.org | Thornton, CO ????

From doug at ewellic.org Fri Aug 7 17:21:16 2015
From: doug at ewellic.org (Doug Ewell)
Date: Fri, 07 Aug 2015 15:21:16 -0700
Subject: Windows keyboard restrictions
Message-ID: <20150807152116.665a7a7059d7ee80bb4d670165c8327d.f3e8563f5d.wbe@email03.secureserver.net>

Marcel Schneider wrote:

> I brought the good news that SIXTEEN UNICODE CODE POINTS can be
> generated by a single key stroke on Windows six dot one. The only bad
> news, because of which I've e-mailed to the List, is that that wasn't
> working in one single circumstance. It was obvious that the main thing
> to do, is to inform about this fact, so that other people mustn't
> search for a bug in the driver if it's only that.

But that's what I've been trying to say. The maximum isn't 16, it's 4. "That wasn't working" is the expected behavior here.

If you were able to create a keyboard layout where 16 code points ever worked on Windows 7 (which reports itself as "6.1"), it was purely by accident -- because Windows 7 did not check for the overrun, and because the overrun did not happen to cause any collateral damage.

If you have a light bulb that's rated for 110 volts, and you apply 220 volts to it and for some reason the bulb doesn't burn out immediately, that doesn't mean 220 volts is the correct operating environment for that bulb. It means you got lucky.

If there's a bug here, it's that Windows didn't detect that the limit had been exceeded, and respond by locking out the key.

--
Doug Ewell | http://ewellic.org | Thornton, CO ????
From Andrew.Glass at microsoft.com Fri Aug 7 19:11:46 2015
From: Andrew.Glass at microsoft.com (Andrew Glass (WINDOWS))
Date: Sat, 8 Aug 2015 00:11:46 +0000
Subject: Windows keyboard restrictions
In-Reply-To: <20150807152116.665a7a7059d7ee80bb4d670165c8327d.f3e8563f5d.wbe@email03.secureserver.net>
References: <20150807152116.665a7a7059d7ee80bb4d670165c8327d.f3e8563f5d.wbe@email03.secureserver.net>
Message-ID: 

Sorry to be late to this thread. I'm the Program Manager responsible for MSKLC at this time. As far as the history here, I can only reiterate Michael's point that making significant changes to user32.dll faces significant, perhaps insurmountable headwinds. There would have to be compelling reasons to make any kind of changes here. If you have specific feedback for Microsoft on this issue, please follow up with me off line.

Thanks,

Andrew Glass

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Doug Ewell
Sent: Friday, August 7, 2015 3:21 PM
To: Unicode Mailing List
Cc: Marcel Schneider
Subject: Re: Windows keyboard restrictions

Marcel Schneider wrote:

> I brought the good news that SIXTEEN UNICODE CODE POINTS can be
> generated by a single key stroke on Windows six dot one. The only bad
> news, because of which I've e-mailed to the List, is that that wasn't
> working in one single circumstance. It was obvious that the main thing
> to do, is to inform about this fact, so that other people mustn't
> search for a bug in the driver if it's only that.

But that's what I've been trying to say. The maximum isn't 16, it's 4. "That wasn't working" is the expected behavior here.

If you were able to create a keyboard layout where 16 code points ever worked on Windows 7 (which reports itself as "6.1"), it was purely by accident -- because Windows 7 did not check for the overrun, and because the overrun did not happen to cause any collateral damage.
If you have a light bulb that's rated for 110 volts, and you apply 220 volts to it and for some reason the bulb doesn't burn out immediately, that doesn't mean 220 volts is the correct operating environment for that bulb. It means you got lucky.

If there's a bug here, it's that Windows didn't detect that the limit had been exceeded, and respond by locking out the key.

--
Doug Ewell | http://ewellic.org | Thornton, CO ????

From lang.support at gmail.com Sat Aug 8 02:05:26 2015
From: lang.support at gmail.com (Andrew Cunningham)
Date: Sat, 8 Aug 2015 17:05:26 +1000
Subject: Windows keyboard restrictions
In-Reply-To: <20150807195415.4c725da4@JRWUBU2>
References: <20150807092656.665a7a7059d7ee80bb4d670165c8327d.154138f225.wbe@email03.secureserver.net> <20150807195415.4c725da4@JRWUBU2>
Message-ID: 

On Saturday, 8 August 2015, Richard Wordingham <richard.wordingham at ntlworld.com> wrote:

Michael did do a series of blog posts on building TSF-based input methods years ago. Something I tinkered with off and on.

> What we're waiting for is a guide we can follow, or some code we can
> ape. Such should be, or should have been, available in a Tavultesoft
> Keyman rip-off.

I don't believe in rip-offs, especially when there are free versions and the enhanced version doesn't cost much.

But that said, there is KMFL on Linux, which handles a subset of the Keyman definition files. And Keith Stribley, before he died, did a port of the KMFL lib to Windows. But I doubt anyone is maintaining it.

The reality is that the use cases discussed in this and related threads do not need particularly complex or sophisticated layouts. So KMFL and its derivatives should be fine, despite how limited I consider them. Alternatively, there is a range of input frameworks developed in SE Asia that would be easy to work with as well.

Alternative input frameworks have been around for years. It's up to users to adopt them or not. I don't see much point bleating about the limitations of the win32 keyboard model.
Just use an alternative input framework, whether it is TSF table-based input, Keyman, the KMFL port to Windows, or any of the large number of input frameworks that are available out there.

Andrew

--
Andrew Cunningham
Project Manager, Research and Development (Social and Digital Inclusion)
Public Libraries and Community Engagement
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000
Australia

Ph: +61-3-8664-7430
Mobile: 0459 806 589
Email: acunningham at slv.vic.gov.au
lang.support at gmail.com

http://www.openroad.net.au/
http://www.mylanguage.gov.au/
http://www.slv.vic.gov.au/

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From richard.wordingham at ntlworld.com Sat Aug 8 05:05:06 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sat, 8 Aug 2015 11:05:06 +0100
Subject: Windows keyboard restrictions
In-Reply-To: 
References: <20150807092656.665a7a7059d7ee80bb4d670165c8327d.154138f225.wbe@email03.secureserver.net> <20150807195415.4c725da4@JRWUBU2>
Message-ID: <20150808110506.58ec0cc8@JRWUBU2>

On Sat, 8 Aug 2015 17:05:26 +1000
Andrew Cunningham <lang.support at gmail.com> wrote:

> Michael did do a series of blog posts on building TSF based input
> methods years ago. Something I tinkered with off and on.

Does this mean that one can put it all together from his reconstituted blog? I don't know how much was salvaged. Michael has publicly complimented Marc Durdin on being able to find his way through the published Microsoft documentation to make TSF work for him once Microsoft had fixed the bugs he had identified.

> > What we're waiting for is a guide we can follow, or some code we can
> > ape. Such should be, or should have been, available in a
> > Tavultesoft Keyman rip-off.

> I don't believe in rip-offs, especially when there are free versions and the
> enhanced version doesn't cost much.
> But that said, there is KMFL on Linux, which handles a subset of the
> Keyman definition files. And Keith Stribley, before he died, did a
> port of the KMFL lib to Windows. But I doubt anyone is maintaining it.

I was thinking that, at the very least, his package was working code that one could study. While the porting was morally questionable, I'm not aware of any issues with the code obtaining the keyboard input, discovering the current text context or delivering the text changes once derived.

> The reality is that the use cases discussed in this and related
> threads do not need particularly complex or sophisticated layouts. So KMFL
> and its derivatives should be fine, despite how limited I consider them.

Do very recent systems allow ibus input for one's password when logging in? On Ubuntu 12.04 I only see the keyboards defined via X, which only guarantee codepoint-by-codepoint input.

Application compatibility with KMFL has increased, but sophisticated layouts are liable to break. I have seen regressions. For example, when using an XSAMPA-inspired NFC-generating IPA keyboard layout that changes the characters sent (it uses backslash to cycle through sets of characters), rescinding characters has failed and the application has stored both sets of characters. Admittedly, the last time the problem came and went, the setup was a bit complex - I was using Ubuntu as the X client, Windows 7 as the X server, and using the X client to provide the IME. I should be thankful it ever works. I suspect the problem was in the application. Last month Google Docs wasn't working with the same IPA keyboard on Firefox on Ubuntu, though I don't know if it has ever worked - I don't have much occasion to type IPA in Google Docs.

> I don't see much point bleating about the limitations of the win32
> keyboard model. Just use an alternative input framework ..
> whether it is
> TSF table-based input, Keyman, the KMFL port to Windows or any of the
> large number of input frameworks that are available out there.

The interface structure used by the win32 DLL supports arbitrary (well, up to 60 at least) ligature lengths. Therefore it isn't obvious that 4 should be the maximum length, especially as I have seen code around that implies that the maximum length is extended by 3 in 'FE' versions.

4 *characters* isn't an unreasonable limit. However, we are now getting minor scripts in modern use that are encoded in the SMP, and for them the limit drops to 2 characters. They also lose the deadkey capability from MSKLC.

Richard.

From charupdate at orange.fr Sat Aug 8 05:56:40 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Sat, 8 Aug 2015 12:56:40 +0200 (CEST)
Subject: Windows keyboard restrictions
In-Reply-To: <20150807150154.665a7a7059d7ee80bb4d670165c8327d.43590305ac.wbe@email03.secureserver.net>
References: <20150807150154.665a7a7059d7ee80bb4d670165c8327d.43590305ac.wbe@email03.secureserver.net>
Message-ID: <159876547.3316.1439031400773.JavaMail.www@wwinf2232>

On 08 Aug 2015, at 00:18, Doug Ewell wrote:

> Marcel Schneider wrote:
>
> > I just don't want to let the Mailing List believe that I agreed to being
> > classified as "fighting the [bad] fight"
>
> And I don't think Michael implied that. I just want to get the technical
> facts, so that hopefully they can take the place of speculation and
> presumptions of buggy behavior.
>
> Please note that "bug" is not necessarily a bad word -- we developers
> create real bugs all the time, and hopefully own up to them -- but it
> rankles when someone applies the word to software that works as
> designed, when that person either doesn't understand or doesn't agree
> with the intended behavior.

Please excuse me for having referred to the phenomenon as a bug.
I acquired strong habits with that word at the time I was writing down my observations and sending a few of them to the Microsoft Community Answers Forum. Indeed, some concerned user-unfriendly design limitations, like the disabling of Ctrl+A in the Formula Bar of Excel, and others real bugs that have been fixed in the following version of Word, like the disabling of automatic word break when an apostrophe other than U+0027 is included, or the undue application of autocorrect after an automatic word break. But that's in the past. It's just to explain that the banned word was always the first thing that came to my mind. Well, sometimes even when the problem was that I didn't know how to use the software :-|

> > Thus, this blog post is biased with the authority bias.
>
> It's from someone at Microsoft with expert knowledge of the Windows
> keyboard subsystem, if that's what you mean.

Not at all... The problem is the bias, not the authority. I don't take an authority-contesting posture.

Let's take the example of somebody giving a talk with a presentation about keyboards on Windows. Depending on how the topic is labelled, if it's a general outline of the whole keyboard UI, he must speak about all possible modifiers, that is Shift, Ctrl, Alt, Alt+Ctrl (which is AltGr), and Kana, to quote just the easiest to implement. There is also Oyayubi (Right Oyayubi, Left Oyayubi), but that was for Fujitsu terminals and I don't see how it could work on Western keyboards; perhaps it does. But Kana is so obvious it is implementable in KbdEdit. It's very useful, along with its toggle KanaLock (VK_KANA). This is why the Help Glossary of the MSKLC cites the WDK and provides a download link. Then he must speak about chained dead keys, because that's a Windows-supported feature (which you can equally implement using KbdEdit). When it comes to dead keys, he is forbidden to make his audience believe that a dead key is just a combination of two keys, the first of which shows no effect.
He may say this to introduce the topic, but he mustn't end without mentioning that dead keys can be chained on Windows. Imagine the man giving such a talk not in Prague but in Hanoi. Will he wait for the Q&A to mention that Windows allows pressing two dead keys before a letter key to get letters with two diacritics, as they are used in Vietnamese and encoded in Unicode? If he does, it would be wise not to post the PowerPoint on the internet.

> > I'm not quite sure whether people are conscious that by contesting the
> > accuracy of the original actual Windows keyboard driver header file
> > (kbd.h), they are insulting the developer(s) who wrote it, as well as
> > the company that stands behind him/them.
>
> kbd.h contains exactly zero examples of keyboards with ligatures with
> more than 4 code points. I downloaded and installed the whole DDK just
> to find this out, not realizing I already had a copy in my MSKLC folder.

What exactly did you find out? That there are no examples of keyboards with ligatures? That's accurate. In the actual Windows Driver Kit (WDK), there are zero examples of keyboards with ligatures. This point is noteworthy, as it says much about the support Microsoft grants developers of keyboard layouts. Tell me what use that poor sample collection is, leaving you alone to program a ligature table from scratch! Fortunately, I managed that job. But that's not the topic.

Now about what kbd.h contains. It contains a define for a ligature table with two characters, then it contains a define for a ligature table with three characters, then one for a table with four characters, then one for five. Oh what? Yes, for a ligature table containing ligatures of five whole Unicode characters. This define has been quoted in this thread, so there's nothing new. Further, we know that kbd.h is not the only header file of a given keyboard layout. Each driver has its own dedicated header file. To put what in?
Scan code to virtual key undefines and new defines, but also all other needed defines, among which the define of a longer ligature table, which can also be inserted just before the table. With all that, I mean that the developer must look for himself. He is given a number of hints and pieces of advice in the comments, but that's roughly all. And unfortunately it isn't complete. At least not about keyboard drivers.

> > For not wanting to make anybody lose face, I didn't mention that a
> > copy of the cited and quoted header file is included in the MSKLC.
>
> Yep, I could have saved a lot of time if I'd noticed that.

Sorry.

> > The version 1.4 of which dates from Thu, Jan 25, 2007, 23:14:22,
> > whereas the included kbd.h shows "10-Jan-1991 GregoryW" in the file
> > history.
>
> My copy says:
>
> @@BEGIN_DDKSPLIT
> * 10-Jan-1991 GregoryW
> * 23-Apr-1991 IanJa VSC_TO_VK _* macros from oemtab.c
> @@END_DDKSPLIT

That's what my copy says too, but I focused on the author of the biggest part, as IanJa only added the macros from oemtab.c. And on the date, which [MSKLC]\inc\kbd.h is the only file to provide, the History in [WDK]\inc\kbd.h being empty.

> Looks like things have been pretty stable since 1991.
>
> > Therefore, my supposition (I hadn't looked that up!) that "when the
> > MSKLC was built, Windows did not support more than four characters per
> > ligature. (That's the only straightforward explanation of this point
> > of the MSKLC.)" turns out to be completely wrong (except the
> > parenthesized disclaimer). I could become more explicit, but I just
> > stand back in order not to heat up the discussion with ad personam
> > conclusions.
>
> Good idea.

Objectively, we must conclude that there was a decision to limit ligature support to four characters despite Windows being built to support far more, and so on. You know, when people invoke hell when making assertions, I'm quite doubtful.
> What led you to the conclusion that this limit had been increased,
> anyway? ("On my machine the maximal length is 16 characters.") I'm still
> curious about that.

The limit being increased was not a conclusion of mine; it was advice I got on a web page somewhere. There was a conclusion of mine, the history of which can be read in one of my previous replies. In the archive it's all wrecked: http://www.unicode.org/mail-arch/unicode-ml/y2015-m08/0024.html I must resend it. In the meantime, it may be quoted:

>>> Following some advice I programmed a ligature of 35 characters on Sat Feb 28, 2015; thus I had the opportunity of seeing that 16 characters were inserted, while the overflow was discarded. Windows worked normally. The mapped virtual key was VK_OEM_COMMA, and the modification number was 8, that is Shift+AltGr+Kana (which it is still today).

> > I often wondered why the description page
> > [https://msdn.microsoft.com/fr-fr/goglobal/bb964665.aspx] and even
> > more the download page
> > [https://www.microsoft.com/en-us/download/details.aspx?id=22339] have
> > not been updated (no mention of Windows 8 on the former, no mention
> > even of Windows 7 in the system requirements on the download page),
> > and why there's no 2.0 version of the MSKLC.
>
> Microsoft simply hasn't dedicated any resources (Michael or anyone else)
> to updating MSKLC. Michael has blogged about this many, many times in
> the past few years. Big companies make the decisions that they make, for
> the reasons they have.

I'm sorry, I hadn't read Michael's blog posts about not updating the MSKLC. He must be angry. It's a pity for everybody. But I also understand the point of view of the company it depends on. To invest in free software that allows users to become more independent of charmaps and autocorrect and IMEs may be somewhat outside the business model. -- But the main reason may be that the need is already catered for, notably by Tavultesoft Keyman.
However, if the 2.0 MSKLC had stuck with four-character ligatures...

> > I'm not here to criticize Microsoft. I ask everybody to be honest and
> > to answer for himself one single question: How on earth could I prefer
> > Bing if I were battling against Microsoft? Does anybody really believe
> > that I'm annoying myself to find more bugs?
>
> I apologize for my tone in this thread. See my explanation above of when
> "bug" is an appropriate conclusion to draw, and when it isn't. That got
> me started.

It's all right. I apologize again on my behalf.

> > Hopeful that this will end in a constructive way,
>
> Agreed.

Best regards,

Marcel

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From charupdate at orange.fr Sat Aug 8 06:06:58 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Sat, 8 Aug 2015 13:06:58 +0200 (CEST)
Subject: Windows keyboard restrictions
Message-ID: <1793223358.3384.1439032018720.JavaMail.www@wwinf2232>

I suspect another "bug" in my mailbox, such that this mail, which I sent in plain text on Thu, 6 Aug 2015, landed all wrecked in the Archive. Please allow me to resend it. (I have already ended up replacing all < and > with single angle quotation marks.) If you are a Mailing List subscriber, please take no notice :-|

___________________________________________________________________

A part of the documentation you request is available:

"Download Windows Driver Kit Version 7.1.0 from Official Microsoft Download Center." N. p., 1 Dec. 2014. Web. 1 Dec. 2014. https://www.microsoft.com/en-us/download/details.aspx?id=11800

C:\WinDDK\7600.16385.1\inc\api\kbd.h Line 469, and preceding.

---

Note: This file can be accessed online on third-party websites. You may wish to try a Bing* search for "kbd.h".

* My preferred internet search engine is Bing ☐ [This ballot box U+2610 is on my keyboard at Shift+Kana+L.]
---

To circumvent the issues arising from the word "bug", we may simply ban it and focus on a few comments:

 * Ligature is an internal name for Wchar sequences that are
 * generated when a specified key is pressed.
 *
 * The ligatures are all in *one* table.
 * The number of characters of the longest ligature
 * determines the number replacing the "n" in
 *     static ALLOC_SECTION_LDATA LIGATURE"n"
 * and in
 *     "n",
 *     sizeof(aLigature[0]),
 *     (PLIGATURE1)aLigature
 * below in the static ALLOC_SECTION_LDATA KBDTABLES.
 *
 * The maximum length of ligatures is 16 characters.
 * Characters from the 17th on are discarded.
 *
 * The ligature table must be defined for "n" characters,
 * whether in kbd.h, or kbdfrfre.h, or here before,
 * using the following define:
 *     TYPEDEF_LIGATURE("n")
 * For clarification, a trailing comment is added:
 *     // LIGATURE"n", *PLIGATURE"n"
 * Tables for up to 5 characters length are already defined in
 * C:\WinDDK\7600.16385.1\inc\api\kbd.h.
 *
 * The remaining Wchar fields of each ligature that is shorter than
 * the maximum length defined may be filled up with 0xF000, or with
 * WCH_NONE as defined in kbd.h, or NONE if defined in the custom header.
 * These entries may be shortened, especially when the ligature table
 * is not edited in a single spreadsheet table.

What's new for me is that "sometimes" [scare quotes], the ligature length must not exceed four characters. I already knew what's written in the MSKLC Help about this topic, and I explained in my previous e-mail that, when the MSKLC was built, Windows did not support more than four characters per ligature. (That's the only straightforward explanation of this point of the MSKLC.) As this proved to be insufficient, Microsoft must have decided to raise the limit to sixteen.

Following some advice I programmed a ligature of 35 characters on Sat Feb 28, 2015; thus I had the opportunity of seeing that 16 characters were inserted, while the overflow was discarded. Windows worked normally.
The mapped virtual key was VK_OEM_COMMA, and the modification number was 8, that is Shift+AltGr+Kana (which it is still today).

After the recent event, we may add the following:

 * CAUTION: THE INITIAL MAXIMUM LENGTH OF LIGATURES WAS 4 CHARACTERS.
 * DEPENDING ON THE SHIFT STATE, IT MAY HAPPEN THAT LONGER LIGATURES
 * CAUSE THE KEYBOARD DRIVER TO FAIL.
 * IF THE FAILING DRIVER IS THE ONE OF THE DEFAULT KEYBOARD LAYOUT,
 * THE SYSTEM MAY NOT WORK AS EXPECTED.

I'm hopeful that you will agree upon this formulation, and I hope that helps.

Best regards,

Marcel Schneider

On 03 Aug 2015, at 20:02, Doug Ewell wrote:

> Marcel Schneider wrote:
>
> > The bug on Windows I encountered at the end of July has been
> > definitely identified and reconstructed. After ninety-five drivers
> > compiled since the bug appeared, I can tell so much as that the
> > problem is related to the length of the so-called ligatures. When the
> > MSKLC was built, they were limited to four characters on Windows (see
> > glossary in the MSKLC Help). On my machine the maximal length is 16
> > characters. The problem is that this is not equal on all shift states
> > and perhaps keys. Roughly, I can put five characters on modification
> > number three, that is normally AltGr, but not on #4 (Shift+AltGr).
>
> As far as I can tell, the limit for a ligature on a Windows keyboard
> layout is four UTF-16 code points:
>
> MSKLC help, under "Validation Reference":
> "Ligatures cannot contain more than four UTF-16 code points"
>
> Presentation from IUC 23 by Michael Kaplan (author of MSKLC) and Cathy
> Wissink:
> http://tinyurl.com/o49r4bz
>
> KbdEdit:
> http://www.kbdedit.com/
>
> MUFI:
> http://folk.uib.no/hnooh/mufi/keyboards/WinXPkeyboard.html
>
> I understand that there are some tools (such as Keyboard Layout Manager)
> that claim a higher limit, and it may even be possible in some cases to
> assign more than four, but the DOCUMENTED limit appears to be four.
> (If
> you claim that it is not, please provide a link to the relevant official
> documentation, and note that C++ code showing 16 fields is not
> documentation.)
>
> It is not a bug for software to fail to perform BEYOND its documented
> limits.
>
> Since you are so very eager to declare this a bug, or a collection of
> bugs, rather than a design limitation, I strongly recommend you get in
> touch with Microsoft Technical Support and express your concerns to
> them. Make sure to let them know just how certain you are that these are
> bugs. See if they'll send you a T-shirt.
>
> --
> Doug Ewell | http://ewellic.org | Thornton, CO ????

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From charupdate at orange.fr Sat Aug 8 07:05:17 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Sat, 8 Aug 2015 14:05:17 +0200 (CEST)
Subject: Windows keyboard restrictions
In-Reply-To: 
References: <20150807152116.665a7a7059d7ee80bb4d670165c8327d.f3e8563f5d.wbe@email03.secureserver.net>
Message-ID: <1149930762.7711.1439035518032.JavaMail.www@wwinf1j11>

On 08 Aug 2015, at 02:19, Andrew Glass (WINDOWS) wrote:

> Sorry to be late to this thread. I'm the Program Manager responsible for MSKLC at this time. As far as the history here, I can only reiterate Michael's point that making significant changes to user32.dll faces significant, perhaps insurmountable headwinds. There would have to be compelling reasons to make any kind of changes here. If you have specific feedback for Microsoft on this issue, please follow up with me off line.

Thank you. While *one* dimension of this thread is to get minor changes made in order to ensure ligature support for 16 characters uniformly in Windows keyboard drivers, the main concern at this point of the thread is to learn something about the actual support, as well as the support at the time of MSKLC:

1. On Windows, up to how many characters may be inserted with one single keystroke:

1.1.
At the time of MSKLC 1.0? 1.2. When MSKLC was updated from version 1.3 to 1.4? 1.3. At the time of Windows Seven, that is 6.1, Build 7601 (SP1)? 1.4. Today, that is on Windows 10? It is supposed that a keyboard driver is used in whose source a ligature table is defined for whatever number of characters (2, 3, 4, 5, 6, ... 16, ... 32, ... 60, ... 100, ...). 2. Supposed that Windows supported more than four characters per ligature: 2.1. Why has the MSKLC been limited to four characters per ligature? 2.2. Who or what body made the demand of the limitation to four characters? 2.3. Why does the MSKLC Help state (Glossary - Ligature) that the maximum number supported by Windows is four characters? 2.4. How Microsoft dealt with user demands for support of longer ligatures? Best regards, Marcel Schneider -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Sat Aug 8 07:51:40 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 8 Aug 2015 13:51:40 +0100 Subject: Windows keyboard restrictions In-Reply-To: <1149930762.7711.1439035518032.JavaMail.www@wwinf1j11> References: <20150807152116.665a7a7059d7ee80bb4d670165c8327d.f3e8563f5d.wbe@email03.secureserver.net> <1149930762.7711.1439035518032.JavaMail.www@wwinf1j11> Message-ID: <20150808135140.4d0dc8d7@JRWUBU2> On Sat, 8 Aug 2015 14:05:17 +0200 (CEST) Marcel Schneider wrote: > 2. Supposed that Windows supported more than four characters per > ligature: > 2.1. Why has the MSKLC been limited to four characters per > ligature? Because that was believed to be the architectural limit. Note however, that it isn't 4 *characters* that is the limit, but 4 UTF-16 code units. Richard. 
From charupdate at orange.fr Sat Aug 8 08:26:31 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Sat, 8 Aug 2015 15:26:31 +0200 (CEST) Subject: Windows keyboard restrictions In-Reply-To: <20150807152116.665a7a7059d7ee80bb4d670165c8327d.f3e8563f5d.wbe@email03.secureserver.net> References: <20150807152116.665a7a7059d7ee80bb4d670165c8327d.f3e8563f5d.wbe@email03.secureserver.net> Message-ID: <971687439.5732.1439040391883.JavaMail.www@wwinf2202> On 08 Aug 2015, at 00:30, Doug Ewell wrote: > Marcel Schneider wrote: > > > I brought the good news that SIXTEEN UNICODE CODE POINTS can be > > generated by a single key stroke on Windows six dot one. The only bad > > news, because of which I've e-mailed to the List, is that that wasn't > > working in one single circumstance. It was obvious that the main thing > > to do, is to inform about this fact, so that other people mustn't > > search for a bug in the driver if it's only that. > > But that's what I've been trying to say. The maximum isn't 16, it's 4. > "That wasn't working" is the expected behavior here. > > If you were able to create a keyboard layout where 16 code points ever > worked on Windows 7 (which reports itself as "6.1"), Indeed I didn't check on Wikipedia that Windows 7 has the same version number, build number and service pack as Windows 7 Starter which was delivered with the netbooks. So my Windows is the true Windows 7 except some limitations, but *not* the one that one couldn't open more than three applications simultaneously. This limitation we've been saved from, appears to me as a paradigm of all limitations of that sort: a useless worsening of the usability and of the usefulness of a product. There is a use but it's economical, to allow manufacturers to buy the OS a bit cheaper, with respect to the overall price of netbooks. Now, question: What's the advantage of being limited to four characters, even if you're a professional and corporate user? Or a scholar? 
> it was purely by accident -- because Windows 7 did not check for the overrun, > and because the overrun did not happen to cause any collateral damage. Windows *did* check for the overflow! This is why *sixteen* characters *only* were inserted, *not* thirty-five. > > If you have a light bulb that's rated for 110 volts, and you apply 220 > volts to it and for some reason the bulb doesn't burn out immediately, > that doesn't mean 220 volts is the correct operating environment for > that bulb. It means you got lucky. That seems a good reasoning. I'm just not quite sure whether limiting ligatures to four instead of sixteen may be compared to electrotechnics. > > If there's a bug here, it's that Windows didn't detect that the limit > had been exceeded, and respond by locking out the key. Again, Windows did detect that the ligature was far too long, and consequently limited it to 16. And it did so *without* any collateral damage: no app blocked, no keyboard disabled, just a handful of characters not inserted while they were programmed. That's not worth mentioning except for the case study. Sixteen on one single keystroke is IMHO largely enough. But four is *not*. That is what Microsoft knows, and that is why Microsoft asked its Windows developers to raise the limit, IMHO. Best regards, Marcel From charupdate at orange.fr Sat Aug 8 09:22:57 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Sat, 8 Aug 2015 16:22:57 +0200 (CEST) Subject: Windows keyboard restrictions In-Reply-To: <20150808135140.4d0dc8d7@JRWUBU2> References: <20150807152116.665a7a7059d7ee80bb4d670165c8327d.f3e8563f5d.wbe@email03.secureserver.net> <1149930762.7711.1439035518032.JavaMail.www@wwinf1j11> <20150808135140.4d0dc8d7@JRWUBU2> Message-ID: <2027666553.10656.1439043777467.JavaMail.www@wwinf1j19> On 08 Aug 2015, at 15:01, Richard Wordingham wrote: > On Sat, 8 Aug 2015 14:05:17 +0200 (CEST) > Marcel Schneider wrote: > > > 2. 
Supposed that Windows supported more than four characters per > ligature: > > 2.1. Why has the MSKLC been limited to four characters per > ligature? > > Because that was believed to be the architectural limit. Note however, > that it isn't 4 *characters* that is the limit, but 4 UTF-16 code units. I'm very puzzled about this being UTF-16 code units, as stated also in the MSKLC Help. In the driver source kbd*****.c, each of those entities is referred to as WCHAR, which is meant to mean (^^) "UNICODE CHARACTER". Indeed we can write 0x1234 for a given WCHAR in a driver source, but also 0x101234 for another given WCHAR if that's its code point. Nowhere is there any UTF appearing. Despite having looked up the Unicode FAQs about Unicode transformation formats, I'm unable to make the link. > Because that was believed to be the architectural limit. I'm urged not to speculate, and to stick with facts and documentation. Now look here an authoritative expert who is reduced to stand upon his beliefs about keyboard limitations while working as an employee of the company where the same keyboard limitations were designed, implemented, compiled, released and shipped from-----or NOT. At his place I would have asked my boss for access to the Windows keyboard layout framework source files and development roadmaps. Turning it the other way round, Microsoft must not ask somebody to write some keyboard creating software without granting him full access to all documentation. At least that's my opinion. BTW, I would like to have everybody note that a Help section of another software is *not* documentation (with the meaning the word has in this thread). Nor is a PowerPoint. Nor are third party keyboard software websites. Nor is anything that is not a comment in a source file, or a technical document issued by the department that really worked out the discussed software; or a code line, because in my belief, this is strong evidence. Thank you for your comment.
Further, we're awaiting the responses from Mr Glass at Microsoft. Best regards, Marcel Schneider -------------- next part -------------- An HTML attachment was scrubbed... URL: From eliz at gnu.org Sat Aug 8 09:31:30 2015 From: eliz at gnu.org (Eli Zaretskii) Date: Sat, 08 Aug 2015 17:31:30 +0300 Subject: Windows keyboard restrictions In-Reply-To: <2027666553.10656.1439043777467.JavaMail.www@wwinf1j19> References: <20150807152116.665a7a7059d7ee80bb4d670165c8327d.f3e8563f5d.wbe@email03.secureserver.net> <1149930762.7711.1439035518032.JavaMail.www@wwinf1j11> <20150808135140.4d0dc8d7@JRWUBU2> <2027666553.10656.1439043777467.JavaMail.www@wwinf1j19> Message-ID: <834mk9rgvx.fsf@gnu.org> > Date: Sat, 8 Aug 2015 16:22:57 +0200 (CEST) > From: Marcel Schneider > > I'm very puzzled about this being UTF-16 code units, as stated also in the > MSKLC Help. In the driver source kbd*****.c, each of those entities is referred > to as WCHAR, which is meant to mean (^^) "UNICODE CHARACTER". Indeed we can > write 0x1234 for a given WCHAR in a driver source, but also 0x101234 for > another given WCHAR if that's its code point. Nowhere there is any UTF > appearing. Despite of having looked up the Unicode FAQs about Unicode > transformation formats, I'm unable to make the link. The Windows WCHAR is a 16-bit data type. What Windows documentation calls "Unicode characters" are Unicode codepoints encoded in UTF-16.
From charupdate at orange.fr Sat Aug 8 10:10:58 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Sat, 8 Aug 2015 17:10:58 +0200 (CEST) Subject: Windows keyboard restrictions In-Reply-To: <834mk9rgvx.fsf@gnu.org> References: <20150807152116.665a7a7059d7ee80bb4d670165c8327d.f3e8563f5d.wbe@email03.secureserver.net> <1149930762.7711.1439035518032.JavaMail.www@wwinf1j11> <20150808135140.4d0dc8d7@JRWUBU2> <2027666553.10656.1439043777467.JavaMail.www@wwinf1j19> <834mk9rgvx.fsf@gnu.org> Message-ID: <646624248.11997.1439046658425.JavaMail.www@wwinf1m26> On 08 Aug 2015, at 16:39, Eli Zaretskii wrote: > > The Windows WCHAR is a 16-bit data type. What Windows documentation > calls "Unicode characters" are Unicode codepoints encoded in UTF-16. > Thanks a lot! Best regards, Marcel Schneider -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Sat Aug 8 10:44:55 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 8 Aug 2015 16:44:55 +0100 Subject: Windows keyboard restrictions In-Reply-To: <2027666553.10656.1439043777467.JavaMail.www@wwinf1j19> References: <20150807152116.665a7a7059d7ee80bb4d670165c8327d.f3e8563f5d.wbe@email03.secureserver.net> <1149930762.7711.1439035518032.JavaMail.www@wwinf1j11> <20150808135140.4d0dc8d7@JRWUBU2> <2027666553.10656.1439043777467.JavaMail.www@wwinf1j19> Message-ID: <20150808164455.7b244c50@JRWUBU2> On Sat, 8 Aug 2015 16:22:57 +0200 (CEST) Marcel Schneider wrote: > Further, we're awaiting the responses from Mr Glass at Microsoft. See http://unicode.org/pipermail/unicode/2015-August/002465.html . More information would take time. Richard.
From doug at ewellic.org Sat Aug 8 12:36:02 2015 From: doug at ewellic.org (Doug Ewell) Date: Sat, 8 Aug 2015 11:36:02 -0600 Subject: Windows keyboard restrictions In-Reply-To: <1149930762.7711.1439035518032.JavaMail.www@wwinf1j11> References: <20150807152116.665a7a7059d7ee80bb4d670165c8327d.f3e8563f5d.wbe@email03.secureserver.net> <1149930762.7711.1439035518032.JavaMail.www@wwinf1j11> Message-ID: Now that I know Andrew is the PM for MSKLC ?, and can answer Marcel's questions (publicly or privately) with authority, I'll duck out of this thread. ? I'm glad to hear that there is such a person. I was afraid the project had been left to die. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From: Marcel Schneider Sent: Saturday, August 8, 2015 6:05 To: Andrew Glass (WINDOWS) Cc: Doug Ewell ; Unicode Mailing List Subject: RE: Windows keyboard restrictions On 08 Aug 2015, at 02:19, Andrew Glass (WINDOWS) wrote: > Sorry to be late to this thread. I'm the Program Manager responsible > for MSKLC at this time. As far as the history here, I can only > reiterate Michael's point that making significant changes to > user32.dll faces significant, perhaps insurmountable headwinds. There > would have to be compelling reasons to make any kind of changes here. > If you have specific feedback for Microsoft on this issue, please > follow up with me off line. Thank you. While *one* dimension of this thread is to get minor changes performed in order to asset ligatures support for 16 characters uniformly in Windows keyboard drivers, the main concern at the actual point of the thread is to know something about the actual support as well as at the time of MSKLC: 1. On Windows, up to how many characters may be inserted with one single key stroke: 1.1. At the time of MSKLC 1.0? 1.2. When MSKLC was updated from version 1.3 to 1.4? 1.3. At the time of Windows Seven, that is 6.1, Build 7601 (SP1)? 1.4. Today, that is on Windows 10? 
It is supposed that a keyboard driver is used in whose source a ligature table is defined for whatever number of characters (2, 3, 4, 5, 6, ... 16, ... 32, ... 60, ... 100, ...). 2. Supposed that Windows supported more than four characters per ligature: 2.1. Why has the MSKLC been limited to four characters per ligature? 2.2. Who or what body made the demand of the limitation to four characters? 2.3. Why does the MSKLC Help state (Glossary - Ligature) that the maximum number supported by Windows is four characters? 2.4. How Microsoft dealt with user demands for support of longer ligatures? Best regards, Marcel Schneider From asmus-inc at ix.netcom.com Sat Aug 8 14:09:12 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Sat, 8 Aug 2015 12:09:12 -0700 Subject: Windows keyboard restrictions In-Reply-To: <971687439.5732.1439040391883.JavaMail.www@wwinf2202> References: <20150807152116.665a7a7059d7ee80bb4d670165c8327d.f3e8563f5d.wbe@email03.secureserver.net> <971687439.5732.1439040391883.JavaMail.www@wwinf2202> Message-ID: <55C653D8.4010007@ix.netcom.com> An HTML attachment was scrubbed... URL: From marc at keyman.com Sat Aug 8 15:46:14 2015 From: marc at keyman.com (Marc Durdin) Date: Sat, 8 Aug 2015 20:46:14 +0000 Subject: Windows keyboard restrictions In-Reply-To: <20150808110506.58ec0cc8@JRWUBU2> References: <20150807092656.665a7a7059d7ee80bb4d670165c8327d.154138f225.wbe@email03.secureserver.net> <20150807195415.4c725da4@JRWUBU2> <20150808110506.58ec0cc8@JRWUBU2> Message-ID: <1CEDD746887FFF4B834688E7AF5FDA5A821B2135@federation.tavultesoft.local> Richard Wordingham wrote: > On Sat, 8 Aug 2015 17:05:26 +1000 > Andrew Cunningham wrote: > > > > Michael did do a series of blog posts on building TSF based input > > methods years ago. Something I tinkered with off and on. > > Does this mean that one can put it all together from his reconstituted blog? I > don't know how much was salvaged. 
Michael has publicly complimented Marc > Durdin on being able to find his way through the published Microsoft > documentation to make TSF work for him once Microsoft had fixed the bugs he > had identified. The TSF documentation is definitely sparse and there are some real challenges in getting up to speed with what is a very complex API. But the real challenges in input method implementation come more from the vast array of interpretations of how keyboard input should be consumed by Windows apps, and the extremely patchy support for TSF (and "foreign" language input) in apps. That's where I've killed thousands of hours over the last 15 years; once you get your head around the TSF model it's not too hard to code to. Clearly, you've seen some of the same compatibility problems with KMFL. And our experiences on Mac OS X, Android and iOS are much the same. For example, Norbert Lindenberg's excellent blog on developing keyboards for iOS details much that is missing from the API docs: http://norbertlindenberg.com/2014/12/developing-keyboards-for-ios/ There is a massive cost to developing -- and maintaining -- a native code input method for each language and each OS. I'm really trying to minimize this cost with Keyman Developer 9. Keyman Developer 9 is a free product (http://keyman.com/developer/). It is currently in beta but is relatively stable. From charupdate at orange.fr Sat Aug 8 15:57:25 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Sat, 8 Aug 2015 22:57:25 +0200 (CEST) Subject: Windows keyboard restrictions In-Reply-To: References: <20150807152116.665a7a7059d7ee80bb4d670165c8327d.f3e8563f5d.wbe@email03.secureserver.net> <1149930762.7711.1439035518032.JavaMail.www@wwinf1j11> Message-ID: <2105459356.15150.1439067445761.JavaMail.www@wwinf1k29> On 08 Aug 2015, at 19:45, Doug Ewell wrote: > Now that I know Andrew is the PM for MSKLC ?, Probably Mr Glass wasn't Mr Kaplan's boss, so he has to take over a legacy without having been involved in creating it.
I didn't notice the chronological relationship well, so I asked questions whose answers could require searching the archives.--- > and can answer Marcel's questions (publicly or privately) with authority, Mr Kaplan too is authoritative. The difference might be that Mr Glass actually has access to all the needed documentation. > I'll duck out of this thread. You're not supposed to. But in any case, I would like to thank you for all you brought into this thread. It has been very enriching and brought some insight I wouldn't have got. > > ? I'm glad to hear that there is such a person. I was afraid the project > had been left to die. Indeed there seems to be like a malediction upon the MSKLC. The uppermost problem now is that reputations are linked to the low limit on ligature length. Supposed the low limit is untrue, then Mr Glass can hardly answer these questions publicly. If he agrees to do so privately, I'll be bound by a secret and will be hindered in providing help for my eventual future keyboard drivers. I don't know how to get out of trouble. If I write on a web page that we can have up to 16 UTF-16 code units per ligature, there can always be somebody starting up who's telling that's wrong and my drivers were a hack. Probably I do end up wishing there would never have been an MSKLC. At least we might think that possibly there is no update because 2.0 would have stuck with that low limit. We hope there will be enough solutions for all Unicode implementations. Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed...
URL: From asmus-inc at ix.netcom.com Sat Aug 8 16:07:08 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Sat, 8 Aug 2015 14:07:08 -0700 Subject: Windows keyboard restrictions In-Reply-To: <2105459356.15150.1439067445761.JavaMail.www@wwinf1k29> References: <20150807152116.665a7a7059d7ee80bb4d670165c8327d.f3e8563f5d.wbe@email03.secureserver.net> <1149930762.7711.1439035518032.JavaMail.www@wwinf1j11> <2105459356.15150.1439067445761.JavaMail.www@wwinf1k29> Message-ID: <55C66F7C.6080908@ix.netcom.com> An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Sun Aug 9 05:58:20 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 9 Aug 2015 11:58:20 +0100 Subject: ZWJ as a Ligature Suppressor Message-ID: <20150809115820.67e0eead@JRWUBU2> According to the text just after TUS 7.0.0 Figure 23-3 (http://www.unicode.org/versions/Unicode7.0.0/ch23.pdf#G25237), ZWJ suppresses ligatures in Arabic script. Does this rule apply to other normally cursive joined scripts, e.g. Syriac and Mongolian? Am I right in thinking that for an OpenType font for other scripts, the font writer must take precautions to prevent ZWJ accidentally suppressing ligatures that would be better suppressed by ZWNJ or ? From richard.wordingham at ntlworld.com Sun Aug 9 06:09:19 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 9 Aug 2015 12:09:19 +0100 Subject: Standardised Encoding of Text Message-ID: <20150809120919.1adacf7c@JRWUBU2> Is there any mechanism to standardise the encoding of text that is composed of encoded characters that are all from a specific script or the common script? Richard. 
From eik at iki.fi Sun Aug 9 06:46:31 2015 From: eik at iki.fi (Erkki I Kolehmainen) Date: Sun, 9 Aug 2015 14:46:31 +0300 Subject: Standardised Encoding of Text In-Reply-To: <20150809120919.1adacf7c@JRWUBU2> References: <20150809120919.1adacf7c@JRWUBU2> Message-ID: <000001d0d299$0929fbc0$1b7df340$@fi> Sorry, but I find myself having a serious problem in understanding what this is about. Sincerely, Erkki I. Kolehmainen Tilkankatu 12 A 3, 00300 Helsinki, Finland Mob: +358400825943, Tel: +358943682643, Fax: +35813318116 -----Original message----- From: Unicode [mailto:unicode-bounces at unicode.org] On behalf of Richard Wordingham Sent: 9 August 2015 14:09 To: unicode at unicode.org Subject: Standardised Encoding of Text Is there any mechanism to standardise the encoding of text that is composed of encoded characters that are all from a specific script or the common script? Richard. From richard.wordingham at ntlworld.com Sun Aug 9 08:58:11 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 9 Aug 2015 14:58:11 +0100 Subject: Standardised Encoding of Text In-Reply-To: <000001d0d299$0929fbc0$1b7df340$@fi> References: <20150809120919.1adacf7c@JRWUBU2> <000001d0d299$0929fbc0$1b7df340$@fi> Message-ID: <20150809145811.3d1296d2@JRWUBU2> On Sun, 9 Aug 2015 14:46:31 +0300 "Erkki I Kolehmainen" wrote: > Sorry, but I find myself having a serious problem in understanding > what this is about. In some cases the TUS lays down in detail the order of characters and their interpretation. While Europeans have canonical combining classes to standardise the order of combining marks, lesser breeds tend not to receive them. It gets even worse when combining marks are defined by the combination of control character(s) and what appears to be a base character. For example, the order for the Khmer script was laid down in great detail. Similarly, the order for Burmese was laid out in great detail.
However, as support for other languages was added to the 'Myanmar' script, the ordering rules to cover the new characters were not promptly laid down. So the question is, how does one rectify the situation where the text in the Unicode Standard for a script is woefully inadequate? Richard. From mark at macchiato.com Sun Aug 9 10:10:01 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Sun, 9 Aug 2015 17:10:01 +0200 Subject: Standardised Encoding of Text In-Reply-To: <20150809145811.3d1296d2@JRWUBU2> References: <20150809120919.1adacf7c@JRWUBU2> <000001d0d299$0929fbc0$1b7df340$@fi> <20150809145811.3d1296d2@JRWUBU2> Message-ID: While it would be good to document more scripts, and more language options per script, that is always subject to getting experts signed up to develop them. What I'd really like to see instead of documentation is a data-based approach. For example, perhaps the addition of real data to CLDR for a "basic-validity-check" on a language-by-language basis. It might be possible to use a BNF grammar for the components, for which we are already set up. For example, something like (this was a quick and dirty transcription):

$word := $syllable+;
$syllable := $B [R C] (S R?)* (Z? V)? $O? $S?;
# UnicodeSets
$R := [\u17CC];
$C := [];
$S := [];
$V := []
$Z := [:joiner:]
$O := [...]
$B := [[:sc=khmer:]&[:L:]-$R-$C-$S-$V-$Z-$O]

The more these could use existing properties, like Indic_Positional_Category or IndicSyllabicCategory, the better. Doing this would have far more of an impact than just a textual description, in that it could be executed by code, for at least a reference implementation. Mark *« Il meglio è l'inimico del bene »* On Sun, Aug 9, 2015 at 3:58 PM, Richard Wordingham < richard.wordingham at ntlworld.com> wrote: > On Sun, 9 Aug 2015 14:46:31 +0300 > "Erkki I Kolehmainen" wrote: > > > Sorry, but I find myself having a serious problem in understanding > > what this is about.
> > In some cases the TUS lays down in detail the order of characters and > their interpretation. While Europeans have canonical combining classes > to standardise the order of combining marks, lesser breeds tend not to > receive them. It gets even worse when combining marks are defined by > the combination of control character(s) and what appears to be a base > character. For example, the order for the Khmer script was laid > down in great detail. Similarly, the order for Burmese was laid out in > great detail. However, as support for other languages was added to > the 'Myanmar' script, the ordering rules to cover the new characters > were not promptly laid down. > > So the question is, how does one rectify the situation where the text > in the Unicode Standard for a script is woefully inadequate. > > Richard. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Sun Aug 9 12:10:14 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 9 Aug 2015 18:10:14 +0100 Subject: Standardised Encoding of Text In-Reply-To: References: <20150809120919.1adacf7c@JRWUBU2> <000001d0d299$0929fbc0$1b7df340$@fi> <20150809145811.3d1296d2@JRWUBU2> Message-ID: <20150809181014.232da747@JRWUBU2> On Sun, 9 Aug 2015 17:10:01 +0200 Mark Davis ?? wrote: > While it would be good to document more scripts, and more language > options per script, that is always subject to getting experts signed > up to develop them. > > What I'd really like to see instead of documentation is a data-based > approach. > > For example, perhaps the addition of real data to CLDR for a > "basic-validity-check" on a language-by-language basis. CLDR is currently not useful. Are you really going to get Mayan time formats when the script is encoded? Without them, there will be no CLDR data. 
I would like to add data to a Pali in Thai script locale (or two - there are two Thai-script Pali writing systems, one with an implicit vowel and another without) to get proper word- and line-breaking. However, I'm stymied because the basic requirements for a locale are beyond me. It's telling that, the last time I looked, there was no Latin locale. I don't know the usage of the administration of the Church of Rome, which appears to be what CLDR wants for Latin. (My first degree was conferred in Latin, and it wasn't conferred in Rome.) Fortunately, one doesn't need that for a Latin spell-checker, and the default word- and line-breaking work well enough. Until someone sets up locale data for Tai Khuen (or Tai Lue), we probably won't have a locale to store Lanna script rules with. > It might be > possible to use a BNF grammar for the components, for which we are > already set up. Are you sure? Microsoft's Universal Script Engine (USE) intended design has a rule for well-formed syllables which essentially contains a fragment, when just looking at dependent vowels: [:InPC=Top:]*[:InPC=Bottom:]* Are you set up to say whether the following NFD Tibetan fragment conforms to it? Example: The sequence of InPC values is . There are other examples around, but this is a pleasant one to think about. (The USE definition got more complicated when confronted with harsh reality. That confrontation may have happened very early in the design.) > For example, something like (this was a quick and > dirty transcription): > > $word := $syllable+; Martin Hosken put something like that together for the Lanna script. On careful inspection: (a) It seemed to allow almost anything; (b) It was not too lax. Much later, I have realised that (c) It was too strict if read as it was meant to be read, i.e. not literally. (d) It overlooked a logogram for 'elephant' that contains a marginally dependent vowel.
Though it might indeed be useful in general, the formal description would need to be accompanied by an explanation of what was happening. The problem with the Lanna script is that it allows a lot of abbreviation, and it makes sense to store the undeleted characters in their normal order. The result of this is that one often can't say a sequence is non-standard unless you know roughly how to pronounce it. > Doing this would have far more of an impact than just a textual > description, in that it could executed by code, for at least a > reference implementation. I don't like the idea of associating the description with language rather than script. Imagine the trouble you'll have with Tamil purists. They'll probably want to ban several consonants. You'll end up needing a locale for Sanskrit in the Tamil script. Richard. From richard.wordingham at ntlworld.com Sun Aug 9 13:38:45 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 9 Aug 2015 19:38:45 +0100 Subject: Standardised Encoding of Text In-Reply-To: References: <20150809120919.1adacf7c@JRWUBU2> <000001d0d299$0929fbc0$1b7df340$@fi> <20150809145811.3d1296d2@JRWUBU2> Message-ID: <20150809193845.5931b6cf@JRWUBU2> On Sun, 9 Aug 2015 17:10:01 +0200 Mark Davis wrote: > While it would be good to document more scripts, and more language > options per script, that is always subject to getting experts signed > up to develop them. > > What I'd really like to see instead of documentation is a data-based > approach. > > For example, perhaps the addition of real data to CLDR for a > "basic-validity-check" on a language-by-language basis. One aspect this would not help with is with letter forms that do not resemble their forms in the code charts. The code charts usually broadly answer the question "What does this code represent?". They don't answer the question, "What code points represent this glyph?". 
Problems I've seen in Tai Tham are the use of U+1A57 TAI THAM CONSONANT SIGN LA TANG LAI for the sequence and of for . The problem is that the subscript forms for U+1A43 and U+1A3F are only documented in the proposals. The subscript consonant signs probably add to the confusion of anyone working from the code chart. The people making the errors were far from ignorant of the script. Richard. From mark at macchiato.com Sun Aug 9 14:14:38 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Sun, 9 Aug 2015 21:14:38 +0200 Subject: Standardised Encoding of Text In-Reply-To: <20150809181014.232da747@JRWUBU2> References: <20150809120919.1adacf7c@JRWUBU2> <000001d0d299$0929fbc0$1b7df340$@fi> <20150809145811.3d1296d2@JRWUBU2> <20150809181014.232da747@JRWUBU2> Message-ID: Mark *« Il meglio è l'inimico del bene »* On Sun, Aug 9, 2015 at 7:10 PM, Richard Wordingham < richard.wordingham at ntlworld.com> wrote: > On Sun, 9 Aug 2015 17:10:01 +0200 > Mark Davis ☕️ wrote: > > > While it would be good to document more scripts, and more language > > options per script, that is always subject to getting experts signed > > up to develop them. > > > > What I'd really like to see instead of documentation is a data-based > > approach. > > > > For example, perhaps the addition of real data to CLDR for a > > "basic-validity-check" on a language-by-language basis. > > CLDR is currently not useful. Are you really going to get Mayan time > formats when the script is encoded? Without them, there will be no CLDR > data. That is a misunderstanding. CLDR provides both locale (language) specific data for formatting, collation, etc., but also data about languages. It is not limited to the first. > > It might be > > possible to use a BNF grammar for the components, for which we are > > already set up. > > Are you sure? I said "might be possible". That normally indicates a degree of uncertainty. That is, "no, I'm not sure".
There is no reason to be unnecessarily argumentative; it doesn't exactly encourage people to explore solutions to a problem. > Microsoft's Universal Script Engine (USE) intended design > has a rule for well-formed syllables which essentially contains a > fragment, when just looking at dependent vowels: > > [:InPC=Top:]*[:InPC=Bottom:]* > > Are you set up to say whether the following NFD Tibetan fragment > conforms to it? > > Example: > > The sequence of InPC values is . There are other examples > around, but this is a pleasant one to think about. > > (The USE definition got more complicated when confronted with harsh > reality. That confrontation may have happened very early in the > design.) > > > For example, something like (this was a quick and > > dirty transcription): > > > > $word := $syllable+; > > > Martin Hosken put something like that together for the Lanna script. > On careful inspection: > > (a) It seemed to allow almost anything; > (b) It was not too lax. > > Much later, I have realised that > > (c) It was too strict if read as it was meant to be read, i.e. not > literally. > (d) It overlooked a logogram for 'elephant' that contains a marginally > dependent vowel. > > Though it might indeed be useful in general, the formal description > would need to be accompanied by an explanation of what was happening. > The problem with the Lanna script is that it allows a lot of > abbreviation, and it makes sense to store the undeleted characters in > their normal order. The result of this is that one often can't say a > sequence is non-standard unless you know roughly how to pronounce it. > I don't think any algorithmic description would get all and only those strings that would be acceptable to writers of the language. What you'd end up with is a mechanism that had three values: clearly ok (eg, cat), clearly bogus (eg, a\u0308\u0308\u0308\u0308), and somewhere in between.
> > Doing this would have far more of an impact than just a textual > > description, in that it could be executed by code, for at least a > > reference implementation. > > I don't like the idea of associating the description with language > rather than script. Imagine the trouble you'll have with Tamil > purists. They'll probably want to ban several consonants. You'll end > up needing a locale for Sanskrit in the Tamil script. > Someone was just saying "However, as support for other languages was added to the 'Myanmar' script, the ordering rules to cover the new characters were not promptly laid down." If the goal for the script rules is to cover all languages customarily written with that script, one way to do that is to develop the language rules as they come, and make sure that the script rules are broadened if necessary for each language. But there is also utility to having the language rules, especially for high-frequency languages. > > Richard. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Sun Aug 9 16:03:37 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 9 Aug 2015 22:03:37 +0100 Subject: Standardised Encoding of Text In-Reply-To: References: <20150809120919.1adacf7c@JRWUBU2> <000001d0d299$0929fbc0$1b7df340$@fi> <20150809145811.3d1296d2@JRWUBU2> <20150809181014.232da747@JRWUBU2> Message-ID: <20150809220337.147e0d72@JRWUBU2> On Sun, 9 Aug 2015 21:14:38 +0200 Mark Davis ☕️ wrote: > Mark > > *« Il meglio è l'inimico del bene »* > > On Sun, Aug 9, 2015 at 7:10 PM, Richard Wordingham < > richard.wordingham at ntlworld.com> wrote: > > > On Sun, 9 Aug 2015 17:10:01 +0200 > > Mark Davis ☕️ wrote: > > > For example, perhaps the addition of real data to CLDR for a > > > "basic-validity-check" on a language-by-language basis. > > CLDR is currently not useful. Are you really going to get Mayan > > time formats when the script is encoded?
Without them, there will > > be no CLDR data. > That is a misunderstanding. CLDR provides not only locale (language) > specific data for formatting, collation, etc., but also data about > languages. It is not limited to the first. I'm basing my statement on the 'minimal data commitment' listed in http://cldr.unicode.org/index/cldr-spec/minimaldata . If there is a sustained failure to provide 4 main date/time formats, the locale may be removed. > > > It might be > > > possible to use a BNF grammar for the components, for which we are > > > already set up. > > Are you sure? > I said "might be possible". That normally indicates a degree of > uncertainty. That is, "no, I'm not sure". > There is no reason to be unnecessarily argumentative; it doesn't > exactly encourage people to explore solutions to a problem. I was responding to the 'for which we are already set up'. The problem is that canonical equivalence can make it very difficult to specify a syntax. The text segmentation appendices suggest that you have already hit trouble with canonical equivalence; I suspect you have tools set up to prevent such problems recurring. With a view to analysing the requirements of the USE, I investigated the effects of canonical equivalence on regular expressions. I eventually discovered the relevant mathematical theory - it replaces strings by 'traces', which for our purposes are fully decomposed character strings modulo canonical equivalence. I found very little interest in the matter on this list. I gave the example of the regular expression [:InPC=Top:]*[:InPC=Bottom:]* Usefully converting that expression to specify NFD equivalents in accordance with UTS #18 Version 17 Section 2.1 is non-trivial, though it is doable. I have a feeling that some have claimed that an expression like that is already in NFD. > I don't think any algorithmic description would get all and only those > strings that would be acceptable to writers of the language.
What > you'd end up with is a mechanism that had three values: clearly ok > (eg, cat), clearly bogus (eg, a\u0308\u0308\u0308\u0308), and > somewhere in between. What have you got against 8th derivatives? -:) You are looking at a different issue to me. One of the issues is rather that for a word of one syllable, there should only be one order per meaning, appearance and pronunciation for a pair of non-commuting combining marks. For non-Indic scripts, that is generally handled by ensuring that different orders of non-commuting combining marks render differently. > If the goal for the script rules is to cover all languages customarily > written with that script, one way to do that is to develop the > language rules as they come, and make sure that the script rules are > broadened if necessary for each language. But there is also utility > to having the language rules, especially for high-frequency languages. The language rules serve a different function. The sequence "xxxxlttttuuupppp" is clearly not English, but it is a perfectly acceptable string for sorting, searching and rendering. Richard. From charupdate at orange.fr Mon Aug 10 06:08:11 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 10 Aug 2015 13:08:11 +0200 (CEST) Subject: The role of documentation in implementation (was: Re: Windows keyboard limitations) Message-ID: <2027820082.7245.1439204891699.JavaMail.www@wwinf1m23> On 08 Aug 2015, at 15:01, Richard Wordingham wrote: > On Sat, 8 Aug 2015 14:05:17 +0200 (CEST) > Marcel Schneider wrote: > > > 2. Supposed that Windows supported more than four characters per > > ligature: > > > 2.1. Why has the MSKLC been limited to four characters per > > ligature? > > Because that was believed to be the architectural limit. Note however, > that it isn't 4 *characters* that is the limit, but 4 UTF-16 code units. Richard (be it allowed to use first name to conform to the usage) turned out to be the only person to answer one of my listed questions. 
Thanks again and my apologies for the tone of my reply, which finally has been influenced by the tone of the blog post, when I was talking about its author, but this too I'm sorry about and ask Michael to forgive me as we forgive his value lowering.** Remembering that I presumed that the limit was really four in old Windows versions and that it has been changed, I can now go a step further, admitting that the fundamental limit has always been sixteen since ligatures were implemented, and that lowering it to four in one single application was triggered by a cluster of circumstances that produced a belief. To assess the effects of a lack of documentation, we may cross-reference the situation with two quotations from the same header file [MSKLC]\inc\kbd.h, lines 731 sqq. and 751 sqq.: // There is no documented way to detect the manufacturer of the computer that // is currently running an application. However, a Windows-based application // can detect the type of OEM Windows by using the return value of the // GetKeyboardType() function. // Application programs can use these OEM IDs to distinguish the type of OEM // Windows. Note, however, that this method is not documented, so Microsoft // may not support it in the future version of Windows. May we conclude that whenever documentation is missing, we are allowed to rely on test results and other observations? About ligature support, Andrew Glass and previously Michael Kaplan assure that there will be no major change. Given that on Windows 7, sixteen code units are supported on all** current (tested) shift states, this allows us to conclude that stability must not necessarily rely on documentation (unlike it is suggested in the second quotation above). Thanks to the parent thread, I take notice however that when documentation is missing, unusual behavior of software must not be referred to as a bug.
If this thread, which spins off from a closed thread, is not followed up, we may take these statements for granted and build development policies upon them. Best regards, Marcel ** Note: More than four code units per ligature works now on *all* tested shift states. The reason why the unexpected behavior is now eliminated seems to be (in my belief) that VK_OEM_PA1 has been replaced with VK_OEM_AX. To map KBDKANA (the Kana modifier key) to Left Alt, I had redefined scan code T38 as VK_OEM_PA1, found in kbd.h. Now on Sun Aug 09, 2015, 23:16 (yesterday in the evening) I found in "WINUSER.H" (named in capitals) that VK_OEM_AX is far more obvious, as its default scan code stands between two that are actually used on the default French keyboard and are mapped to VK_OEM_8 and VK_OEM_102. I believe that the presence of the less usual VK_OEM_PA1 (part of "Nokia/Ericsson definitions") made Windows more sensitive. I'm happy that this issue is so appeasingly and gratifyingly resolved. I present my apologies for the trouble it has made, as well as my thanks for the many pieces of information it has brought up. Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Mon Aug 10 06:10:34 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 10 Aug 2015 13:10:34 +0200 (CEST) Subject: Implementing SMP on a UTF-16 OS (was: Windows keyboard limitations) Message-ID: <1473422952.7279.1439205034095.JavaMail.www@wwinf1m23> On 08 Aug 2015, at 16:39, Eli Zaretskii wrote: > The Windows WCHAR is a 16-bit data type. What Windows documentation > calls "Unicode characters" are Unicode codepoints encoded in UTF-16. I turn out to be unable to use Unicode codepoints encoded in UTF-16 in the C source of the keyboard driver. When in the static ALLOC_SECTION_LDATA VK_TO_WCHARS9 aVkToWch9[], I use [...],(0xd835,0xdcea),(0xd835,0xdcd0),[...]
I get the error "initializer is not a constant", and when I simply use [...],0xd835 0xdcea,0xd835 0xdcd0,[...] I get the syntax error "constant" with a cascade of comma errors. I note that the MSKLC converts any SMP character mapped on a key to a ligature of a surrogate pair, and that it cannot admit any SMP character in a dead list. Such an MSKLC layout with U+1D4EA and U+1D4D0 works on the built-in Notepad, while Word displays .notdef boxes that convert to code points and then to glyphs using Alt+C twice. About LibreOffice and Notepad++, they are unable to display these characters even when pasted from Word or Notepad. Please don't dismiss this issue towards other mailing lists or fora, because on such topics it is very hard out there to get any useful answer. And please don't lock it out of the Unicode Mailing List, because it's a Unicode implementation topic. Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Mon Aug 10 06:13:20 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 10 Aug 2015 13:13:20 +0200 (CEST) Subject: The role of documentation in implementation (was: Re: Windows keyboard restrictions) Message-ID: <1784631313.7359.1439205200198.JavaMail.www@wwinf1m23> I confused the parent thread labelling. Please read: The role of documentation in implementation (was: Re: Windows keyboard restrictions) On 08 Aug 2015, at 15:01, Richard Wordingham wrote: > On Sat, 8 Aug 2015 14:05:17 +0200 (CEST) > Marcel Schneider wrote: > > > 2. Supposed that Windows supported more than four characters per > > ligature: > > > 2.1. Why has the MSKLC been limited to four characters per > > ligature? > > Because that was believed to be the architectural limit. Note however, > that it isn't 4 *characters* that is the limit, but 4 UTF-16 code units.
Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Mon Aug 10 10:02:44 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 10 Aug 2015 17:02:44 +0200 (CEST) Subject: The role of documentation in implementation (was: Re: Windows keyboard restrictions) Message-ID: <336224144.13348.1439218964690.JavaMail.www@wwinf1m19> I'm brought to draw your attention to the fact that presumably my "buggy" mailbox inverted the order of the Copy Addressees, which I find reversed in both mailboxes where I can follow (but not answer in both, under my name associated with a fitting custom mail address). This order is mainly determined by the intervention chronology.
It may not be important, but formally I'm expected to stick with it. Further, as I was very short of time, I forgot to check that I had added everybody. My apologies. It may be customary to put a person's name and address in Copy instead of as Main Addressee, in order not to seem to urge him to reply, as a more informal sending. Please retrieve the list as I ordered it, taken from my outbox and completed: "Doug Ewell"; "Richard Wordingham"; "Julian Bradfield"; "Andrew Glass (WINDOWS)"; "Andrew Cunningham"; "Eli Zaretskii"; "Asmus Freytag (t)"; "Marc Durdin" On 08 Aug 2015, at 15:01, Richard Wordingham wrote: > On Sat, 8 Aug 2015 14:05:17 +0200 (CEST) > Marcel Schneider wrote: > > > 2. Supposed that Windows supported more than four characters per > > ligature: > > > 2.1. Why has the MSKLC been limited to four characters per > > ligature? > > Because that was believed to be the architectural limit. Note however, > that it isn't 4 *characters* that is the limit, but 4 UTF-16 code units.
Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From petercon at microsoft.com Mon Aug 10 11:38:56 2015 From: petercon at microsoft.com (Peter Constable) Date: Mon, 10 Aug 2015 16:38:56 +0000 Subject: bang mail Message-ID: I don't think it's helpful or even polite to send bang (high priority) mail to this list. Cheers, Peter -------------- next part -------------- An HTML attachment was scrubbed... URL: From petercon at microsoft.com Mon Aug 10 11:51:01 2015 From: petercon at microsoft.com (Peter Constable) Date: Mon, 10 Aug 2015 16:51:01 +0000 Subject: Standardised Encoding of Text In-Reply-To: <20150809193845.5931b6cf@JRWUBU2> References: <20150809120919.1adacf7c@JRWUBU2> <000001d0d299$0929fbc0$1b7df340$@fi> <20150809145811.3d1296d2@JRWUBU2> <20150809193845.5931b6cf@JRWUBU2> Message-ID: Richard, you can always submit a document to UTC with proposed text to add to the Tai Tham block description in a future version. Peter -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Richard Wordingham Sent: Sunday, August 9, 2015 11:39 AM To: Unicode Public Subject: Re: Standardised Encoding of Text On Sun, 9 Aug 2015 17:10:01 +0200 Mark Davis wrote: > While it would be good to document more scripts, and more language > options per script, that is always subject to getting experts signed > up to develop them.
> > What I'd really like to see instead of documentation is a data-based > approach. > > For example, perhaps the addition of real data to CLDR for a > "basic-validity-check" on a language-by-language basis. One aspect this would not help with is with letter forms that do not resemble their forms in the code charts. The code charts usually broadly answer the question "What does this code represent?". They don't answer the question, "What code points represent this glyph?". Problems I've seen in Tai Tham are the use of U+1A57 TAI THAM CONSONANT SIGN LA TANG LAI for the sequence and of for . The problem is that the subscript forms for U+1A43 and U+1A3F are only documented in the proposals. The subscript consonant signs probably add to the confusion of anyone working from the code chart. The people making the errors were far from ignorant of the script. Richard. From petercon at microsoft.com Mon Aug 10 11:52:56 2015 From: petercon at microsoft.com (Peter Constable) Date: Mon, 10 Aug 2015 16:52:56 +0000 Subject: bang mail In-Reply-To: References: Message-ID: Possible exception: you've sent mail with a URL that points to something you learned was malicious and want to advise people not to click on that link. From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Peter Constable Sent: Monday, August 10, 2015 9:39 AM To: unicode at unicode.org Subject: bang mail I don't think it's helpful or even polite to send bang (high priority) mail to this list. Cheers, Peter -------------- next part -------------- An HTML attachment was scrubbed... URL: From dzo at bisharat.net Mon Aug 10 12:00:18 2015 From: dzo at bisharat.net (dzo at bisharat.net) Date: Mon, 10 Aug 2015 17:00:18 +0000 Subject: bang mail In-Reply-To: References: Message-ID: <1855182813-1439226019-cardhu_decombobulator_blackberry.rim.net-1709997657-@b14.c2.bise6.blackberry> Agreed. Thank you, Peter. 
Basic list netiquette IMO (though it seems some people are passing on the "high importance" tags inadvertently when replying). Sent via BlackBerry by AT&T -----Original Message----- From: Peter Constable Sender: "Unicode" Date: Mon, 10 Aug 2015 16:38:56 To: unicode at unicode.org Subject: bang mail I don't think it's helpful or even polite to send bang (high priority) mail to this list. Cheers, Peter -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Mon Aug 10 12:08:10 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 10 Aug 2015 19:08:10 +0200 (CEST) Subject: Implementing SMP on a UTF-16 OS (was: Windows keyboard restrictions) Message-ID: <1348099338.17889.1439226490802.JavaMail.www@wwinf1m19> On 10 Aug 2015, at 13:21, I wrote: > I note that the MSKLC converts any SMP character mapped on a key to a ligature of a surrogate pair, and that it cannot admit any SMP character in a dead list. > Such an MSKLC layout with U+1D4EA and U+1D4D0 works on the built-in Notepad, while Word displays .notdef boxes that convert to code points and then to glyphs using Alt+C twice. I should mention too that my OS is Windows 7 Starter (same version, build number and service pack as Windows Seven). Word is Word Starter 2010. Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Mon Aug 10 12:16:51 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 10 Aug 2015 19:16:51 +0200 (CEST) Subject: bang mail In-Reply-To: References: Message-ID: <898142779.18218.1439227011781.JavaMail.www@wwinf1m19> On 10 Aug 2015, at 18:48, Peter Constable wrote: > I don't think it's helpful or even polite to send bang (high priority) mail to this list. Being the one who did, I apologize (once more) for this pushiness. Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed...
URL: From Andrew.Glass at microsoft.com Mon Aug 10 12:58:24 2015 From: Andrew.Glass at microsoft.com (Andrew Glass (WINDOWS)) Date: Mon, 10 Aug 2015 17:58:24 +0000 Subject: ZWJ as a Ligature Suppressor In-Reply-To: <20150809115820.67e0eead@JRWUBU2> References: <20150809115820.67e0eead@JRWUBU2> Message-ID: Hi Richard, To ligate or not to ligate is up to the font designer. Normally, GSUB lookups that perform ligation will be broken by the presence of ZWJ or ZWNJ. If a font designer wishes to ligate in the presence of a ZWJ or ZWNJ then they could choose to include appropriate glyph sequences in their ligation lookups. For example: glyphA glyphB -> glyphC glyphA ZWJ glyphB -> glyphC Cheers, Andrew Andrew Glass Ph.D. Program Manager Shell Text Input Group | Windows | Microsoft -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Richard Wordingham Sent: Sunday, August 9, 2015 3:58 AM To: unicode at unicode.org Subject: ZWJ as a Ligature Suppressor According to the text just after TUS 7.0.0 Figure 23-3 (http://www.unicode.org/versions/Unicode7.0.0/ch23.pdf#G25237), ZWJ suppresses ligatures in Arabic script. Does this rule apply to other normally cursive joined scripts, e.g. Syriac and Mongolian? Am I right in thinking that for an OpenType font for other scripts, the font writer must take precautions to prevent ZWJ accidentally suppressing ligatures that would be better suppressed by ZWNJ or ? 
From richard.wordingham at ntlworld.com Mon Aug 10 13:26:03 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 10 Aug 2015 19:26:03 +0100 Subject: ZWJ as a Ligature Suppressor In-Reply-To: References: <20150809115820.67e0eead@JRWUBU2> Message-ID: <20150810192603.5dafd0a7@JRWUBU2> On Mon, 10 Aug 2015 17:58:24 +0000 "Andrew Glass (WINDOWS)" wrote: I had asked: >> According to the text just after TUS 7.0.0 Figure 23-3 >> (http://www.unicode.org/versions/Unicode7.0.0/ch23.pdf#G25237), ZWJ >> suppresses ligatures in Arabic script. Does this rule apply to other >> normally cursive joined scripts, e.g. Syriac and Mongolian? > To ligate or not to ligate is up to the font designer. Normally, GSUB > lookups that perform ligation will be broken by the presence of ZWJ > or ZWNJ. If a font designer wishes to ligate in the presence of a ZWJ > or ZWNJ then they could choose to include appropriate glyph sequences > in their ligation lookups. For example: > glyphA glyphB -> glyphC > glyphA ZWJ glyphB -> glyphC So, any rule as to what ZWJ means is not implemented in the OpenType engine, but rather in the font. (As is the rule that 'a' does not look like 'b'.) For which scripts may a font designer defensibly omit the duplicate with ZWJ? The TUS says Arabic is one. Are there any others? Richard. From khaledhosny at eglug.org Mon Aug 10 14:00:41 2015 From: khaledhosny at eglug.org (Khaled Hosny) Date: Mon, 10 Aug 2015 21:00:41 +0200 Subject: ZWJ as a Ligature Suppressor In-Reply-To: References: <20150809115820.67e0eead@JRWUBU2> Message-ID: <20150810190041.GA29473@khaled-laptop> This is not always true, some rendering engines (like HarfBuzz) try to follow the Unicode rules so ZWJ does not break ligatures except in Arabic where the standard says it should be interpreted as . Regards, Khaled On Mon, Aug 10, 2015 at 05:58:24PM +0000, Andrew Glass (WINDOWS) wrote: > Hi Richard, > > To ligate or not to ligate is up to the font designer. 
Normally, GSUB lookups that perform ligation will be broken by the presence of ZWJ or ZWNJ. If a font designer wishes to ligate in the presence of a ZWJ or ZWNJ then they could choose to include appropriate glyph sequences in their ligation lookups. For example: > > glyphA glyphB -> glyphC > glyphA ZWJ glyphB -> glyphC > > Cheers, > > Andrew > > > Andrew Glass Ph.D. > Program Manager > Shell Text Input Group | Windows | Microsoft > > -----Original Message----- > From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Richard Wordingham > Sent: Sunday, August 9, 2015 3:58 AM > To: unicode at unicode.org > Subject: ZWJ as a Ligature Suppressor > > According to the text just after TUS 7.0.0 Figure 23-3 (http://www.unicode.org/versions/Unicode7.0.0/ch23.pdf#G25237), ZWJ suppresses ligatures in Arabic script. Does this rule apply to other normally cursive joined scripts, e.g. Syriac and Mongolian? > > Am I right in thinking that for an OpenType font for other scripts, the font writer must take precautions to prevent ZWJ accidentally suppressing ligatures that would be better suppressed by ZWNJ or ? From unicode at maxtruxa.com Mon Aug 10 14:08:55 2015 From: unicode at maxtruxa.com (Max Truxa) Date: Mon, 10 Aug 2015 21:08:55 +0200 Subject: Implementing SMP on a UTF-16 OS Message-ID: Hi Marcel, from what I can see in the short piece of code you posted, it looks like you are trying to somehow "group" the surrogate pairs (which does not make any sense to me). Correct syntax would be: [...] 0xD835, 0xDCEA, 0xD835, 0xDCD0, [...] IMO this mailing list is not the right place for questions about C syntax, is it not? 
Best regards, Max From charupdate at orange.fr Mon Aug 10 14:46:39 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 10 Aug 2015 21:46:39 +0200 (CEST) Subject: Implementing SMP on a UTF-16 OS (was: Windows keyboard restrictions) In-Reply-To: References: Message-ID: <2079848060.15640.1439235999068.JavaMail.www@wwinf1d19> Hi Max, On 10 Aug 2015, at 20:25, Max Truxa wrote: > IMO this mailing list is not the right place for questions about C syntax, is it not? Indeed it isn't. If it were not about Unicode implementation, I wouldn't have sent it to the Unicode Mailing List. Whatever the language, it's about getting SMP characters into the place where the OS stores the keyboard layout. I stick with the idea that this mustn't stay limited to the BMP. IMHO the ligatures made of a surrogate pair in MSKLC keyboards are a sort of workaround, while something isn't really fit for UTF-16. The idea wasn't to throttle the OS down to the BMP, was it? The SMP simply didn't exist yet. Now it does, and things get screwed up. > Correct syntax would be: [...] 0xD835, 0xDCEA, 0xD835, 0xDCD0, [...] The problem with the commas here is that they don't only separate, they increment the modification number. The trailing surrogate must stay together with the leading one on the same shift state. As I say, it's got screwed up. However, I'll try with just removing the parentheses, hoping that the surrogates will be automatically grouped together. By contrast, I have the good news that the test SMP keyboard layout works on Word 2013. When I press the key with U+1D4EA and U+1D4D0, the glyphs are directly inserted. So there's one less problem. Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed...
URL: From richard.wordingham at ntlworld.com Mon Aug 10 15:25:03 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 10 Aug 2015 21:25:03 +0100 Subject: Implementing SMP on a UTF-16 OS (was: Windows keyboard restrictions) In-Reply-To: <2079848060.15640.1439235999068.JavaMail.www@wwinf1d19> References: <2079848060.15640.1439235999068.JavaMail.www@wwinf1d19> Message-ID: <20150810212503.58cc24d0@JRWUBU2> On Mon, 10 Aug 2015 21:46:39 +0200 (CEST) Marcel Schneider wrote: > > Correct syntax would be: [...] 0xD835, 0xDCEA, 0xD835, 0xDCD0, [...] > The problem with the commas here is that they don't only separate, > they increment the modification number. The trailing surrogate must > stay together with the leading one on the same shift state. Non-BMP characters must be entered as 'ligatures'. > By contrast, I've the good news to bring in that the test SMP > keyboard layout works on Word 2013. When I press the key with U+1D4EA > and U+1D4D0, the glyphs are directly inserted. So there's one less > problem. Curiously, I think it would work for my cuneiform keyboard with Word 2002 if I hadn't chosen the only Mesopotamian locale available at the time, Iraqi Arabic. I think I'm suffering from Word being too clever by half - the font and keyboard kept changing when I claimed the text was left-to-right. Perhaps Word 2002 has received better patches. The first time I used it on Windows 7 some Thai was rendered with black boxes. I installed the extensions to handle OpenDocument, and the problems went away. If I change the font back to a Hittite font I have, the text appears as it should. I created the keyboard for doing Babylonian maths, so I don't think a Turkish locale would have been appropriate, despite my using a Hittite font. It would probably work better, though. Richard.
From richard.wordingham at ntlworld.com Mon Aug 10 15:25:20 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 10 Aug 2015 21:25:20 +0100 Subject: Implementing SMP on a UTF-16 OS In-Reply-To: References: Message-ID: <20150810212520.53544f3e@JRWUBU2> On Mon, 10 Aug 2015 21:08:55 +0200 Max Truxa wrote: > Correct syntax would be: [...] 0xD835, 0xDCEA, 0xD835, 0xDCD0, [...] > IMO this mailing list is not the right place for questions about C > syntax, is it not? View it as a gripe about UTF-16 and how confusing things get when code units are referred to as characters. From charupdate at orange.fr Mon Aug 10 15:53:11 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 10 Aug 2015 22:53:11 +0200 (CEST) Subject: Implementing SMP on a UTF-16 OS In-Reply-To: <20150810212503.58cc24d0@JRWUBU2> References: <2079848060.15640.1439235999068.JavaMail.www@wwinf1d19> <20150810212503.58cc24d0@JRWUBU2> Message-ID: <1761086697.16696.1439239991108.JavaMail.www@wwinf1d19> On Mon, 10 Aug 2015, at 22:33, Richard Wordingham wrote: > Non-BMP characters must be entered as 'ligatures'. This is bad news for a universal Latin keyboard layout, where a number of SMP characters should be available through dead keys, or Compose. We can implement Compose as a dead key chaining tree, but it seems to be limited to the BMP. The mathematical letters are part of the symbols, and it would be handy to get them too with dead keys, as Compose, &, &, for the script alphabet. But the deadtrans combined character argument must be one code unit, not one character. So there seems to be no place for SMP. This is clearly a Unicode implementation problem. C and C++ should be standardized for handling of UTF-16. IMO we cannot consider that Windows supports UTF-16 for internal use, if it does not support surrogate pairs except with workarounds using ligatures. I may be wrong, but that's how I see the problem now. 
Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Mon Aug 10 15:58:32 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 10 Aug 2015 21:58:32 +0100 Subject: Implementing SMP on a UTF-16 OS In-Reply-To: <1761086697.16696.1439239991108.JavaMail.www@wwinf1d19> References: <2079848060.15640.1439235999068.JavaMail.www@wwinf1d19> <20150810212503.58cc24d0@JRWUBU2> <1761086697.16696.1439239991108.JavaMail.www@wwinf1d19> Message-ID: <20150810215832.5c3246cf@JRWUBU2> On Mon, 10 Aug 2015 22:53:11 +0200 (CEST) Marcel Schneider wrote: > On Mon, 10 Aug 2015, at 22:33, Richard Wordingham wrote: > > Non-BMP characters must be entered as 'ligatures'. > This is clearly a Unicode implementation problem. C and C++ should be > standardized for handling of UTF-16. IMO we cannot consider that > Windows supports UTF-16 for internal use, if it does not support > surrogate pairs except with workarounds using ligatures. Perhaps this is why Windows offers a new method of keyboard mapping, via the Text Services Framework (TSF). > I may be wrong, but that's how I see the problem now. I think you're not looking hard enough. Richard. From unicode at maxtruxa.com Tue Aug 11 01:27:09 2015 From: unicode at maxtruxa.com (Max Truxa) Date: Tue, 11 Aug 2015 08:27:09 +0200 Subject: Implementing SMP on a UTF-16 OS In-Reply-To: <1761086697.16696.1439239991108.JavaMail.www@wwinf1d19> References: <2079848060.15640.1439235999068.JavaMail.www@wwinf1d19> <20150810212503.58cc24d0@JRWUBU2> <1761086697.16696.1439239991108.JavaMail.www@wwinf1d19> Message-ID: On Aug 10, 2015 10:53 PM, "Marcel Schneider" wrote: > > This is clearly a Unicode implementation problem. C and C++ should be standardized for handling of UTF-16. IMO we cannot consider that Windows supports UTF-16 for internal use, if it does not support surrogate pairs except with workarounds using ligatures. 
C and C++ *are* "standardized for handling of UTF-16"... and UTF-8... and UTF-32. If you are interested in this topic just search for "C++ Unicode string literals" and "C++ Unicode character literals" which are standardized since C11/C++11 (with the exception of UTF-8 character literals which will follow in C++11; don't know about C though). The reason you won't be able to easily use these features is because the compiler shipping with the WDK is still only supporting C89/C90. And sadly for us driver developers Microsoft will not change this. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at maxtruxa.com Tue Aug 11 01:29:59 2015 From: unicode at maxtruxa.com (Max Truxa) Date: Tue, 11 Aug 2015 08:29:59 +0200 Subject: Implementing SMP on a UTF-16 OS In-Reply-To: References: <2079848060.15640.1439235999068.JavaMail.www@wwinf1d19> <20150810212503.58cc24d0@JRWUBU2> <1761086697.16696.1439239991108.JavaMail.www@wwinf1d19> Message-ID: On Aug 11, 2015 8:27 AM, "Max Truxa" wrote: > > (with the exception of UTF-8 character literals which will follow in C++11; don't know about C though). Sorry for that typo. UTF-8 character literals will follow in C++17. -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Tue Aug 11 02:35:38 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Tue, 11 Aug 2015 09:35:38 +0200 Subject: Bogus glyphs for halfwidth characters Message-ID: For halfwidth characters like U+FF70 HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK I'm getting bogus glyphs from Kaiti* and Songti* (and STKaiti, STSongti). See screenshot below. Anyone else see that, or know what is happening? Unfortunately, these fonts are getting picked up first in the fallback chain for my browser, so they are pretty apparent! [image: Inline image 1] Mark -------------- next part -------------- An HTML attachment was scrubbed... 
URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Screen Shot 2015-08-11 at 09.32.26.png Type: image/png Size: 27906 bytes Desc: not available URL: From A.Schappo at lboro.ac.uk Tue Aug 11 05:07:13 2015 From: A.Schappo at lboro.ac.uk (Andre Schappo) Date: Tue, 11 Aug 2015 10:07:13 +0000 Subject: Bogus glyphs for halfwidth characters In-Reply-To: References: Message-ID: <71A7E9CE-3BC3-4B75-98C4-8CD672C8647A@lboro.ac.uk> Yes. Ditto. Mac OSX 10.10.4 Broken CMAPs? André Schappo On 11 Aug 2015, at 08:35, Mark Davis ☕️ wrote: For halfwidth characters like U+FF70 HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK I'm getting bogus glyphs from Kaiti* and Songti* (and STKaiti, STSongti). See screenshot below. Anyone else see that, or know what is happening? Unfortunately, these fonts are getting picked up first in the fallback chain for my browser, so they are pretty apparent! Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From A.Schappo at lboro.ac.uk Tue Aug 11 11:00:24 2015 From: A.Schappo at lboro.ac.uk (Andre Schappo) Date: Tue, 11 Aug 2015 16:00:24 +0000 Subject: Bogus glyphs for halfwidth characters In-Reply-To: References: Message-ID: The bug is consistent. All the below fonts are by Changzhou SinoType Technology and U+FF70 is at font glyph 147 André Schappo On 11 Aug 2015, at 08:35, Mark Davis ☕️ wrote: For halfwidth characters like U+FF70 HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK I'm getting bogus glyphs from Kaiti* and Songti* (and STKaiti, STSongti). See screenshot below. Anyone else see that, or know what is happening? Unfortunately, these fonts are getting picked up first in the fallback chain for my browser, so they are pretty apparent! Mark -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From tom at bluesky.org Tue Aug 11 11:10:35 2015 From: tom at bluesky.org (Tom Gewecke) Date: Tue, 11 Aug 2015 12:10:35 -0400 Subject: Bogus glyphs for halfwidth characters In-Reply-To: References: Message-ID: <08FF27D2-971B-4A40-8D1E-F6DA202AB8E8@bluesky.org> It looks like glyphs 132-194 are all mislabeled as halfwidth katakana-hiragana. On Aug 11, 2015, at 12:00 PM, Andre Schappo wrote: > > The bug is consistent. All the below fonts are by Changzhou SinoType Technology and U+FF70 is at font glyph 147 > > André Schappo > > On 11 Aug 2015, at 08:35, Mark Davis ☕️ wrote: > >> For halfwidth characters like U+FF70 HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK I'm getting bogus glyphs from Kaiti* and Songti* (and STKaiti, STSongti). See screenshot below. >> >> Anyone else see that, or know what is happening? Unfortunately, these fonts are getting picked up first in the fallback chain for my browser, so they are pretty apparent! >> >> >> >> Mark >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Tue Aug 11 13:49:08 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Tue, 11 Aug 2015 20:49:08 +0200 (CEST) Subject: bang mail In-Reply-To: <000001d0d3a6$696939c0$3c3bad40$@fi> References: <898142779.18218.1439227011781.JavaMail.www@wwinf1m19> <000001d0d3a6$696939c0$3c3bad40$@fi> Message-ID: <1160363707.13272.1439318948241.JavaMail.www@wwinf1f27> I should not bring more explanations about a behavior of mine that has been identified as inappropriate. However, in this particular circumstance, I'd like to outline briefly why I banged my e-mails: 1 - I observed that at normal priority, an e-mail takes much more time to be received. 2 - As I often take much time and pains to write accurate e-mails, I looked for a way to get the addressee to at least take a glance among the mass of e-mails that is said to be constantly received. 
3 - When I'd e-mailed the List while forgetting to set the bang, I thought, "argh, I hope the addressees won't notice that I made a difference in treatment." Thanks to the Unicode Mailing List, I have now learned that an e-mail is better viewed when the prioritization tool hasn't been used. That's very useful, and I'd like to personally thank Peter for having taken the initiative of warning me, as well as dzo and Erkki for having answered in this thread. Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Tue Aug 11 13:51:33 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Tue, 11 Aug 2015 20:51:33 +0200 (CEST) Subject: The role of documentation in implementation (was: Re: Windows keyboard restrictions) In-Reply-To: <336224144.13348.1439218964690.JavaMail.www@wwinf1m19> References: <336224144.13348.1439218964690.JavaMail.www@wwinf1m19> Message-ID: <1679568348.13334.1439319093909.JavaMail.www@wwinf1f27> I've got a driver with a five-code-unit ligature on Shift+Ctrl+Alt, and with which Word (and Excel) opened. As I was in a hurry and wrote in English, I didn't notice that the dead keys were disabled. That was the driver I was writing about when I spun off this thread. Now I've compiled a driver where the only difference is that two complete lines (the ones with the ellipses) are swapped, at a place that isn't sort-sensitive. Word and Excel are blocked. I tried the driver above again: Firefox and Zotero are blocked. It's very hard for me to write this to the Mailing List, but honestly I must admit that, not having enough time, I didn't work properly. All that, and some more, leads me to the conclusion that when Windows was built, there was often not enough time to write up the documentation; or there was a fear that such documentation could be copied and carried away. So the teams were told not to waste any time on it. 
These are suppositions, but as nobody at Microsoft may disclose any information about how things were done (for the same reason that there's little documentation), we're reduced to building our own views, to get at least some working idea. So now I believe that when Michael Kaplan did his own tests, he found out that there's a problem when he put on Shift+Ctrl+Alt a ligature that exceeded four code units, and that he asked some colleagues but nobody knew anything about it, so he remembered the header file he had seen (but that perhaps he couldn't find again because it hadn't been documented). Really, when 16 units work on all shift states except one, an official keyboard layout tool must equalize the limit at the lowest level. If a user read that he could put up to 16 on all shift states except on Shift+AltGr, where he could put up to 4, he would get a curious feeling. It's like Liebig's rule: the lowest level determines the overall limit. But when developing ready-to-use keyboard layouts with the WDK, as Michael Kaplan seems to suggest in the MSKLC glossary, we aren't bound to stick with the safe limit and may feel free to place as many units as we find actually working. Well, when I place fewer than five on Shift+Ctrl+Alt, I'm not forced to divulge that more wouldn't work there. I'm not meant to write up the documentation Microsoft didn't :-) Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmus-inc at ix.netcom.com Tue Aug 11 14:08:48 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Tue, 11 Aug 2015 12:08:48 -0700 Subject: The role of documentation in implementation (was: Re: Windows keyboard restrictions) In-Reply-To: <1679568348.13334.1439319093909.JavaMail.www@wwinf1f27> References: <336224144.13348.1439218964690.JavaMail.www@wwinf1m19> <1679568348.13334.1439319093909.JavaMail.www@wwinf1f27> Message-ID: <55CA4840.8090109@ix.netcom.com> An HTML attachment was scrubbed... 
URL: From charupdate at orange.fr Tue Aug 11 14:27:27 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Tue, 11 Aug 2015 21:27:27 +0200 (CEST) Subject: Implementing SMP on a UTF-16 OS Message-ID: <362174827.4688.1439321247508.JavaMail.www@wwinf1j19> On 10 Aug 2015, at 21:45, Max Truxa wrote: > from what I can see in the short piece of code you posted, it looks > like you are trying to somehow "group" the surrogate pairs (which does > not make any sense to me). > Correct syntax would be: [...] 0xD835, 0xDCEA, 0xD835, 0xDCD0, [...] On 10 Aug 2015, at 23:06, Richard Wordingham wrote: > On Mon, 10 Aug 2015 22:53:11 +0200 (CEST) > Marcel Schneider wrote: > > > On Mon, 10 Aug 2015, at 22:33, Richard Wordingham wrote: > > > > Non-BMP characters must be entered as 'ligatures'. > > > This is clearly a Unicode implementation problem. C and C++ should be > > standardized for handling of UTF-16. IMO we cannot consider that > > Windows supports UTF-16 for internal use, if it does not support > > surrogate pairs except with workarounds using ligatures. > > Perhaps this is why Windows offers a new method of keyboard > mapping, via the Text Services Framework (TSF). > > > I may be wrong, but that's how I see the problem now. > > I think you're not looking hard enough. I've tried to just remove the parentheses and let the string. This was compiled, but the keyboard test showed that in the keyboard driver DLL, UTF-16 strings with SMP characters aren't handled as such. Each surrogate code unit is considered as a single character even when it's followed by a trailing one. Only the code unit corresponding to the shift state (modification number) is taken, no matter if it's only a surrogate and the other half comes next. Windows can handle 32-bit code units. I found evidence in C:\WinDDK\7600.16385.1\inc\api\functiondiscoverykeys.h. 
So I tried this in the driver source: {'A' /*T10 D01*/ ,0x01 ,'a' ,'A' ,NONE ,0xd835dcea ,0xd835dcd0 ,0x00e6 ,0x00c6 ,NONE ,NONE }, // ,0x0061 ,0x0041 But the compiler returned: warning C4305: 'initializing' : truncation from 'unsigned int' to 'WCHAR' and: error C2220: warning treated as error - no 'object' file generated I understand that the compiler correctly read the first of the 32-bit integers, but as it expected a WCHAR here, it dropped 16 bits and wouldn't go on. On 11 Aug 2015, at 8:27, Max Truxa wrote [corrected typo following your next e-mail]: > On Aug 10, 2015 10:53 PM, "Marcel Schneider" wrote: > > > > This is clearly a Unicode implementation problem. C and C++ should be standardized for handling of UTF-16. IMO we cannot consider that Windows supports UTF-16 for internal use, if it does not support surrogate pairs except with workarounds using ligatures. > C and C++ *are* "standardized for handling of UTF-16"... and UTF-8... and UTF-32. > If you are interested in this topic just search for "C++ Unicode string literals" and "C++ Unicode character literals" which are standardized since C11/C++11 (with the exception of UTF-8 character literals which will follow in C++17; don't know about C though). > The reason you won't be able to easily use these features is because the compiler shipping with the WDK is still only supporting C89/C90. And sadly for us driver developers Microsoft will not change this. Is this the reason why a Unicode character cannot be represented alternatively as a 32-bit integer on Windows? Being UTF-16, the OS could handle a complete surrogate pair in one single 32-bit integer. Couldn't this be performed at driver level by modifying a program and updating this when the driver is installed? If yes, we must modify the interface so that keyboard driver DLLs are really read in UTF-16. And/or we must find another compiler. Must the Windows driver be compiled by a Microsoft compiler? 
Meanwhile, the only workaround I see for getting SMP characters in the deadtrans list is that these must be programmed on two entries, so that a user must type for example Compose, &, &, &, A, 1, and then Compose, &, &, &, A, 2, to get 𝓐 (bold script, when normal script is with two ampersands, and 'with curl', one ampersand). (Instead of 1 and 2 we can also choose l for leading, and t for trailing.) Normally a user should be able to get this letter with five keystrokes, not ten. In Word we've already got an autocorrect for script letters (??,???), so we should add another series for bold script (which is bolder than 'bold' 'script'). But as that works in Office, not in Notepad and elsewhere, a keyboard driver or TSF based solution is preferable, also because typing \ s c r i p t a Space Backspace is already ten keystrokes, too! (A trailing backslash would save one.) Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Tue Aug 11 16:06:51 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 11 Aug 2015 22:06:51 +0100 Subject: Implementing SMP on a UTF-16 OS In-Reply-To: <362174827.4688.1439321247508.JavaMail.www@wwinf1j19> References: <362174827.4688.1439321247508.JavaMail.www@wwinf1j19> Message-ID: <20150811220651.7ed376c1@JRWUBU2> On Tue, 11 Aug 2015 21:27:27 +0200 (CEST) Marcel Schneider wrote: > I've tried to just remove the parentheses and let the string. This > was compiled, but the keyboard test showed that in the keyboard > driver DLL, UTF-16 strings with SMP characters aren't handled as > such. Each surrogate code unit is considered as a single character > even when it's followed by a trailing one. Only the code unit > corresponding to the shift state (modification number) is taken, no > matter if it's only a surrogate and the other half comes next. This is exactly what one should expect. 
The data is an array of UTF-16 code units rather than a UTF-16 string. Moreover, it was probably written as UCS-2. I believe it is the application that has the job of stitching the surrogate pairs together. > Is this the reason why a Unicode character cannot be represented > alternatively as a 32-bit integer on Windows? They are, from time to time. There's a Windows message that delivers a supplementary character rather than a UTF-16 code unit, and fairly obviously they have to be handled as such when performing font lookups. I've a suspicion that this message hit an interoperability problem. A program that can handle pairs of surrogates but predates the message will not work with the more recent message. Therefore using the message type is deferred until applications can handle it. Therefore applications don't need to handle it, and don't. Therefore the message type doesn't get used. > Being UTF-16, the OS > could handle a complete surrogate pair in one single 32-bit integer. > Couldn't this be performed at driver level by modifying a program and > updating this when the driver is installed? You're really talking about a parallel set of routines. I suspect the answer is that Microsoft don't want to work on extending a primitive keyboarding system when TSF is available. You want to use dead keys. Why? Is it not that they are the only mechanism you have experience of? Better systems can be built, in which one sees what one is doing. Is it not much better to type 'e' and then a circumflex, and see the 'e' and then the 'e' with a circumflex? Dead keys are an imitation of a limitation of typewriter technology. If I were typing cuneiform, I'd much rather type 'bi⟨COMMIT⟩' and see the growing sequence 'b', 'bi', '⟨CUNEIFORM SIGN BI⟩' as I typed. (What you have for a ⟨COMMIT⟩ key is your choice.) TSF lets one do this. A simple extension of the keyboard definition DLLs generated by MSKLC does not. What you should be pressing for is a usable tutorial on how to do this in TSF. 
> If yes, we must modify the interface so that keyboard driver DLLs are > really read in UTF-16. And/or we must find another compiler. > > Must the Windows driver be compiled by a Microsoft compiler? The compiler is not the issue. The point is that the 16-bit code exists, and programs that use the 16-bit API exist. Language upgrades may make supplementary characters easier to use in programs, but that is all. They don't change existing binary interfaces. Richard. From marc at keyman.com Tue Aug 11 16:47:26 2015 From: marc at keyman.com (Marc Durdin) Date: Tue, 11 Aug 2015 21:47:26 +0000 Subject: Michael Kaplan leaves Microsoft Message-ID: <1CEDD746887FFF4B834688E7AF5FDA5A821BF3CA@federation.tavultesoft.local> This is a little off topic, but I would just like to pay some respect to Michael who has been on disability leave from Microsoft for some time and has just been RIF'ed (see his blog titled "Something Happened" http://www.siao2.com/2015/08/11/8770668856267197009.aspx). Thank you to Michael for your tireless work over many years in Unicode and i18n work in Windows, for writing MSKLC and for the vast array of knowledge you collected and shared in your blog over many years. I wish you all the best in the future. Marc CEO, Tavultesoft Pty Ltd Keyman: Type to the world in your language PO Box 550 Sandy Bay TAS 7006 AUSTRALIA ph: +61 3 6225 1665 mobile: +61 400 737 106 fax: +61 3 9923 6047 email: marc at keyman.com web: keyman.com skype: mcdurdin twitter: @MarcDurdin (personal) @keyman facebook: keymanapp google+: keymanapp -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From charupdate at orange.fr Wed Aug 12 03:09:36 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Wed, 12 Aug 2015 10:09:36 +0200 (CEST) Subject: Michael Kaplan leaves Microsoft In-Reply-To: <1CEDD746887FFF4B834688E7AF5FDA5A821BF3CA@federation.tavultesoft.local> References: <1CEDD746887FFF4B834688E7AF5FDA5A821BF3CA@federation.tavultesoft.local> Message-ID: <177879370.5993.1439366976157.JavaMail.www@wwinf1c20> I'm very sad. This news hits hard. I hasten (having e-mailed what I have) to join my thanks and wishes to Marc's. Thank you, Marc, for having brought it to us. No, I feel this isn't off topic :-(| Marcel On 11 Aug 2015, at 23:56, Marc Durdin wrote: This is a little off topic, I would just like to pay some respect to Michael who has been on disability leave from Microsoft for some time and has just been RIF'ed (see his blog titled "Something Happened" http://www.siao2.com/2015/08/11/8770668856267197009.aspx). Thank you to Michael for your tireless work over many years in Unicode and i18n work in Windows, for writing MSKLC and for the vast array of knowledge you collected and shared in your blog over many years. I wish you all the best in the future. Marc CEO, Tavultesoft Pty Ltd > Keyman: Type to the world in your language > PO Box 550 Sandy Bay TAS 7006 AUSTRALIA > ph: +61 3 6225 1665 mobile: +61 400 737 106 fax: +61 3 9923 6047 > email: marc at keyman.com web: keyman.com skype: mcdurdin > twitter: @MarcDurdin (personal) @keyman > facebook: keymanapp google+: keymanapp -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From Andrew.Glass at microsoft.com Wed Aug 12 11:54:26 2015 From: Andrew.Glass at microsoft.com (Andrew Glass (WINDOWS)) Date: Wed, 12 Aug 2015 16:54:26 +0000 Subject: ZWJ as a Ligature Suppressor In-Reply-To: <20150810192603.5dafd0a7@JRWUBU2> References: <20150809115820.67e0eead@JRWUBU2> <20150810192603.5dafd0a7@JRWUBU2> Message-ID: [Speaking for Uniscribe] >So, any rule as to what ZWJ means is not implemented in the OpenType engine, but rather in the font. (As is the rule that 'a' does not look like 'b'.) Our Arabic and Universal Shaping Engines understand that ZWJ invokes a joining form for joining scripts. Ligation is handled by the font. The presence of ZWJ invokes the joining forms, but since it is not included in the ligature lookup, the ligature does not form. The ZWNJ does not invoke a joining form. Thus we can achieve the forms illustrated in Figure 23-3. The fi case is different because Latin is not a joining script. Furthermore, the ligated form ﬁ, when supported, is usually a discretionary ligature. Therefore to achieve the Latin forms in 23-3, I would attach a lookup for the fi substitution to , and specify a substitution that includes ZWJ under . As Chapter 23 states (TUS 7.0, p. 804), there is no way to request a discretionary ligature in plain text for Arabic (and other joining scripts). >For which scripts may a font designer defensibly omit the duplicate with ZWJ? The TUS says Arabic is one. Are there any others? In general I would say that a designer can omit the lookup with ZWJ for joining scripts that include ligated forms. Our Mongolian font has ligatures that behave in the same way as Arabic, in that they can be blocked, but the components still join. Our Syriac fonts have lookups but these are cosmetic and don't produce a visually distinct ligated form, so the impact of ZWJ is negligible, but effectively still the same as Arabic. Our other joining scripts don't have ligatures. 
For Indic scripts see TUS (7.0) Figure 12.7: http://www.unicode.org/versions/Unicode7.0.0/ch12.pdf Cheers, Andrew From petercon at microsoft.com Wed Aug 12 11:55:51 2015 From: petercon at microsoft.com (Peter Constable) Date: Wed, 12 Aug 2015 16:55:51 +0000 Subject: Implementing SMP on a UTF-16 OS In-Reply-To: References: <2079848060.15640.1439235999068.JavaMail.www@wwinf1d19> <20150810212503.58cc24d0@JRWUBU2> <1761086697.16696.1439239991108.JavaMail.www@wwinf1d19> Message-ID: I'm no expert on driver development, but Max's comments got me curious. "Windows Driver Kit (WDK) 10 is integrated with Microsoft Visual Studio 2015..." https://msdn.microsoft.com/en-us/library/windows/hardware/ff557573(v=vs.85).aspx "In Visual Studio 2015, the C++ compiler and standard library have been updated with enhanced support for C++11 and initial support for certain C++14 features. They also include preliminary support for certain features expected to be in the C++17 standard." https://msdn.microsoft.com/en-us/library/hh409293.aspx From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Max Truxa Sent: Monday, August 10, 2015 11:27 PM To: Marcel Schneider Cc: Unicode Mailing List Subject: Re: Implementing SMP on a UTF-16 OS On Aug 10, 2015 10:53 PM, "Marcel Schneider" > wrote: > > This is clearly a Unicode implementation problem. C and C++ should be standardized for handling of UTF-16. IMO we cannot consider that Windows supports UTF-16 for internal use, if it does not support surrogate pairs except with workarounds using ligatures. C and C++ *are* "standardized for handling of UTF-16"... and UTF-8... and UTF-32. If you are interested in this topic just search for "C++ Unicode string literals" and "C++ Unicode character literals" which are standardized since C11/C++11 (with the exception of UTF-8 character literals which will follow in C++11; don't know about C though). 
The reason you won't be able to easily use these features is because the compiler shipping with the WDK is still only supporting C89/C90. And sadly for us driver developers Microsoft will not change this. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at maxtruxa.com Thu Aug 13 01:54:21 2015 From: unicode at maxtruxa.com (Max Truxa) Date: Thu, 13 Aug 2015 08:54:21 +0200 Subject: Implementing SMP on a UTF-16 OS In-Reply-To: References: <2079848060.15640.1439235999068.JavaMail.www@wwinf1d19> <20150810212503.58cc24d0@JRWUBU2> <1761086697.16696.1439239991108.JavaMail.www@wwinf1d19> Message-ID: On Aug 12, 2015 6:55 PM, "Peter Constable" wrote: > > > ?In Visual Studio 2015, the C++ compiler and standard library have been updated with enhanced support for C++11 and initial support for certain C++14 features. They also include preliminary support for certain features expected to be in the C++17 standard.? > You are right. I have to admit that my statement was not 100% correct. Traditionally drivers for Windows are built using C (not C++). Part of the reason for this is that Microsoft did not officially support C++ in kernel code up to the WDK 8. Nowadays there is the /kernel switch which enables a subset of C++ which Microsoft considers safe to use in kernel mode. The most recent C standard fully supported is still C89 though (plus a few C99 features that were added with VS2013). C99 support is *far* from being complete and I don't know of a single C11 feature being implemented. This means C++11 could be used in a driver but one would need to "convert" the driver to C++ (or at least those sources that make use of modern features). Anyhow, Marcel could certainly declare the mapping in a .cpp (using extern "C" to ensure interoperability with C code) but that wouldn't change that surrogate pairs seem to be unsupported for keyboard drivers. (Like I said I have no experience writing keyboard drivers so I can't confirm this.) 
From doug at ewellic.org Thu Aug 13 09:58:40 2015 From: doug at ewellic.org (Doug Ewell) Date: Thu, 13 Aug 2015 07:58:40 -0700 Subject: Update on flag tags (PRI #299)? Message-ID: <20150813075840.665a7a7059d7ee80bb4d670165c8327d.bbcb98170c.wbe@email03.secureserver.net> The recently posted minutes from UTC #144 include the following: > B.11.1.1.3 PRI 299 feedback and mailing list discussion [Edberg, > L2/15-210] > > Discussion. UTC took no action at this time. and: > [144-A93] Action Item for Rick McGowan: Close PRI #299, saying: All > feedback has been considered and will be part of the deliberations for > possible future extension mechanisms. This sounds like the entire flag-tag proposal described by PRI #299 has been put on indefinite hold. Is that accurate? -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From charupdate at orange.fr Thu Aug 13 10:07:50 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Thu, 13 Aug 2015 17:07:50 +0200 (CEST) Subject: Implementing SMP on a UTF-16 OS Message-ID: <916581674.16210.1439478470453.JavaMail.www@wwinf1c10> [Given the bad news we got, I kept this in my draft folder from 12 Aug 11:12 on.] On 11 Aug 2015, at 23:18, Richard Wordingham wrote [I've replaced < > with ???, as already I've got a disappearance and am not sure whether once <> converted, a second conversion won't happen]: > On Tue, 11 Aug 2015 21:27:27 +0200 (CEST) > Marcel Schneider wrote: > > > I've tried to just remove the parentheses and let the string. This > > was compiled, but the keyboard test showed that in the keyboard > > driver DLL, UTF-16 strings with SMP characters aren't handled as > > such. Each surrogate code unit is considered as a single character > > even when it's followed by a trailing one. Only the code unit > > corresponding to the shift state (modification number) is taken, no > > matter if it's only a surrogate and the other half comes next. > > This is exactly what one should expect. 
The data is an array of > UTF-16 code units rather than a UTF-16 string. Moreover, it was > probably written as UCS-2. I believe it is the application that has > the job of stitching the surrogate pairs together. > > > Is this the reason why a Unicode character cannot be represented > > alternatively as a 32-bit integer on Windows? > > They are, from time to time. There's a Windows message that delivers a > supplementary character rather than a UTF-16 code unit, and fairly obviously > they have to be handled as such when performing font lookups. I've a > suspicion that this message hit an interoperability problem. A program > that can handle pairs of surrogates but predates the message will not > work with the more recent message. Therefore using the message type is > deferred until applications can handle it. Therefore applications don't > need to handle it, and don't. Therefore the message type doesn't get > used. > > > Being UTF-16, the OS > > could handle a complete surrogate pair in one single 32-bit integer. > > Couldn't this be performed at driver level by modifying a program and > > updating this when the driver is installed? > > You're really talking about a parallel set of routines. I suspect the > answer is that Microsoft don't want to work on extending a primitive > keyboarding system when TSF is available. > > You want to use dead keys. Why? Is it not that they are the only > mechanism you have experience of? Yes, along with the allocation table and the ligatures (and modifier key mapping). Dead keys are the only way I found in the driver source to easily input precomposed characters and to work out a Compose functionality. Marc Durdin told us that for most languages, dead keys are not the best way for input. However, we're accustomed to them. About Compose, I found out that preceding diacritics are the only way to efficiently input multiply diacriticized precomposed letters.
When we use combining diacritics, the problem is where to place all the diacritics on a backwards compatible layout. The Compose key idea is to use punctuation keys to input diacritics. Basically we need to hit Compose once only, while generating combining marks out of punctuation needs at least one differentiating keystroke for each one. Given the limited number of keys, we can scarcely have more than one special dead key like Compose in the Base shift state. And as diacritical marks are so numerous that all keyboard punctuation together is not sufficient, we need sequences of punctuation for a number of less common diacritics. This brings the need for a triggering keystroke at the end. Most characters are therefore best input when diacritics come before the triggering letter. But that's my experience only; I wonder how it works on TSF. > > Better systems can be built, in which one sees what one is doing. I read that on Mac OS X, dead-key input, and the Compose functionality built from it, are accompanied by visual feedback, which shows what characters have already been typed. > Is it not much better to type 'e' and then a circumflex, and see the 'e' > and then the 'e' with a circumflex? Yes; in fact the precomposed characters are legacy characters from the beginning of Unicode on. The most up-to-date input of diacriticized characters is with use of combining diacritical marks. This directly produces the string that is generated by the canonical decomposition algorithms. However, on the internet, AFAIK, precomposed characters must be used for a web page to pass W3C validation. > Dead keys are an imitation of a limitation of typewriter technology. > If I was typing cuneiform, I'd much rather type 'bi<COMMIT>' and see > the growing sequence 'b', 'bi', '<CUNEIFORM SIGN BI>' as I typed. > (What you have for a <COMMIT> key is your choice.) TSF lets one do this. > A simple extension of the keyboard definition DLLs generated by MSKLC > does not.
What you should be pressing for is a usable tutorial on how > to do this in TSF. Agreed. I'll look for one. Marc does it all in TSF. But recently he shared how hard it was at the beginning and over 15 years. Now he's got it running, and when we need TSF, let's consider using his software. > > > If yes, we must modify the interface so that keyboard driver DLLs are > > really read in UTF-16. And/or we must find another compiler. > > > > Must the Windows driver be compiled by a Microsoft compiler? > > The compiler is not the issue. The point is that the 16-bit code > exists, and programs that use the 16-bit API exist. Language upgrades > may make supplementary characters easier to use in programs, but that > is all. They don't change existing binary interfaces. Indeed. And if it made sense to use compilers other than those shipping with the WDK, Max would have told us in this thread. So best practice is to stick with the original development environment. Or to use TSF. Thanks, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Thu Aug 13 10:13:03 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Thu, 13 Aug 2015 17:13:03 +0200 (CEST) Subject: Implementing SMP on a UTF-16 OS In-Reply-To: References: <2079848060.15640.1439235999068.JavaMail.www@wwinf1d19> <20150810212503.58cc24d0@JRWUBU2> <1761086697.16696.1439239991108.JavaMail.www@wwinf1d19> Message-ID: <1925503470.16339.1439478783459.JavaMail.www@wwinf1c10> > On 12 Aug 2015 at 18:55, Peter Constable wrote: > > > > "In Visual Studio 2015, the C++ compiler and standard library have been updated with enhanced support for C++11 and initial support for certain C++14 features. They also include preliminary support for certain features expected to be in the C++17 standard." Thanks; I've just downloaded the WDK 10. Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed...
URL: From charupdate at orange.fr Thu Aug 13 10:23:59 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Thu, 13 Aug 2015 17:23:59 +0200 (CEST) Subject: Implementing SMP on a UTF-16 OS In-Reply-To: References: <2079848060.15640.1439235999068.JavaMail.www@wwinf1d19> <20150810212503.58cc24d0@JRWUBU2> <1761086697.16696.1439239991108.JavaMail.www@wwinf1d19> Message-ID: <1667485856.16669.1439479439796.JavaMail.www@wwinf1c10> On 13 Aug 2015 at 09:05, Max Truxa wrote: > Anyhow, Marcel could certainly declare the mapping in a .cpp (using > extern "C" to ensure interoperability with C code) but that wouldn't > change that surrogate pairs seem to be unsupported for keyboard > drivers. (Like I said I have no experience writing keyboard drivers so > I can't confirm this.) Not only are surrogate pairs unsupported, even the possibility of having them in one DEADTRANS entry seems to be definitely blocked: http://www.siao2.com/2004/12/17/323257.aspx I would like to define a DEADTRANSEXT function delivering two code units instead of one; but it seems to me that I'm dreaming. There must be an API behind it that wouldn't recognize this. I hope I'm wrong. Thanks for the information. Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenwhistler at att.net Thu Aug 13 10:47:45 2015 From: kenwhistler at att.net (Ken Whistler) Date: Thu, 13 Aug 2015 08:47:45 -0700 Subject: Update on flag tags (PRI #299)? In-Reply-To: <20150813075840.665a7a7059d7ee80bb4d670165c8327d.bbcb98170c.wbe@email03.secureserver.net> References: <20150813075840.665a7a7059d7ee80bb4d670165c8327d.bbcb98170c.wbe@email03.secureserver.net> Message-ID: <55CCBC21.8090102@att.net> Doug, On 8/13/2015 7:58 AM, Doug Ewell wrote: > The recently posted minutes from UTC #144 > include the following: > >> B.11.1.1.3 PRI 299 feedback and mailing list discussion [Edberg, >> L2/15-210] >> >> Discussion. UTC took no action at this time.
> and: > >> [144-A93] Action Item for Rick McGowan: Close PRI #299, saying: All >> feedback has been considered and will be part of the deliberations for >> possible future extension mechanisms. > This sounds like the entire flag-tag proposal described by PRI #299 has > been put on indefinite hold. Is that accurate? > > No, it is not. No recorded actions were taken by the UTC, but what is missing from the recorded minutes is the hours of discussion that took place both during plenary and during the various ad hoc meetings held during lunch hours. The ad hoc group did not file a written report, but the upshot is basically that the Emoji SC has general direction to take all the feedback and discussion and work up a more detailed proposal that addresses all of the issues involved. At some point that will appear as a new proposal for further discussion and decision. So stay tuned. --Ken From doug at ewellic.org Thu Aug 13 15:14:56 2015 From: doug at ewellic.org (Doug Ewell) Date: Thu, 13 Aug 2015 13:14:56 -0700 Subject: Update on flag tags (PRI #299)? Message-ID: <20150813131456.665a7a7059d7ee80bb4d670165c8327d.20f4c7c6f6.wbe@email03.secureserver.net> Ken Whistler wrote: > but the upshot is basically that the Emoji SC > has general direction to take all the feedback and discussion and > work up a more detailed proposal that addresses all of the issues > involved. At some point that will appear as a new proposal for > further discussion and decision. So stay tuned. Thanks. On Wed, 01 Jul 2015 16:20:08 +0000, Noah Slater wrote: >>? > > Can someone help me understand what this means for my rainbow flag > proposal? I can't speak for Noah, nor for others who might want to propose a non-region, non-subdivision flag emoji, but it might be helpful if the Emoji SC can at least say whether that type of flag is expected to be within the scope of their more detailed proposal.
That might help Noah and others decide whether they need to invest the effort to write up a proposal for a unitary character. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From charupdate at orange.fr Fri Aug 14 06:14:55 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Fri, 14 Aug 2015 13:14:55 +0200 (CEST) Subject: Implementing SMP on a UTF-16 OS Message-ID: <1713579761.8528.1439550895738.JavaMail.www@wwinf1j10> As far as it remained pressing, the issue is now resolved to some extent. Only five high surrogates are used for the 2,413 SMP characters that would most probably be wished to be available on a universal Latin layout, so that five key positions (for example at Shift+Kana) are enough to ensure input efficiency along with streamlined Compose sequences for the low surrogates. This results from examining the NamesList in a spreadsheet, with surrogates generated by Excel formulas. The five leading surrogates in question are: U+D800 (12 Roman symbols, U+10190 sqq.); U+D835 (996 mathematical letters, U+1D400 sqq.); U+D83C (421 mathematical letters, symbols, and emoji, U+1F100 sqq.); U+D83D (821 emoticons and stars, U+1F400 sqq.); U+D83E (163 arrows, U+1F800 sqq.). However, this workaround is far from optimal and requires the user to learn, in addition to the Compose sequences, which leading surrogate must be typed first. For example, U+1F16A RAISED MC SIGN and U+1F16B RAISED MD SIGN (which, being for use in Canada, are supposed to be on every universal Latin layout of any locale) should be input with Compose, m, c, and Compose, m, d, respectively. Now the user must type Shift+Kana+S, Compose, m, c, or Shift+Kana+S, Compose, m, d. It's still better than not having them at all. (Depending on the locale, one might wish to map them to (Shift+) Ctrl+Alt+C and Ctrl+Alt+D.) I'm still hoping that there will be a means to make DEADTRANS deliver two code units alternatively, or to define and use a DEADTRANSEXT function.
Best regards, Marcel Schneider From gwalla at gmail.com Fri Aug 14 13:31:28 2015 From: gwalla at gmail.com (Garth Wallace) Date: Fri, 14 Aug 2015 11:31:28 -0700 Subject: Chess symbol glyphs in code charts Message-ID: Can anyone tell me what font is used for the chess symbols in the code chart for the Miscellaneous Symbols block? It looks a lot like Chess Merida but I can't be certain. From kenwhistler at att.net Fri Aug 14 13:50:50 2015 From: kenwhistler at att.net (Ken Whistler) Date: Fri, 14 Aug 2015 11:50:50 -0700 Subject: Chess symbol glyphs in code charts In-Reply-To: References: Message-ID: <55CE388A.3030000@att.net> Garth, The glyphs for the chess symbols in the 26XX block date from Unicode 3.0. Most of the symbols redesigned for the Unicode 3.0 charts were done by John M. Fiscella. (See the font acknowledgements on p. iv of Unicode 3.0.) I do not know which predecessor designs Fiscella might ultimately have based his designs on. The *actual* font used in the chart production is some in house chart font, possibly tweaked over the years for various specific glyphs, although it doesn't appear to me on first inspection that any of the chess symbol glyphs per se have had any workover since the Unicode 3.0 publication. The chart fonts are in house, used with special licenses specific to Unicode chart production, and with all sorts of chart-specific quirks. So even if I did attempt to track down specifically which font was involved for the current Unicode 26XX block for the 2654..265F range of glyphs, knowing that wouldn't actually help much for your question, I think. --Ken On 8/14/2015 11:31 AM, Garth Wallace wrote: > Can anyone tell me what font is used for the chess symbols in the code > chart for the Miscellaneous Symbols block? It looks a lot like Chess > Merida but I can't be certain. 
> From asmus-inc at ix.netcom.com Fri Aug 14 14:31:10 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Fri, 14 Aug 2015 12:31:10 -0700 Subject: Chess symbol glyphs in code charts In-Reply-To: <55CE388A.3030000@att.net> References: <55CE388A.3030000@att.net> Message-ID: <55CE41FE.5060503@ix.netcom.com> An HTML attachment was scrubbed... URL: From gwalla at gmail.com Fri Aug 14 15:26:45 2015 From: gwalla at gmail.com (Garth Wallace) Date: Fri, 14 Aug 2015 13:26:45 -0700 Subject: Chess symbol glyphs in code charts In-Reply-To: <55CE388A.3030000@att.net> References: <55CE388A.3030000@att.net> Message-ID: Would it be acceptable if I extracted the font from the code chart PDF and used it as the basis for one in a proposal I'm working on? The proposal covers rotated and half-black half-white chess symbols, which should match the shapes of the existing ones, and compound symbols, which should harmonize. On Fri, Aug 14, 2015 at 11:50 AM, Ken Whistler wrote: > Garth, > > The glyphs for the chess symbols in the 26XX block date from > Unicode 3.0. Most of the symbols redesigned for the Unicode 3.0 > charts were done by John M. Fiscella. (See the font acknowledgements > on p. iv of Unicode 3.0.) I do not know which predecessor designs > Fiscella might ultimately have based his designs on. > > The *actual* font used in the chart production is some in house > chart font, possibly tweaked over the years for various specific > glyphs, although it doesn't appear to me on first inspection that > any of the chess symbol glyphs per se have had any workover since the > Unicode 3.0 publication. The chart fonts are in house, used with > special licenses specific to Unicode chart production, and with all > sorts of chart-specific quirks. So even if I did attempt to track down > specifically which font was involved for the current Unicode 26XX > block for the 2654..265F range of glyphs, knowing that wouldn't > actually help much for your question, I think. 
> > --Ken > > > On 8/14/2015 11:31 AM, Garth Wallace wrote: >> >> Can anyone tell me what font is used for the chess symbols in the code >> chart for the Miscellaneous Symbols block? It looks a lot like Chess >> Merida but I can't be certain. >> > From haberg-1 at telia.com Fri Aug 14 17:46:38 2015 From: haberg-1 at telia.com (Hans Åberg) Date: Sat, 15 Aug 2015 00:46:38 +0200 Subject: Chess symbol glyphs in code charts In-Reply-To: References: Message-ID: <97FEE862-58DD-405B-8D57-AF384928CFDB@telia.com> > On 14 Aug 2015, at 20:31, Garth Wallace wrote: > > Can anyone tell me what font is used for the chess symbols in the code > chart for the Miscellaneous Symbols block? It looks a lot like Chess > Merida but I can't be certain. They are quite close to Apple Symbols, but not exactly the same. From asmus-inc at ix.netcom.com Fri Aug 14 19:43:57 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Fri, 14 Aug 2015 17:43:57 -0700 Subject: Chess symbol glyphs in code charts In-Reply-To: References: <55CE388A.3030000@att.net> Message-ID: <55CE8B4D.2000500@ix.netcom.com> An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Sun Aug 16 05:20:24 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 16 Aug 2015 11:20:24 +0100 Subject: Standardised Variation Sequences with Toggles Message-ID: <20150816112024.1760f1a6@JRWUBU2> The view of the Unicode Technical Committee appears to be that the Unicode Character Database (UCD) takes priority over the core text of the Unicode Standard in case of conflict. (Please advise if I have misunderstood; I only have the core text and samples of past behaviour to go on, neither of which appears to be binding.) I am worried that this view may come to cause a redefinition of sequences in which the variation selector is intended to toggle between what are normally contextually determined forms. The clearest example is 'phags-pa letter reversed shaping small a'.
Phags-pa is a 'cursive' script, and this letter is dual-joining. From just the text of StandardizedVariants.txt and the text and pictures of StandardizedVariants.html (the latter are in the process of migrating to the code charts, which will replace the HTML file in Unicode 9.0.0), one could easily imagine that the usual forms of <A856> and <A856, FE00> in authentic continuous text were different. In fact, *careful* reading of the core text shows that the commonest forms of these two sequences in authentic text are identical! The paradox arises because U+A856 PHAGS-PA LETTER SMALL A and several other characters may be mirrored about the reading axis after certain letters or flipped letters, and to avoid complications, the rule is that by default they are mirrored in these extremely rare environments. I believe this mirroring is what is meant by the word 'shaping' in the description of the variant; it is not a reflection of the 'cursive' nature of the script. U+FE00 toggles the mirroring state, and this is what is meant by the word 'reversed', not that the letter is the other way round to the form in the code chart. Unlike the other contextually mirrored characters, it so happens that, more often than not, U+A856 is not actually mirrored in the authentic extant text where the Unicode rules call for mirroring. I believe the Phags-pa code chart should have a normative statement that U+FE00 is acting as a toggle, and refer back to the core text. Now Phags-pa is a relatively clean case - all standardised variants in the block have the same behaviour, so a single sentence in the block's code chart might suffice. However, I do not believe this is always the case. One possibility would be to change the text from ~ A856 FE00 phags-pa letter reversed shaping small a to ~ A856 FE00 phags-pa letter reversed shaping small a ? Toggles between <A856> and <A856, FE00>; see core text for contextual shaping. where text in '<...>' is rendered as a string, not echoed as ASCII. However, that reads clumsily.
Can people suggest improvements? Similar text would be needed for StandardizedVariants.txt in the UCD. The relevant line currently reads "A856 FE00; phags-pa letter reversed shaping small a; # PHAGS-PA LETTER SMALL A" Obviously this potential problem needs to be formally reported, but I would first like to see other people's views. There are other cases where variation selectors were intended as toggles, but the ones I know of are not so clearly documented. Richard. From alexweiner at alexweiner.com Sun Aug 16 09:35:17 2015 From: alexweiner at alexweiner.com (alexweiner at alexweiner.com) Date: Sun, 16 Aug 2015 07:35:17 -0700 Subject: APL Under-bar Characters Message-ID: <20150816073517.e74b0ce91403bfe413f98785c6a226af.37509199d0.wbe@email06.secureserver.net> An HTML attachment was scrubbed... URL: From khaledhosny at eglug.org Sun Aug 16 10:17:06 2015 From: khaledhosny at eglug.org (Khaled Hosny) Date: Sun, 16 Aug 2015 17:17:06 +0200 Subject: APL Under-bar Characters In-Reply-To: <20150816073517.e74b0ce91403bfe413f98785c6a226af.37509199d0.wbe@email06.secureserver.net> References: <20150816073517.e74b0ce91403bfe413f98785c6a226af.37509199d0.wbe@email06.secureserver.net> Message-ID: <20150816151706.GA2553@khaled-laptop> On Sun, Aug 16, 2015 at 07:35:17AM -0700, alexweiner at alexweiner.com wrote: > Hello Unicode Mailing List, > > There is significant discussion about the problems of adding capital letters > with individual under-bars in this mailing list for GNU APL. > > http://lists.gnu.org/archive/html/bug-apl/2015-08/msg00050.html > > Pretty much it adds up to the following problem: > > The string length functionality would view an 'A' code point combined with an > '_' code point as an item that has two elements, while something that looks > like 'A' should be atomic, and return a length of one. I think what you need is better 'character' counting [1], rather than new precomposed characters. Regards, Khaled 1.
http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries From alexweiner at alexweiner.com Sun Aug 16 11:31:25 2015 From: alexweiner at alexweiner.com (alexweiner at alexweiner.com) Date: Sun, 16 Aug 2015 09:31:25 -0700 Subject: APL Under-bar Characters Message-ID: <20150816093125.e74b0ce91403bfe413f98785c6a226af.654c59af0e.wbe@email06.secureserver.net> An HTML attachment was scrubbed... URL: From khaledhosny at eglug.org Sun Aug 16 11:53:52 2015 From: khaledhosny at eglug.org (Khaled Hosny) Date: Sun, 16 Aug 2015 18:53:52 +0200 Subject: APL Under-bar Characters In-Reply-To: <20150816093125.e74b0ce91403bfe413f98785c6a226af.654c59af0e.wbe@email06.secureserver.net> References: <20150816093125.e74b0ce91403bfe413f98785c6a226af.654c59af0e.wbe@email06.secureserver.net> Message-ID: <20150816165351.GA10179@khaled-laptop> On Sun, Aug 16, 2015 at 09:31:25AM -0700, alexweiner at alexweiner.com wrote: > Khaled, > Thank you for the link. The normalization methods were already discussed, > specifically here: > > http://lists.gnu.org/archive/html/bug-apl/2015-08/msg00047.html Grapheme cluster boundary detection is different from normalisation; please read the link I provided. > Where the problem of "how big" 'ä' is, is discussed. The answer being that this is > one symbol, because the Unicode Consortium decided that it is also its own > standalone character. From the thread: > > I'll give you an example. What would you want ⍴,'ä' to be? > > Right now, that could return either 1 or 2 depending on whether the 'ä' was using > the precomposed character (U+00E4) or the combining mark (U+0061, U+0308). > Visually, these are identical, and generally you'd expect them to compare > equal. If you are counting grapheme clusters, then the answer is one in both cases. > In Unicode, the comparison of equivalent (but with different characters) > strings is done by performing a normalisation step prior to comparison. There > are 4 different types of normalisation, with different behaviour.
Quoting from the link I provided: A key feature of default Unicode grapheme clusters (both legacy and extended) is that they remain unchanged across all canonically equivalent forms of the underlying text. Thus the boundaries remain unchanged whether the text is in NFC or NFD. Using a grapheme cluster as the fundamental unit of matching thus provides a very clear and easily explained basis for canonically equivalent matching. This is important for applications from searching to regular expressions. See also: http://unicode.org/faq/char_combmark.html#7 > Now, the 'ä' character has a precomposed form in Unicode, and if you couple that > with the NFC normalisation form, you'd get the above expression to return 1. > > > So I'm not sure why the allowance was made for 'ä' as well as certain other > characters, but not for other things (under-bar characters) that face > similar representation issues. It was encoded for compatibility with pre-existing character sets AFAIK. Regards, Khaled > > > -------- Original Message -------- > Subject: Re: APL Under-bar Characters > From: Khaled Hosny > Date: Sun, August 16, 2015 8:17 am > To: alexweiner at alexweiner.com > Cc: unicode at unicode.org > > On Sun, Aug 16, 2015 at 07:35:17AM -0700, alexweiner at alexweiner.com wrote: > > Hello Unicode Mailing List, > > > > There is significant discussion about the problems of adding capital > letters > > with individual under-bars in this mailing list for GNU APL. > > > > http://lists.gnu.org/archive/html/bug-apl/2015-08/msg00050.html > > > > Pretty much it adds up to the following problem: > > > > The string length functionality would view an 'A' code point combined > with an > > '_' code point as an item that has two elements, while something that > looks > > like 'A' should be atomic, and return a length of one. > > I think what you need is better 'character' counting [1], rather than > new precomposed characters. > > Regards, > Khaled > > 1.
http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries > From richard.wordingham at ntlworld.com Sun Aug 16 13:27:13 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 16 Aug 2015 19:27:13 +0100 Subject: APL Under-bar Characters In-Reply-To: <20150816165351.GA10179@khaled-laptop> References: <20150816093125.e74b0ce91403bfe413f98785c6a226af.654c59af0e.wbe@email06.secureserver.net> <20150816165351.GA10179@khaled-laptop> Message-ID: <20150816192713.56ab0841@JRWUBU2> On Sun, 16 Aug 2015 18:53:52 +0200 Khaled Hosny wrote: > On Sun, Aug 16, 2015 at 09:31:25AM -0700, alexweiner at alexweiner.com > wrote: > > Now, the 'ä' character has a precomposed form in Unicode, and if you > > couple that with the NFC normalisation form, you'd get the above > > expression to return 1. > > So I'm not sure why the allowance was made for 'ä' as well as certain > > other characters, but not for other things (under-bar > > characters) that face similar representation issues. > It was encoded for compatibility with pre-existing character sets AFAIK. Note that compatibility means allowing habits of treating the precomposed characters as single characters to continue. These habits allowed a simple transition, but now cause confusion. Most rules work better in NFD than NFC. For string lengths in NFC, you immediately lose the rule len(a + b) = len(a) + len(b). For NFC, you don't even have len(a + b) <= len(a) + len(b). However, do note that for the corresponding 'string' algebra, the mathematical concept of a string no longer works - and this applies to both NFC and NFD. Instead, you have to allow for pairs of characters commuting, and so you get the concept of a 'trace'. If all combinations of base character and non-spacing marks were encoded, there'd be infinitely many.
Polytonic Greek has 36 *precomposed* combinations of base character and 3 combining marks, and some languages frequently use base characters with 4 combining marks; unexceptional words with 5 combining marks are less frequent. Richard. From kenwhistler at att.net Sun Aug 16 13:37:50 2015 From: kenwhistler at att.net (Ken Whistler) Date: Sun, 16 Aug 2015 11:37:50 -0700 Subject: APL Under-bar Characters In-Reply-To: <20150816165351.GA10179@khaled-laptop> References: <20150816093125.e74b0ce91403bfe413f98785c6a226af.654c59af0e.wbe@email06.secureserver.net> <20150816165351.GA10179@khaled-laptop> Message-ID: <55D0D87E.9020709@att.net> It seems to me that APL has some very deeply embedded (and ancient) assumptions about fixed-width 8-bit characters, dating from ASCII days. It only got as far as it did with the current assumptions because people hacked up 8-bit fonts for all the special characters for the APL syntax, and because IBM implemented those as dedicated special character sets with matching specialized APL keyboards. A built-in function like ⍴, which returns the *size* of data, goes structurally hand-in-hand with the definition of vectors and arrays. There seem to be very deep assumptions in the APL data model that strings are simply an array of *fixed-size* data elements, aka "characters". So requiring ⍴,'ä' and ⍴,'_A_' to "just work" is the moral equivalent of asking the C library call strlen("ä") or strlen("_A_") to "just work", regardless of the representation of the data in the string. It is a nonsensical requirement if applied to general Unicode strings outside the context of a very carefully restricted subset designed to ensure a one-to-one relationship between "character" and "array element".
A Unicode-based APL implementation can (presumably) just up the size of its "character" to 16-bits internally (actually a UTF-16 code *unit*) and carefully restrict itself to the subset of ASCII & Latin-1, the APL symbols and a few other operators needed to fill out the set. Looking at the fonts people seem to actually be using in various implementations, e.g.: http://aplwiki.com/AplCharacters the general choice seems to be to use both uppercase and lowercase Latin letters, and forgo the old convention of underlined uppercase Latin letters. That seems a small adjustment to make to not stay stuck in the 70's, frankly. I can understand Alex's request that Unicode then effectively "solve the problem" by providing a fixed-width 16-bit entity for "_A_" that could then just be added to the restricted subset in the APL implementations. But that isn't going to happen -- because of the normalization stability guarantees for the Unicode Standard. And in any case, if users of APL need something more sophisticated for actual string handling than strictly limited subsets based on the assumption that character=element_of_fixed_data_size_array, then rho and a limited subset aren't going to handle it anyway. At that point, another layer of abstraction would have to be built on top of the basic array and vector processing. And then Khaled's points about character=grapheme_cluster become relevant. --Ken On 8/16/2015 9:53 AM, Khaled Hosny wrote: > On Sun, Aug 16, 2015 at 09:31:25AM -0700, alexweiner at alexweiner.com wrote: > > > So I'm not sure why the allowance was made for ? as well as other certain > characters, but not for other things (under-bar characters) that face > similar representation issues. > It was encoded for compatibility of pre-existing character sets AFAIK. > > Regards, > Khaled > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From kenwhistler at att.net Sun Aug 16 14:08:34 2015 From: kenwhistler at att.net (Ken Whistler) Date: Sun, 16 Aug 2015 12:08:34 -0700 Subject: Standardised Variation Sequences with Toggles In-Reply-To: <20150816112024.1760f1a6@JRWUBU2> References: <20150816112024.1760f1a6@JRWUBU2> Message-ID: <55D0DFB2.7000806@att.net> On 8/16/2015 3:20 AM, Richard Wordingham wrote: > The view of the Unicode Technical committee appears to be that the > Unicode Character Database (UCD) takes priority over the core text of > the Unicode Standard in case of conflict. (Please advise if I have > misunderstood; I only have the core text and samples of past behaviour > to go on, neither of which appears to be binding.) Richard, That means that if a data file states, e.g., 200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;; thus *defining* the General_Category of ZWSP to be Cf, but that we find that due to some oversight in editing (of what is now a very large core specification, plus over a dozen annexes), somebody goofed up and happened to refer to ZWSP as gc=Zs, the data file *wins*. Some editorial oversight or a typo in the text of the core specification cannot be taken as legalistically somehow trumping the data file, just because somebody finds it "written in the standard". Capiche? This should not, IMO, be taken as occasion for general worrying about the status of data files and the core specification. (In most cases, the core specification is simply underspecified because the research, writing and editing for it is under-resourced.) > One possibility would be to change the text from > > ~ A856 FE00 phags-pa letter reversed shaping small a > > to > > ~ A856 FE00 phags-pa letter reversed shaping small a > > ? Toggles between and ; see core text for > contextual shaping. > > where text in '<...>' is rendered as a string, not echoed as ASCII. > > However, that reads clumsily. Can people suggest improvements? 
Yes, a notice at the top: @+ For details about the implementation of variation sequences in Phags-pa, please refer to the Phags-pa section of the core specification. --Ken From alexweiner at alexweiner.com Sun Aug 16 14:36:08 2015 From: alexweiner at alexweiner.com (alexweiner at alexweiner.com) Date: Sun, 16 Aug 2015 12:36:08 -0700 Subject: APL Under-bar Characters Message-ID: <20150816123608.e74b0ce91403bfe413f98785c6a226af.39c870b9bd.wbe@email06.secureserver.net> An HTML attachment was scrubbed... URL: From alexweiner at alexweiner.com Sun Aug 16 14:41:58 2015 From: alexweiner at alexweiner.com (alexweiner at alexweiner.com) Date: Sun, 16 Aug 2015 12:41:58 -0700 Subject: APL Under-bar Characters Message-ID: <20150816124158.e74b0ce91403bfe413f98785c6a226af.6dce0e5645.wbe@email06.secureserver.net> An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Sun Aug 16 17:50:08 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 16 Aug 2015 23:50:08 +0100 Subject: Standardised Variation Sequences with Toggles In-Reply-To: <55D0DFB2.7000806@att.net> References: <20150816112024.1760f1a6@JRWUBU2> <55D0DFB2.7000806@att.net> Message-ID: <20150816235008.43bbd56d@JRWUBU2> On Sun, 16 Aug 2015 12:08:34 -0700 Ken Whistler wrote: > Some editorial oversight or a typo in the text of the core > specification cannot > be taken as legalistically somehow trumping the data file, just > because somebody finds it "written in the standard". > > Capiche? No. What about oversights and typos in the UCD? Indeed, two variation sequences were removed because it was found that their bases were decomposable, which contradicts the core specification. In this case, the UCD did not trump the rules for variation sequences. When there is a contradiction, it needs to be investigated and resolved, with awareness that different people may be relying on different parts of the specification. 
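As a quick sanity check, the data-file line Ken quotes can be parsed and compared against a UCD-derived library; a minimal sketch in Python (the stdlib unicodedata module's tables are generated from the UCD, so they report what the data file defines):

```python
import unicodedata

# The UnicodeData.txt line quoted above, in its semicolon-delimited layout:
# code;Name;General_Category;Canonical_Combining_Class;Bidi_Class;...
line = "200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;"
fields = line.split(";")
cp, name, gc = int(fields[0], 16), fields[1], fields[2]

# Python's tables are generated from the UCD, so they agree with the file:
assert unicodedata.name(chr(cp)) == name
assert unicodedata.category(chr(cp)) == gc  # 'Cf' -- the file, not the prose, wins
print(f"U+{cp:04X} {name} gc={gc}")
```

So a gc=Zs claim for ZWSP anywhere in the prose is immediately falsifiable against the data file.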
> (In most cases, the core specification is simply underspecified because the
> research, writing and editing for it is under-resourced.)

That is also true of much of the UCD. I suspect that much of it relies on intelligent guesswork. Some properties may simply be ignored because nothing readily testable uses them (e.g. line and word-break properties relevant for scriptio continua writing systems), and others appear to be arbitrary. (Is the allocation of digits to L, AN or EN actually anything but an encoding decision?) Fortunately, most errors in the UCD can be corrected when the settings don't work; casing pairs, names, decompositions and canonical combining classes are the main problems. I believe problems arising from codepoint assignments could be fixed by creating singleton decompositions, e.g. to change mere numbers into decimal digits.

As an example of an effectively ignored line break property, I offer the line-break property of the Thai repetition character U+0E46 THAI CHARACTER MAIYAMOK. It is currently of general category Lm, and has the line-break property SA 'South-East Asian line-breaking'. This means that the Unicode line-breaking algorithm calls upon a non-standard algorithm to assign each instance of the character a line-break property. Now I believe that it should have line break property EX. I can find a grammatical description that says it should be separated from the preceding word by a space, and I have found no example in books of U+0E46 starting a line. Giving it line break property EX would prevent a line break between the space and the repetition mark. However, there is little point in trying to have it assigned line break property EX, for the Unicode assignment is irrefutable. My argument has to be addressed to the specifications of the algorithms doing Thai line-breaking.

A historical example of errors in the UCD is U+200B ZERO WIDTH SPACE (ZWSP).
Its primary use is as a word separator in scripts that don't have visible word separators, though I'm currently finding it useful in Word 2010 to split up excessively long path names without visible hyphens being added. When its general category was changed from Zs to Cf, its Unicode word-break property became 'Format'; it no longer had any effect on word-breaking. Its line-breaking behaviour was preserved, so the control of text layout was unaffected. For SE Asian languages, the change had no direct effect, for their word-breaking rules are largely outside the scope of the Unicode text segmentation algorithms.

All went well until someone decided that the TUS text describing it as a word-breaker was an 'editorial oversight'. A corrigendum removed this word-breaking behaviour, and SE Asian word processors started to misbehave as software maintainers caught up with the corrigendum. For details see an email from Javier Solá: http://unicode.org/mail-arch/unicode-ml/y2009-m01/0604.html . The referenced proposal gives the text of the erratum, dated May 2008. Presumably corrigenda did not then have numbers, for there is no trace of its former existence in http://www.unicode.org/versions/corrigenda.html .

A similar process is now in progress for U+2060 WORD JOINER (WJ), which is the opposite of ZWSP. It is intended that WJ will cease to indicate the absence of word boundaries. In scripts that have visible word boundaries, the absence of an effect on word-breaking is of no consequence for sequences of letters, for the mere juxtaposition of letters prevents a word-break between them. By contrast, SE Asian word-boundary detectors largely rely on recognising words, and they can make mistakes, or be given an impossible task. The English analogue is detecting the word boundary in 'humanevents' - is the last word 'events' or 'vents'? A notable challenge is to persuade a Thai spell-checker that a transliteration of 'Hemingway' is actually a single word.
Delimiting the boundaries does not work - one has to join the fragments into which the automatic word-breaker splits it. The language proposed for ISO 10646, in http://www.unicode.org/L2/L2015/15211-word-joiner.pdf , does not actually state that it does not prevent a word break, though stronger text denying that it suppresses word breaks has been proposed for Unicode. By contrast, U+202F NARROW NO-BREAK SPACE (NNBSP) looks set to regain its originally intended purpose, that of a narrow space that does not break words. The script for which it was intended, Mongolian, will be able to use the Unicode word-boundary detection algorithm once NNBSP is allowed as part of a word. However, the fact remains that NNBSP should never have been allowed to break words. The core text has long stated that NNBSP does not break Mongolian words. There remains, however, a possibility that European usage of NNBSP will prevent it from recovering its intended functionality. > Yes, a notice at the top: > > @+ For details about the implementation of variation sequences in > Phags-pa, please refer to the Phags-pa section of the core > specification. a) This is likely to be ignored by someone who is just looking for the *specification*. I think replacing 'implementation' by 'rendering' would be better. I would be inclined to add, 'These sequences are more complicated than they appear at first reading'. Otherwise, someone will just add them to the character to glyph conversion section of a font and think, "Job done". b) This won't work where the effort has not been expended on the core text. As to StandardizedVariants.txt, Section 23.4 needs to refer to the Phags-pa section in the core text. As that file points to the Section 23.4 of TUS, this should then at least suggest that the descriptions in the file do not override the core specification. Richard. 
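The practical consequence of the Zs-to-Cf change that Richard describes is easy to observe in UCD-derived libraries; a minimal sketch in Python (stdlib only):

```python
import re
import unicodedata

zwsp = "\u200b"  # U+200B ZERO WIDTH SPACE

# Its General_Category is Cf (format), not Zs (space separator) ...
assert unicodedata.category(zwsp) == "Cf"

# ... so generic whitespace tests no longer treat it as a space:
assert not zwsp.isspace()              # str.isspace() follows the UCD
assert re.search(r"\s", zwsp) is None  # \s does not match it either

# Any word segmentation that honours ZWSP must therefore special-case it,
# e.g. a naive splitter for ZWSP-delimited text:
print("hello\u200bworld".split(zwsp))  # ['hello', 'world']
```

In other words, once ZWSP stopped being a "space" by property, every word-breaker that still wants it to separate words has to treat it explicitly, which is exactly where the SE Asian tailorings come in.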
From richard.wordingham at ntlworld.com Sun Aug 16 18:06:20 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 17 Aug 2015 00:06:20 +0100 Subject: APL Under-bar Characters In-Reply-To: <20150816073517.e74b0ce91403bfe413f98785c6a226af.37509199d0.wbe@email06.secureserver.net> References: <20150816073517.e74b0ce91403bfe413f98785c6a226af.37509199d0.wbe@email06.secureserver.net> Message-ID: <20150817000620.5cb6b869@JRWUBU2> On Sun, 16 Aug 2015 07:35:17 -0700 wrote: > There is significant discussion about the problems of adding capital > letters with individual under-bars in this mailing list for GNU APL. > > http://lists.gnu.org/archive/html/bug-apl/2015-08/msg00050.html > Is there something I could do to make this addition to the Unicode > standard? There is already a section for APL symbols. A possible compromise would be to use the Private Use Area (PUA). If you need single characters, it may be an appropriate solution. It might even be better to use the PUA (codepoints U+E000 to U+F8FF) than to be assigned a block in the Supplementary Multilingual Plane (SMP) U+1xxxx or the 'deprecated' plane U+Exxxx. Richard. From kenwhistler at att.net Sun Aug 16 19:15:26 2015 From: kenwhistler at att.net (Ken Whistler) Date: Sun, 16 Aug 2015 17:15:26 -0700 Subject: APL Under-bar Characters In-Reply-To: <20150816124158.e74b0ce91403bfe413f98785c6a226af.6dce0e5645.wbe@email06.secureserver.net> References: <20150816124158.e74b0ce91403bfe413f98785c6a226af.6dce0e5645.wbe@email06.secureserver.net> Message-ID: <55D1279E.9020008@att.net> Alex, On 8/16/2015 12:41 PM, alexweiner at alexweiner.com wrote: > > As far as I know, APL definitely predates the Unicode consortium. Do > you think that The Consortium possibly overlooked the pre-existing > under-bar character set? > > The answer to that is no. Initially, Unicode 1.0 attempted to punt the entire APL complex functional symbol problem by encoding U+2300 APL COMPOSE OPERATOR. 
The concept was essentially that any of the combined symbols -- the old rack of stuff that people complained about entering with symbol/backspace/symbol keying -- could simply be represented as sequences of existing symbols. Think of 2300 as an early attempt to introduce an APL "script"-specific conjunct-forming virama, a la much-later artificially introduced script-specific joiners. Cf. U+2D7F TIFINAGH CONSONANT JOINER.

But U+2300 APL COMPOSE OPERATOR was an innovation that failed. It was fiercely opposed *by the APL community*, who wanted it out of 10646 and replaced with an explicit list of pre-formed complex functional symbols. Presumably for the same reason we are talking about here now: essentially that each symbol had to work as a "character", and in an APL context that meant fixed width and the same data size as all the other characters.

The removal of Unicode 1.0 U+2300 APL COMPOSE OPERATOR is documented in Unicode 1.1 as of 1993:

http://www.unicode.org/versions/Unicode1.1.0/

(see page 3)

The addition of APL functional symbols is documented in Section 5.4.8, pp. 39-41.

The exact repertoire that ended up encoded in the standard was the result of meetings between some Unicode representatives and some folks from the APL community. The names escape me at the moment, although it might be possible to recover some information eventually. (Documentation regarding Unicode events in late 1991 is sparse these days.) At any rate the agreed-upon additional repertoire is probably that included in:

X3L2/92-035, Unicode Request for Additional Characters in ISO/IEC 10646-1.2.

And the rest of the consequences and processing can be dug out of the ballot history record for the voting on 10646 in 1992.

At any rate, apropos *this* discussion, we agreed that the repertoire would cover all the complex functional symbols, but *not* the letters with underscores. And it is not that they were simply overlooked. How do I know?
Well, first, there were APL specialists involved in coming up with (and promoting) the repertoire that was carried into the 10646 balloting at the time. It isn't as if a bunch of ignorant Unicoders just grabbed one APL book off the shelf and coded up the table, not noticing that some stuff was missing.

Second, the text that is currently in the core specification about this issue, to wit:

" ... All other APL extensions can be encoded by composition of other Unicode characters. For example, the APL symbol a underbar can be represented by U+0061 LATIN SMALL LETTER A + U+0332 COMBINING LOW LINE." (Unicode 7.0, Section 22.7, p. 772)

is *ancient* text. It was first printed on p. 6-83 of Unicode 2.0 in 1996, with exactly the same wording. And the only reason it took until 1996 to appear, instead of 1993, was that the editing of Unicode 2.0 and its code charts was such a massive task at the time.

So the clear intent in *1993* was to represent any APL letter with underbar as a combining character sequence -- as noted. The only problem I see there is that the text in the core spec mistakenly used U+0061 (the lowercase "a") instead of U+0041 (the uppercase "A") for the exemplification.

Third, I can attest that at least some of us at the time -- as early as 1989 -- had printed copies of IBM EBCDIC code page 293 for APL, which had the EBCDIC uppercase Latin letters with underscores (italicized, by the way), together with the regular EBCDIC upper- and lowercase letters. [Dates from 1984.] *And* IBM EBCDIC code page 310 for APL, which dropped all the regular upper- and lowercase letters but added more symbols. *And* IBM PC code page 907 (with the underscored uppercase Latin letters) and PC code page 909 (CP437 hacked up for APL, without the underscored uppercase Latin letters), which was quickly superseded by PC code page 910, which also did not use the uppercase Latin letters with underscores.

So yeah, we knew about these.
Encoding them as combining character sequences instead of as atomic characters was a deliberate decision taken in 1992. And that decision made it through both UTC and international balloting for publication in 1993.

--Ken

-------------- next part -------------- An HTML attachment was scrubbed... URL: From alexweiner at alexweiner.com Sun Aug 16 19:49:28 2015 From: alexweiner at alexweiner.com (alexweiner at alexweiner.com) Date: Sun, 16 Aug 2015 17:49:28 -0700 Subject: APL Under-bar Characters Message-ID: <20150816174928.e74b0ce91403bfe413f98785c6a226af.05d9ddafcb.wbe@email06.secureserver.net> An HTML attachment was scrubbed... URL: From prosfilaes at gmail.com Sun Aug 16 19:59:28 2015 From: prosfilaes at gmail.com (David Starner) Date: Mon, 17 Aug 2015 00:59:28 +0000 Subject: APL Under-bar Characters In-Reply-To: <20150816174928.e74b0ce91403bfe413f98785c6a226af.05d9ddafcb.wbe@email06.secureserver.net> References: <20150816174928.e74b0ce91403bfe413f98785c6a226af.05d9ddafcb.wbe@email06.secureserver.net> Message-ID:

The standard is set here. The Unicode Consortium has declared that it won't encode precomposed characters that can be created from characters in the standard, because that would be destabilizing and potentially introduce security holes in programs depending on Unicode. If you want, we can have a vote on whether or not APL should use characters with underlines, since I was unfairly locked out of that vote by not being born yet.

On Sun, Aug 16, 2015 at 5:52 PM wrote:

> Ken,
> You pose a very strong, and well-worded response. The historical element
> really helps to illuminate what I thought was lost knowledge: "Why are
> there no under-bars". To this I can only ask one thing:
>
> Can we put this to a vote again? To put things in perspective, I was three
> years old at the time of the ballot in 1993 and had much larger issues to
> deal with (comprehending speech, learning to walk, etc.), and was unable to
> participate in this internationally binding vote.
>
> Perhaps feelings about the under-bar characters have changed since then. I
> know that the APL landscape is *very* different than it was in 1993.
>
> I have a copy of one of those IBM books that has the italicized upper-case
> under-bars. If my proposal for a new vote is well received, maybe we should
> include those as well, for completeness sake.
>
> -Alex
>
> -------- Original Message --------
> Subject: Re: APL Under-bar Characters
>
> From: Ken Whistler
> Date: Sun, August 16, 2015 5:15 pm
> To: alexweiner at alexweiner.com
> Cc: unicode at unicode.org
>
> [...]
>
> --Ken

-------------- next part -------------- An HTML attachment was scrubbed... URL: From alexweiner at alexweiner.com Sun Aug 16 20:16:09 2015 From: alexweiner at alexweiner.com (alexweiner at alexweiner.com) Date: Sun, 16 Aug 2015 18:16:09 -0700 Subject: APL Under-bar Characters Message-ID: <20150816181609.e74b0ce91403bfe413f98785c6a226af.61b5a510b6.wbe@email06.secureserver.net> An HTML attachment was scrubbed...
URL: From prosfilaes at gmail.com Sun Aug 16 20:27:17 2015 From: prosfilaes at gmail.com (David Starner) Date: Mon, 17 Aug 2015 01:27:17 +0000 Subject: APL Under-bar Characters In-Reply-To: <20150816181609.e74b0ce91403bfe413f98785c6a226af.61b5a510b6.wbe@email06.secureserver.net> References: <20150816181609.e74b0ce91403bfe413f98785c6a226af.61b5a510b6.wbe@email06.secureserver.net> Message-ID: http://unicode.org/policies/stability_policy.html , in particular, the Normalization Policy. The way the APL A with underscore is encoded is the way we've been saying, and Unicode has promised its users that there's no other way of writing it. The current precedent is that when users ask for things like this is that they are told they can't have them; for example, the Lithuanians were told that the way to encode LATIN CAPITAL LETTER A WITH OGONEK AND ACUTE is U+0104 U+0301, not any other way. They can be listed in http://www.unicode.org/Public/UCD/latest/ucd/NamedSequences.txt so that there can be a unique name to refer to them, but there will not be any new codepoint. On Sun, Aug 16, 2015 at 6:16 PM wrote: > David, > > I don't understand what you mean by saying that the standard is set. By > Ken's account, The Consortium decided to create a policy specifically > regarding this, by vote of APL (and I assume interested Unicode) users > worldwide. The Standard itself is in version eight. Why does a vote seem so > ridiculous, especially in the case of an addition, rather than a > subtraction? > > What is the current precedent for this sort of thing? > > -Alex > > -------- Original Message -------- > Subject: Re: APL Under-bar Characters > > From: David Starner > Date: Sun, August 16, 2015 5:59 pm > To: alexweiner at alexweiner.com, Ken Whistler > Cc: unicode at unicode.org > > The standard is set here. 
The Unicode Consortium has declared that it
> won't encode precomposed characters that can be created from characters in
> the standard, because that would be destabilizing and potentially introduce
> security holes in programs depending on Unicode.
>
> [...]

-------------- next part -------------- An HTML attachment was scrubbed... URL: From alexweiner at alexweiner.com Sun Aug 16 20:57:43 2015 From: alexweiner at alexweiner.com (alexweiner at alexweiner.com) Date: Sun, 16 Aug 2015 18:57:43 -0700 Subject: APL Under-bar Characters Message-ID: <20150816185743.e74b0ce91403bfe413f98785c6a226af.266e3b9a15.wbe@email06.secureserver.net> An HTML attachment was scrubbed... URL: From asmus-inc at ix.netcom.com Sun Aug 16 21:16:17 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Sun, 16 Aug 2015 19:16:17 -0700 Subject: APL Under-bar Characters In-Reply-To: <20150816185743.e74b0ce91403bfe413f98785c6a226af.266e3b9a15.wbe@email06.secureserver.net> References: <20150816185743.e74b0ce91403bfe413f98785c6a226af.266e3b9a15.wbe@email06.secureserver.net> Message-ID: <55D143F1.9030604@ix.netcom.com> An HTML attachment was scrubbed... URL: From kojiishi at gmail.com Mon Aug 17 01:21:44 2015 From: kojiishi at gmail.com (Koji Ishii) Date: Mon, 17 Aug 2015 15:21:44 +0900 Subject: Why doesn't Ideographic (ID) in UAX#14 have half-width katakana?
In-Reply-To: <4A7F11C4-2F5B-4156-844C-C21E543FCFC7@apple.com> References: <553EEE6D.2020004@ga2.so-net.ne.jp> <553EFB2E.3010808@hiroshima-u.ac.jp> <55403267.9060202@att.net> <55438AFB.6020000@att.net> <55439AE4.4020109@hiroshima-u.ac.jp> <5543AE4B.5020904@att.net> <5543D03B.80603@ix.netcom.com> <55467CAF.4080401@ix.netcom.com> <4A7F11C4-2F5B-4156-844C-C21E543FCFC7@apple.com> Message-ID: Hi all, I'm not in sync with publishing schedule, sorry about that, but is it possible to consider this change for Unicode 9.0 time frame? I believe all concerns were cleared in the discussion, but if any were left, I'd be happy to discuss further. And I hope I'm not too late this time? /koji On Tue, May 5, 2015 at 6:19 AM, Peter Edberg wrote: > I have been checking with various groups at Apple. The consensus here is > that we would like to see the linebreak value for halfwidth katakana > changed to ID. > > - Peter E > > > > On May 3, 2015, at 12:53 PM, Asmus Freytag (t) > wrote: > > On 5/3/2015 9:47 AM, Koji Ishii wrote: > > Thank you so much Ken and Asmus for the detailed guides and histories. > This helps me a lot. > > In terms of time frame, I don't insist on specific time frame, Unicode 9 > is fine if that works well for all. > > I'm not sure how much history and postmortem has to be baked into the > section of UAX#14, hope not much because I'm not familiar with how it was > defined so other than what Ken and Asmus kindly provided in this thread. > But from those information, I feel stronger than before that this was > simply an unfortunate oversight. In the document Ken quoted, F and W are > distinguished, but H and N are not. In '90, East Asian versions of Office > and RichEdit were in my radar and all of them handled halfwidth Katakana as > ID for the line breaking purposes. 
That's quite understandable given the > amount of code points to work on, given the priority of halfwidth Katakana, > and given the difference between "what line breaking should be" and UAX#14 as > Ken noted, but writing it up as a document doesn't look like an easy task. > > > Koji, > > kana are special in that they are not shared among languages. From that > perspective, there's nothing wrong with having a "general purpose" > algorithm support the rules of the target language (unless that would add > undue complexity, which isn't a consideration here). > > Based on the data presented informally here in postings, I find your > conclusion (oversight) quite believable. The task would therefore be to > present the same data in a more organized fashion as part of a formal > proposal. Should be doable. > > I think you'd want to focus on a survey of modern practice in > implementations (and if you have data on some of them going back to the > '90s, so much the better). > > From the historical analysis it's clear that there was a desire to create > assignments that didn't introduce random inconsistencies between LB and EAW > properties, but that kind of self-consistency check just makes sure that > all characters of some group defined by the intersection of property > subsets are treated the same (unless there's an overriding reason to > differentiate within). It seems entirely plausible that this process > misfired for the characters in question — more likely so, given that the > earliest drafts of the tables were based on an implementation also being > created by MS around the same time. That makes any difference from other MS > products even more likely to be an oversight. 
> > I do want to help UTC establish a precedent of getting changes like that > endorsed by a representative sample of implementers and key external > standards (where applicable, in this case that would be CSS), to avoid the > chance of creating undue disruption (and to increase the chance that the > resulting modified algorithm is actually usable off-the-shelf, for example > for "default" or "unknown language" type scenarios). > > Hence my insistence that you go out and drum up support. But it looks like > this should be relatively easy, as there seems to be no strong case for > maintaining the status quo, other than that it is the status quo. > > A./ > > > > I agree that implementers and the CSS WG should be involved, but given that IE and > FF have already tailored, and all MS products as well, I guess it should > not be too hard. I'm on the Chrome team now, and the only problem for me to fix > it in Chrome is to justify why Chrome wants to tailor rather than fixing > UAX#14 (and the bug priority...) > > Either Makoto or I can bring it up to the CSS WG to get back to you. > > /koji > > > On Sat, May 2, 2015 at 4:12 AM, Asmus Freytag (t) > wrote: > >> Thank you, Ken, for your dedicated archeological efforts. >> >> I would like to emphasize that, at the time, UAX#14 reflected observed >> behavior, in particular (but not exclusively) for MS products, some of which >> (at the time) used an LB algorithm that effectively matched an untailored >> UAX#14. >> >> However, recently, the W3C has spent considerable effort looking into >> different layout-related algorithms and specifications. If, in that context, >> a consensus approach is developed that would point to a better "default" >> behavior for untailored UAX#14-style line breaking, I would regard that as >> a critical mass of support to allow UTC to consider tinkering with such a >> long-standing set of property assignments. 
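[Editorial aside: the "tailoring versus changing the default" distinction that runs through this thread can be sketched in code. The table below is a toy illustration, not the real UAX #14 property data: it hardcodes only the two classes under discussion, with halfwidth katakana as AL per the 1999 draft, and shows a tailoring as an override layer consulted before the default lookup.]

```python
# Toy sketch of UAX #14-style tailoring (illustrative; not the real data files).
# Default assignments per the 1999 draft discussed below:
#   halfwidth katakana (U+FF66..U+FF9F) -> lb=AL
#   katakana block     (U+30A0..U+30FF) -> lb=ID

def default_line_break_class(ch: str) -> str:
    cp = ord(ch)
    if 0xFF66 <= cp <= 0xFF9F:   # halfwidth katakana
        return "AL"
    if 0x30A0 <= cp <= 0x30FF:   # (fullwidth) katakana
        return "ID"
    return "XX"                  # everything else is out of scope here

# A tailoring is just an override map consulted before the default table,
# here reclassifying halfwidth katakana as ID (the change proposed in the thread).
TAILOR_HALFWIDTH_TO_ID = {cp: "ID" for cp in range(0xFF66, 0xFFA0)}

def tailored_line_break_class(ch: str, tailoring: dict) -> str:
    return tailoring.get(ord(ch), default_line_break_class(ch))

print(default_line_break_class("\uFF76"))                           # untailored: AL
print(tailored_line_break_class("\uFF76", TAILOR_HALFWIDTH_TO_ID))  # tailored: ID
```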
>> >> This would be true, especially, if it can be demonstrated that (other >> than matching legacy behavior) there's no context that would benefit from >> the existing classification. I note that this was something several posters >> implied. >> >> So, if implementers of the legacy behavior are amenable to achieving this >> by tailoring, and if the change augments the number of situations where >> untailored UAX#14-style line breaking can be used, that would be a win that >> might offset the cost of a disruptive change. >> >> We've heard arguments why the proposed change is technically superior for >> Japanese. We now need to find out whether there are contexts where a change >> would adversely affect users/implementers. Following that, we would look >> for endorsements of the proposal from implementers or other standards >> organizations such as W3C (and, if at all possible, agreement from those >> implementers who use the untailored algorithm now). With these three >> preconditions in place, I would support an effort of the UTC to revisit >> this question. >> >> A./ >> >> >> On 5/1/2015 9:48 AM, Ken Whistler wrote: >> >> Suzuki-san, >> >> On 5/1/2015 8:25 AM, suzuki toshiya wrote: >> >> >> Excuse me, is there any discussion record of how the UAX#14 class for >> halfwidth katakana was decided 15 years ago? If there is, I want to >> see a sample text (of halfwidth katakana) and the expected layout >> result for it. >> >> >> The *founding* document for the UTC discussion of the initial >> Line_Break property values 15 years ago was: >> >> http://www.unicode.org/L2/L1999/99179.pdf >> >> and the corresponding table draft (before approval and conversion >> into the final format that was published with UTR #14 -- later >> *UAX* #14) was: >> >> http://www.unicode.org/L2/L1999/99180.pdf >> >> There is nothing different or surprising in terms of values there. The >> halfwidth katakana were lb=AL and the fullwidth katakana were lb=ID in >> that earliest draft, as of 1999. 
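[Editorial aside: the halfwidth/fullwidth split Ken describes is still visible from Python's unicodedata module, which exposes East_Asian_Width (though not Line_Break). Halfwidth katakana are EAW=H and fullwidth are EAW=W, matching the AL/ID split in the 1999 draft derived from those EAW values.]

```python
import unicodedata

# East_Asian_Width for halfwidth vs. fullwidth katakana KA.
# L2/99-179 derived the initial Line_Break values partly from EAW.
halfwidth_ka = "\uFF76"  # HALFWIDTH KATAKANA LETTER KA
fullwidth_ka = "\u30AB"  # KATAKANA LETTER KA

print(unicodedata.east_asian_width(halfwidth_ka))  # H (Halfwidth) -> was lb=AL
print(unicodedata.east_asian_width(fullwidth_ka))  # W (Wide)      -> was lb=ID
```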
>> >> What is new information, perhaps, is the explicit correlation that can be >> found >> in those documents with the East_Asian_Width properties, and the >> explanation >> in L2/99-179 that the EAW property values were explicitly used to >> make distinctions for the initial LB values. >> >> There is no sample text or expected layout results from that time period, >> because that was not the basis for the original UTC decisions on any of >> this. >> Initial LB values were generated based on existing General_Category >> and EAW values, using general principles. They were not generated by >> examining and specifying in detail the line breaking behavior for >> every single script in the standard, and then working back from those >> detailed specifications to attempt to create a universal specification >> that would replicate all of that detailed behavior. Such an approach >> would have been nearly impossible, given the state of all the data, >> and might have taken a decade to complete. >> >> That said, Japanese line breaking was no doubt considered as part of >> the overall background, because the initial design for UTR #14 was >> informed >> by experience in implementation of line breaking algorithms at Microsoft >> in the 90's. >> >> >> You commented that the UAX#14 class should not be changed but >> the tailoring of the line breaking behaviour would solve >> the problem (as Firefox and IE11 did). However, some developers >> may wonder "there might be a reason why UTC put halfwidth-katakana >> to AL - without understanding it, we could not determine whether >> the proposed tailoring should be enabled always, or enabled >> only for a specific environment (e.g. locale, surrounding text)". >> >> >> See above, in L2/99-179. *That* was the justification. It had nothing >> to do with specific environment, locale, or surrounding text. 
>> >> >> If UTC can supply the "expected layout result for halfwidth- >> katakana (used to define the class in current UAX#14)", it >> would be helpful for the developers to evaluate the proposed >> tailoring algorithm. >> >> >> UAX #14 was never intended to be a detailed, script-by-script >> specification of line layout results. It is a default, generic, universal >> algorithm for line breaking that does a decent, generic job of >> line breaking in generic contexts without tailoring or specific >> knowledge of language, locale, or typographical conventions in use. >> >> UAX #14 is not a replacement for full specification of kinsoku >> rules for Japanese, in particular. Nor is it intended as any kind >> of replacement for JIS X 4051. >> >> Please understand this: UAX #14 does *NOT* tell anyone how >> Japanese text *should* line break. Instead, it is Japanese typographers, >> users and standardizers who tell implementers of line break >> algorithms for Japanese what the expectations for Japanese text should >> be, in what contexts. It is then the job of the UTC and of the >> platform and application vendors to negotiate the details of >> which part of that expected behavior makes sense to try to >> cover by tweaking the default line-breaking algorithm and the >> Line_Break property values for Unicode characters, and which >> part of that expected behavior makes sense to try to cover >> by adjusting commonly accessible and agreed upon tailoring >> behavior (or public standards like CSS), and finally which part of that >> expected behavior should instead be addressed by value-added, proprietary >> implementations of high end publishing software. >> >> Regards, >> >> --Ken >> >> >> >> >> >> > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From andrewcwest at gmail.com Mon Aug 17 03:16:45 2015 From: andrewcwest at gmail.com (Andrew West) Date: Mon, 17 Aug 2015 09:16:45 +0100 Subject: Standardised Variation Sequences with Toggles In-Reply-To: <20150816235008.43bbd56d@JRWUBU2> References: <20150816112024.1760f1a6@JRWUBU2> <55D0DFB2.7000806@att.net> <20150816235008.43bbd56d@JRWUBU2> Message-ID: On 16 August 2015 at 23:50, Richard Wordingham wrote: > >> @+ For details about the implementation of variation sequences in >> Phags-pa, please refer to the Phags-pa section of the core >> specification. > > a) This is likely to be ignored by someone who is just looking for the > *specification*. I think replacing 'implementation' by 'rendering' > would be better. I would be inclined to add, 'These sequences are more > complicated than they appear at first reading'. Otherwise, someone > will just add them to the character to glyph conversion section of a > font and think, "Job done". That's not a plausible scenario. Phags-pa has complex shaping and joining requirements, and it is impossible for someone to create a properly functioning Phags-pa font based on the code charts alone. If anyone did implement Phags-pa in a font based solely on the Unicode or 10646 code charts, with no joining or shaping behaviour, for use as a fallback font or as a code chart font then naively implementing U+A856 + U+FE00 (VS1) as a mirrored glyph is not unreasonable. If they want to produce a Phags-pa font for displaying running Phags-pa text then at a minimum they will need to read the appropriate section of the core specification. 
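[Editorial aside: in plain text, a standardized variation sequence like the one Andrew mentions is simply two code points — the base character followed by a variation selector; whether a font mirrors the glyph is purely a rendering question. A quick inspection of the pieces in Python:]

```python
import unicodedata

# A standardized variation sequence is an ordinary two-code-point sequence:
# base character + variation selector.
seq = "\uA856\uFE00"  # Phags-pa base letter + VARIATION SELECTOR-1 (VS1)

print(len(seq))                  # 2 code points, no special plain-text status
print(unicodedata.name(seq[0]))  # a PHAGS-PA letter
print(unicodedata.name(seq[1]))  # VARIATION SELECTOR-1
```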
Andrew From chris.fynn at gmail.com Mon Aug 17 04:22:35 2015 From: chris.fynn at gmail.com (Christopher Fynn) Date: Mon, 17 Aug 2015 14:52:35 +0530 Subject: Emoji characters for food allergens In-Reply-To: <29292306.26076.1437842589469.JavaMail.defaultUser@defaultHost> References: <29292306.26076.1437842589469.JavaMail.defaultUser@defaultHost> Message-ID: Surely there is already some international standards body or panel which deals with food safety and labelling? (maybe ISO 22000 Food Safety Management Systems) If there is a real need for characters to represent food allergens, wouldn't such a body be the right group to come up with appropriate glyphs and then make a proposal to ISO 10646 / Unicode - Chris On 25 July 2015 at 22:13, William_J_G Overington wrote: > Emoji characters for food allergens > > An interesting document entitled > > Preliminary proposal to add emoji characters for food allergens > > by Hiroyuki Komatsu > > was added into the UTC (Unicode Technical Committee) Document Register > yesterday. > > http://www.unicode.org/L2/L2015/15197-emoji-food-allergens.pdf > > This is a welcome development. > > I suggest that, in view of the importance of precision in conveying > information about food allergens, that the emoji characters for food > allergens should be separate characters from other emoji characters. That > is, encoded in a separate quite distinct block of code points far away in > the character map from other emoji characters, with no dual meanings for > any of the characters: a character for a food allergen should be quite > separate and distinct from a character for any other meaning. > > I opine that having two separate meanings for the same character, one > meaning as an everyday jolly good fun meaning in a text message and one > meaning as a specialist food allergen meaning could be a source of > confusion. 
Far better to encode a separate code block with separate > characters right from the start than risk needless and perhaps medically > dangerous confusion in the future. > > I suggest that for each allergen that there be two characters. > > The glyph for the first character of the pair goes from baseline to > ascender. > > The glyph for the second character of the pair is a copy of the glyph for > the first character of the pair augmented with a thick red line from lower > left descender to higher right a little above the base line, the thick red > line perhaps being at about thirty degrees from the horizontal. Thus the > thick red line would go over the allergen part of the glyph yet just by > clipping it a bit so that clarity is maintained. > > The glyphs are thus for the presence of the allergen and the absence of > the allergen respectively. > > It is typical in the United Kingdom to label food packets not only with an > ingredients list but also with a list of allergens in the food and also > with a list of allergens not in the food. > > For example, a particular food may contain soya yet not gluten. > > Thus I opine that two characters are needed for each allergen. > > I have deliberately avoided a total strike through at forty-five degrees > as I opine that that could lead to problems distinguishing clearly the > glyph for the absence of one allergen from the glyph for the absence of > another allergen. > > I have also wondered whether each glyph for an allergen should include > within its glyph a number, maybe a three-digit number, so that clarity is > precise. > > I opine that two separate characters for each allergen is desirable rather > than some solution such as having one character for each allergen and a > combining strike through character. > > The two separate characters approach keeps the system straightforward to > use with many software packages. 
The matter of expressing food allergens is > far too important to become entangled in problems for everyday users. > > For gluten, it might be necessary to have three distinct code points. > > In the United Kingdom there is a legal difference between "gluten-free" > and "no gluten-containing ingredients". > > To be labelled gluten-free the product must have been tested. This is to > ensure that there has been no cross-contamination of ingredients. For > example, rice has no gluten, but was a particular load of rice transported > in a lorry used for wheat on other days? > > Yet testing is not always possible in a restaurant situation. > > William Overington > > 25 July 2015 > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Mon Aug 17 06:48:48 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 17 Aug 2015 13:48:48 +0200 (CEST) Subject: APL Under-bar Characters Message-ID: <817872584.8325.1439812128893.JavaMail.www@wwinf1j14> On 16 Aug 2015, at 16:35, Alex Weiner wrote: > I have heard that the problem was brought to Unicode consortium before, and the answer was to just use the underline styling, as it is apparently equivalent, but I do not think it is. Underline styling usually connects the line from one letter to another l̲i̲k̲e̲ ̲t̲h̲i̲s̲. The under-bar characters do not do such connecting, and are actually only for capital letters. so it would look more L̲ I̲ K̲ E̲ ̲ T̲ H̲ I̲ S̲ (I added the spaces for dramatic effect). [And I left out the underscore in the middle.] This connecting behavior of the underline formatting can be disabled by checking the "Words only" check box in LibreOffice Writer, and surely in new versions of Microsoft Office Word, or in older versions by selecting "Words" in the underline style dropdown menu. When the Words only feature is enabled, the underline skips all spaces, including no-break spaces. 
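[Editorial aside: the distinction Alex is drawing — styling applied by a word processor versus characters in the text itself — can be made concrete. A combining under-bar is part of the plain-text character stream, and it has no precomposed form for normalization to collapse into; the character choices below are illustrative, not from the thread.]

```python
import unicodedata

# "Underline styling" lives in the rich-text layer; a combining mark is a
# character in the plain-text stream itself.
plain = "LIKE"
underbarred = "".join(ch + "\u0332" for ch in plain)  # U+0332 COMBINING LOW LINE

print(len(plain))        # 4 code points
print(len(underbarred))  # 8 code points: each letter carries its own mark

# There is no precomposed "A with low line", so NFC keeps the two-code-point pair:
print(len(unicodedata.normalize("NFC", "A\u0332")))  # still 2
```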
____ I need to mention that I would have posted this yesterday, but refrained because I had already sent too much these days, particularly in the threads about Michael Kaplan. I repeat that I am very sorry, and I ask Michael to forgive my reactions to the blog post he wrote ad hoc while he was angry about our discussion here. ____ On 17 Aug 2015 at 02:25, Ken Whistler wrote: > It isn't as if a bunch of ignorant Unicoders just grabbed one APL book off the shelf and coded up the table, not noticing that some stuff was missing. The false idea about underline formatting (which "usually connects") that, in this highly technical context (after all, a programming language), was set against the advice of the Unicode consortium made me think immediately that such reasonings are symptomatic of the Unicode-contesting posture that seems to be the first reflex of many people (including myself) from the beginning, historically as well as personally speaking. Since I too, at my start on this mailing list (which was also my first mailing list participation), engaged in such an immature attempt, I seem well placed to state once and for all (I hope) that making trouble for little to no use is a bit like process garbage: it wastes the mental and physical energy that should be saved to save the planet, and the life on it. All the best, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Mon Aug 17 06:51:26 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 17 Aug 2015 13:51:26 +0200 (CEST) Subject: Custom keyboard source samples (was: Re: Windows keyboard restrictions) Message-ID: <749936537.8372.1439812286240.JavaMail.www@wwinf1j14> On 07 Aug 2015, at 20:54, Richard Wordingham wrote: > What we're waiting for is a guide we can follow, or some code we can ape. 
Since yesterday I know a very simple way to get the source code (in C) of any MSKLC layout. While the build is done, we must wait for the four files appearing in an ad hoc created "amd64" subdirectory in the Temporary Files folder, in the hidden Application Data directory: C:\Users\user\AppData\Local\Temp\amd64 As soon as the four files are visible in the Explorer, we can press Ctrl+A, Ctrl+C, Ctrl+V. This must be done rapidly, in order to get a copy before their deletion by MSKLC a few seconds later. If we notice that during the build, three other temporary folders are created by MSKLC and deleted if empty, we may wish to know that the four files are strictly identical in all four folders. This has been verified on a simple layout, using the (very useful) comparison tool of the ConTEXT text editor. Best regards, Marcel From doug at ewellic.org Mon Aug 17 11:23:15 2015 From: doug at ewellic.org (Doug Ewell) Date: Mon, 17 Aug 2015 09:23:15 -0700 Subject: APL Under-bar Characters Message-ID: <20150817092315.665a7a7059d7ee80bb4d670165c8327d.91850dae39.wbe@email03.secureserver.net> wrote: > I have heard that the problem was brought to Unicode consortium > before, and the answer was to just use the underline styling, as it is > apparently equivalent, but I do not think it is. Combining character sequences are not "styling." Combining character sequences are plain text. They are not the same as marking a letter or word or paragraph in your word processor and clicking a button to make that text bold or italic or underlined. In layman's terms, each combining sequence (base character plus any number of combining characters) should be treated as a unit, regardless of whether the sequence has been assigned a name. So these sequences are indeed equivalent to the APL-specific "underlined letter" characters used in non-Unicode systems. > Underline styling usually connects the line from one letter to another > l̲i̲k̲e̲ ̲t̲h̲i̲s̲.̲ 
The under-bar characters do not do such connecting, > and are actually only for capital letters. so it would look more > L̲ I̲ K̲ E̲ ̲ T̲ H̲ I̲ S̲ (I added the spaces for dramatic effect). TUS 7.0, Section 7.9 does say: > > The characters U+0332 COMBINING LOW LINE, U+0333 COMBINING DOUBLE LOW > LINE, U+0305 COMBINING OVERLINE, and U+033F COMBINING DOUBLE OVERLINE > are intended to connect on the left and right. In that case, despite the text in Section 22.7 that Ken quoted, it seems that U+0331 COMBINING MACRON might be a better choice for APL "underlined letters" than U+0332 COMBINING LOW LINE. Compare A̱ḆC̱ with A̲B̲C̲, noting that your font and rendering engine mileage may vary. "Voting again" to change one of the basic rules of Unicode, on the basis that "perhaps feelings about the under-bar characters have changed since then," is not expected to be an option, as David said. > Then maybe we could work off that as a pseudo-standard? Neither named nor unnamed character sequences are a "pseudo-standard." Both are part of the Unicode Standard. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From neil at tonal.clara.co.uk Mon Aug 17 13:02:50 2015 From: neil at tonal.clara.co.uk (Neil Harris) Date: Mon, 17 Aug 2015 19:02:50 +0100 Subject: APL Under-bar Characters In-Reply-To: <20150817092315.665a7a7059d7ee80bb4d670165c8327d.91850dae39.wbe@email03.secureserver.net> References: <20150817092315.665a7a7059d7ee80bb4d670165c8327d.91850dae39.wbe@email03.secureserver.net> Message-ID: <55D221CA.5030807@tonal.clara.co.uk> On 17/08/15 17:23, Doug Ewell wrote: > > In that case, despite the text in Section 22.7 that Ken quoted, it seems > that U+0331 COMBINING MACRON might be a better choice for APL > "underlined letters" than U+0332 COMBINING LOW LINE. Compare A̱ḆC̱ > with A̲B̲C̲, noting that your font and rendering engine mileage may > vary. 
> > "Voting again" to change one of the basic rules of Unicode, on the basis > that "perhaps feelings about the under-bar characters have changed since > then," is not expected to be an option, as David said. > Doug is right. One small correction: U+0331 is COMBINING MACRON BELOW, not COMBINING MACRON. Wikipedia has an excellent article on this topic: https://en.wikipedia.org/wiki/Macron_below -- Neil From doug at ewellic.org Mon Aug 17 13:14:37 2015 From: doug at ewellic.org (Doug Ewell) Date: Mon, 17 Aug 2015 11:14:37 -0700 Subject: APL Under-bar Characters Message-ID: <20150817111437.665a7a7059d7ee80bb4d670165c8327d.edd06ccc55.wbe@email03.secureserver.net> Neil Harris wrote: > One small correction: U+0331 is COMBINING MACRON BELOW, not COMBINING > MACRON. Yes, thank you. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From charupdate at orange.fr Mon Aug 17 15:02:04 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 17 Aug 2015 22:02:04 +0200 (CEST) Subject: APL Under-bar Characters Message-ID: <2040787231.15099.1439841724457.JavaMail.www@wwinf1g37> On 17 Aug 2015 at 18:34, Doug Ewell wrote [included subsequent exchange]: > > I have heard that the problem was brought to Unicode consortium > > before, and the answer was to just use the underline styling, as it is > > apparently equivalent, but I do not think it is. > > Combining character sequences are not "styling." Combining character > sequences are plain text. They are not the same as marking a letter or > word or paragraph in your word processor and clicking a button to make > that text bold or italic or underlined. > > In layman's terms, each combining sequence (base character plus any > number of combining characters) should be treated as a unit, regardless > of whether the sequence has been assigned a name. So these sequences are > indeed equivalent to the APL-specific "underlined letter" characters > used in non-Unicode systems. 
> > > Underline styling usually connects the line from one letter to another > > l̲i̲k̲e̲ ̲t̲h̲i̲s̲.̲ The under-bar characters do not do such connecting, > > and are actually only for capital letters. so it would look more > > L̲ I̲ K̲ E̲ ̲ T̲ H̲ I̲ S̲ (I added the spaces for dramatic effect). > > TUS 7.0, Section 7.9 does say: > > > > The characters U+0332 COMBINING LOW LINE, U+0333 COMBINING DOUBLE LOW > > LINE, U+0305 COMBINING OVERLINE, and U+033F COMBINING DOUBLE OVERLINE > > are intended to connect on the left and right. > > In that case, despite the text in Section 22.7 that Ken quoted, it seems > that U+0331 COMBINING MACRON BELOW might be a better choice for APL > "underlined letters" than U+0332 COMBINING LOW LINE. Compare A̱ḆC̱ > with A̲B̲C̲, noting that your font and rendering engine mileage may > vary. 
What I've done in Temp is not really working there; it is rather fetching the data from where MSKLC stores it for a few seconds. This is independent of the working directory. Creating a new folder in Temp in advance helps in getting the data. We can also wait for the amd64 folder, or the i386 or ia64 folders (which appear first), or wow64 (which comes last), and select it before copying and pasting somewhere else. I just fear we haven't enough time for this procedure, so creating the folder beforehand is a way to ensure that we get the files within the allotted time. These custom samples help to complete the WDK keyboard layout source samples: the WDK samples being for current keyboard layouts (US English, Greek, French, German, Japanese), they don't include ligature tables. A working practice is IMHO to pack the maximum into an MSKLC layout, get the C sources and the installation package, then edit the sources and recompile using the WDK. Eventually we need to install the MSKLC layout first, then replace the driver in the System32 folder and reboot. This is a way to develop enhanced layouts, with chained dead keys, increased numerical keypad mapping with more code units and a numpad accessible also while Fn is pressed (on compact keyboards), more or different modifiers, and so on. To do this, not much programming knowledge is needed. As Richard calls it, we can ape the code, looking up kbd.h and winuser.h in the WDK or the MSKLC for scan codes and virtual key names. However, aping the split allocation tables and the reduced numpad digits (without shift states) is strongly discouraged. The best way is to unify the allocation tables, with the modification of some other code lines which that implies. I'm hopeful that this helps implement more of Unicode at the input level. But this is only *one* way to do so, and today it certainly isn't any longer the most performant one. 
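[Editorial aside: the race against MSKLC's cleanup described above can be automated rather than won by hand with Ctrl+A/C/V. The sketch below is a generic snapshot helper meant to be called in a polling loop; the MSKLC Temp\amd64 path is taken from the post, so the demo simply exercises the helper on a throwaway local directory.]

```python
import shutil
import tempfile
from pathlib import Path

def snapshot_files(watch_dir: Path, dest_dir: Path) -> list:
    """Copy every file currently present in watch_dir into dest_dir.

    Intended to be polled repeatedly while MSKLC builds, e.g. on
    %LOCALAPPDATA%\\Temp\\amd64 (path per the post above), so the
    sources are captured before MSKLC deletes them.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for f in sorted(watch_dir.glob("*")):
        if f.is_file():
            shutil.copy2(f, dest_dir / f.name)
            copied.append(f.name)
    return copied

# Demo on a throwaway directory standing in for Temp\amd64:
work = Path(tempfile.mkdtemp())
(work / "amd64").mkdir()
(work / "amd64" / "layout.c").write_text("/* driver source */")
print(snapshot_files(work / "amd64", work / "saved"))  # ['layout.c']
```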
Marcel On 17 Aug 2015 at 13:59, I wrote: > On 07 Aug 2015, at 20:54, Richard Wordingham wrote: > > > What we're waiting for is a guide we can follow, or some code we can ape. > > Since yesterday I know a very simple way to get the source code (in C) of any MSKLC layout. > > While the build is done, we must wait for the four files appearing in an ad hoc created "amd64" subdirectory in the Temporary Files folder, in the hidden Application Data directory: C:\Users\user\AppData\Local\Temp\amd64 > As soon as the four files are visible in the Explorer, we can press Ctrl+A, Ctrl+C, Ctrl+V. This must be done rapidly, in order to get a copy before their deletion by MSKLC a few seconds later. > > If we notice that during the build, three other temporary folders are created by MSKLC and deleted if empty, we may wish to know that the four files are strictly identical in all four folders. This has been verified on a simple layout, using the (very useful) comparison tool of the ConTEXT text editor. > > Best regards, > > Marcel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From alexweiner at alexweiner.com Mon Aug 17 17:32:37 2015 From: alexweiner at alexweiner.com (alexweiner at alexweiner.com) Date: Mon, 17 Aug 2015 15:32:37 -0700 Subject: APL Under-bar Characters Message-ID: <20150817153237.e74b0ce91403bfe413f98785c6a226af.d64e5e188d.wbe@email06.secureserver.net> An HTML attachment was scrubbed... 
URL: From olopierpa at gmail.com Mon Aug 17 17:48:16 2015 From: olopierpa at gmail.com (Pierpaolo Bernardi) Date: Tue, 18 Aug 2015 00:48:16 +0200 Subject: APL Under-bar Characters In-Reply-To: <20150817153237.e74b0ce91403bfe413f98785c6a226af.d64e5e188d.wbe@email06.secureserver.net> References: <20150817153237.e74b0ce91403bfe413f98785c6a226af.d64e5e188d.wbe@email06.secureserver.net> Message-ID: On Tue, Aug 18, 2015 at 12:32 AM, wrote: > Hi Doug, > > I think I am going to suggest that GNUAPL use > http://www.unicode.org/Public/UCD/latest/ucd/NamedSequences.txt > as previously suggested as it seems like it may provide a way for GNUAPL to > support characters with under-bars, and ease all our parsing problems. How can giving a name to a sequence allow you to change your parser in ways that you couldn't without an official name to the sequence? From alexweiner at alexweiner.com Mon Aug 17 21:57:22 2015 From: alexweiner at alexweiner.com (alexweiner at alexweiner.com) Date: Mon, 17 Aug 2015 19:57:22 -0700 Subject: APL Under-bar Characters Message-ID: <20150817195722.e74b0ce91403bfe413f98785c6a226af.489356c3b1.wbe@email06.secureserver.net> An HTML attachment was scrubbed... URL: From prosfilaes at gmail.com Mon Aug 17 23:45:23 2015 From: prosfilaes at gmail.com (David Starner) Date: Tue, 18 Aug 2015 04:45:23 +0000 Subject: APL Under-bar Characters In-Reply-To: <20150817195722.e74b0ce91403bfe413f98785c6a226af.489356c3b1.wbe@email06.secureserver.net> References: <20150817195722.e74b0ce91403bfe413f98785c6a226af.489356c3b1.wbe@email06.secureserver.net> Message-ID: On Mon, Aug 17, 2015 at 8:03 PM wrote: > Pierpaolo, > > You make a very good observation. You are essentially asking the question > that began the whole discussion. This is covered in depth in the gnuapl > mailing list. 
You can go their archive, and just search my name :) > > Since it seems that all hope of adding characters is lost, I think the > next best goal would be to try an reach some sort of semblance between the > Unicode Consortium and a nebulous group of people (APLers) who really > believe that the uppercase under-bar letters are atomic and different than > an underlined uppercase letters. > There are many languages, particularly Native American languages, given written form in the typewriter era that use letters with under-bar as part of their alphabet. And the underbar is no different from the cedilla, the acute and grave accents, the umlaut or many other modifiers used to make new characters in languages across the globe. There are single code-point versions of characters like ?, but that's historical coincidence, and they are equivalent to the two code-point versions. Arguing atomicity is missing the point; A? is as atomic as ? in Unicode's eyes. -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Tue Aug 18 02:32:01 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Tue, 18 Aug 2015 09:32:01 +0200 (CEST) Subject: APL Under-bar Characters In-Reply-To: References: <20150817195722.e74b0ce91403bfe413f98785c6a226af.489356c3b1.wbe@email06.secureserver.net> Message-ID: <2034236760.3262.1439883121516.JavaMail.www@wwinf1f13> On 18 Aug 2015 at 06:56, David Starner wrote: > There are many languages, particularly Native American languages, given written form in the typewriter era that use letters with under-bar as part of their alphabet. And the underbar is no different from the cedilla, the acute and grave accents, the umlaut or many other modifiers used to make new characters in languages across the globe. There are single code-point versions of characters like ?, but that's historical coincidence, and they are equivalent to the two code-point versions. Arguing atomicity is missing the point; A? 
is as atomic as ? in Unicode's eyes. IMHO the problem arose from GNU APL implementing Unicode while still hesitating (and seemingly even being about to abandon it). I just picked one e-mail out of the archives (following Alex Weiner's invitation) http://lists.gnu.org/archive/html/bug-apl/2015-08/msg00047.html and have no time to browse them all, but as I must implement APL on the keyboard along with universal Latin, I'm interested in deciphering how GNU APL views characters. IMO the way Unicode worked out to feasibly encode all the characters in the world, with decomposition sequences and with precomposed characters retained only for backward compatibility's sake, is at odds with GNU APL sticking with the inherited model. This antagonism may be exacerbated by GNU being part of the Free Software Movement, as opposed to the business model of the companies of which Unicode is a consortium. This may partly explain the tone of one part of this thread (except for my own comment). So it could really be a good idea to put GNU APL at ease with Unicode. If underbar letters are for the sole use of GNU APL, their implementation and font support will be catered for by that organization, and it would be enough to discourage their use outside of APL to address the security issues. However, Ken Whistler explained clearly [http://www.unicode.org/mail-arch/unicode-ml/y2015-m08/0122.html] that today APL would benefit from updating to the current character model. To make this plausible, I suggest considering that free software and proprietary software, rather than being antagonistic, are complementary. I hope this (like other people's contributions on this thread) is a constructive view that helps resolve the differences, given that particular requests cannot be dealt with fully as long as the underlying philosophy isn't satisfactorily taken into account. Marcel -------------- next part -------------- An HTML attachment was scrubbed...
URL: From verdy_p at wanadoo.fr Tue Aug 18 03:09:43 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 18 Aug 2015 10:09:43 +0200 Subject: Custom keyboard source samples (was: Re: Windows keyboard restrictions) In-Reply-To: <749936537.8372.1439812286240.JavaMail.www@wwinf1j14> References: <749936537.8372.1439812286240.JavaMail.www@wwinf1j14> Message-ID: It helps if you reduce the processor frequency (if you don't have a tool for that, use the energy control panel and set the power profile to maximum energy saving) just before clicking the button to build the package. I don't know why these C source files need to be deleted so fast when they could just remain in the same folder as the saved .klc file. Note that the tool builds several packages for several processor types, not just amd64. On 17 August 2015 at 13:59, "Marcel Schneider" wrote: > On 07 Aug 2015, at 20:54, Richard Wordingham wrote: > > > What we're waiting for is a guide we can follow, or some code we can ape. > > Since yesterday I know a very simple way to get the source code (in C) of > any MSKLC layout. > > While the build runs, we must wait for the four files to appear in an > "amd64" subdirectory created ad hoc in the Temporary Files folder, in the > hidden Application Data directory: C:\Users\user\AppData\Local\Temp\amd64 > As soon as the four files are visible in the Explorer, we can press > Ctrl+A, Ctrl+C, Ctrl+V. This must be done rapidly, in order to get a copy > before their deletion by MSKLC a few seconds later. > > If we notice that during the build, three other temporary folders are > created by MSKLC and deleted if empty, we may wish to know that the four > files are strictly identical in all four folders. This has been verified on > a simple layout, using the (very useful) comparison tool of the ConTEXT > text editor. > > Best regards, > > Marcel > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From charupdate at orange.fr Tue Aug 18 04:42:01 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Tue, 18 Aug 2015 11:42:01 +0200 (CEST) Subject: Custom keyboard source samples (was: Re: Windows keyboard restrictions) In-Reply-To: References: <749936537.8372.1439812286240.JavaMail.www@wwinf1j14> Message-ID: <1050352786.6768.1439890921356.JavaMail.www@wwinf1f13> On 18 Aug 2015 at 10:09, Philippe Verdy wrote: > It helps if you reduce the processor frequency (if you don't have a tool for that, use the energy control panel and set the power profile to maximum energy saving) just before clicking the button to build the package. That's a very good idea. I've currently throttled down the CPU to half performance (my computer being a netbook, and the processor heats up everything around it). This is very important to note for users of desktop machines working at high performance. One really needs to slow down for this operation. > I don't know why these C source files need to be deleted so fast when they could just remain in the same folder as the saved .klc file. That is, again, a very good question, and I have often asked it myself. The short answer is: there's no need. The long answer is: as emerges from decrypting* the blog http://www.siao2.com, Michael Scott Kaplan, the author of MSKLC, tried to meet the users' needs as fully as possible* and what users were expected to wish to have on their keyboard, and to create a smooth UI with maximum security.* Well, editing the C source and the header file oneself is about the opposite. So Michael didn't want his users to bother with that (which would inevitably have occurred if he had put the files within reach of the end user). Now, since they're meant to stay hidden, the best is to delete them right after use; but nevertheless, for a long time I couldn't help thinking something I won't post here anymore because it's about software policies.* > Note that the tool builds several packages for several processor types, not just amd64.
I had to choose one folder. amd64 comes first in the alphabet, and it comes third in the MSKLC build process, so we have enough time to switch back to the Explorer window where the desired files are awaited. * Like many other people who wished that MSKLC supported chained dead keys, I often wished that Windows supported chained dead keys (as GNU/Linux does), thinking that it didn't, given that MSKLC doesn't offer this option. When I learned that Windows does, I deduced that MSKLC was purposely restricted to prevent users from making any "too useful" keyboard layouts, until, thanks to Doug Ewell drawing our attention to it, I learned of the existence of Michael Kaplan's blog, and finally found this blog post on it: http://www.siao2.com/2004/12/17/323257.aspx I still wonder why one should not like to type two dead keys to get double-diacriticized letters, but I agree that the new way of typing text is with combining diacritics. Now I see that it would have been enough to type "who is the author of MSKLC" into the Bing search bar to learn the name from the second result, and the story on his own blog from the fifth result... http://www.siao2.com/2013/10/04/10454264.aspx I missed that! October 2013 was before the time I began to really bother with keyboard layouts. But it was the time I should have begun to. Sorry, Michael! Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From eik at iki.fi Tue Aug 18 05:55:53 2015 From: eik at iki.fi (Erkki I Kolehmainen) Date: Tue, 18 Aug 2015 13:55:53 +0300 Subject: APL Under-bar Characters In-Reply-To: <2034236760.3262.1439883121516.JavaMail.www@wwinf1f13> References: <20150817195722.e74b0ce91403bfe413f98785c6a226af.489356c3b1.wbe@email06.secureserver.net> <2034236760.3262.1439883121516.JavaMail.www@wwinf1f13> Message-ID: <000001d0d9a4$740e42e0$5c2ac8a0$@fi> Mr. Schneider Free Software Movement or not makes no difference.
Furthermore, please consult the membership roster of Unicode before making statements on what Unicode is a consortium of. You also state: If underbar letters are for the sole use of GNU APL, their implementation and font support will be catered for by this organization, and it would be enough to discourage their use outside of APL to meet the security issues. If composed letters are not acceptable, for whatever and however incomprehensible reason, there is a perfect solution: PUA. Sincerely, Erkki I. Kolehmainen Tilkankatu 12 A 3, 00300 Helsinki, Finland Mob: +358400825943, Tel: +358943682643, Fax: +35813318116 -------------- next part -------------- An HTML attachment was scrubbed... URL: From alexweiner at alexweiner.com Tue Aug 18 08:18:47 2015 From: alexweiner at alexweiner.com (alexweiner at alexweiner.com) Date: Tue, 18 Aug 2015 06:18:47 -0700 Subject: APL Under-bar Characters Message-ID: <20150818061847.e74b0ce91403bfe413f98785c6a226af.fada311151.mailapi@mailapi06.secureserver.net> "PUA"?
-------- Original Message -------- Subject: RE: APL Under-bar Characters From: "Erkki I Kolehmainen" Date: Aug 18, 2015 6:55 AM To: "'Marcel Schneider'" ,"'Unicode Mailing List'" CC: alexweiner at alexweiner.com Mr. Schneider Free Software Movement or not makes no difference. From leob at mailcom.com Tue Aug 18 09:38:42 2015 From: leob at mailcom.com (Leo Broukhis) Date: Tue, 18 Aug 2015 16:38:42 +0200 Subject: APL Under-bar Characters In-Reply-To: <20150818061847.e74b0ce91403bfe413f98785c6a226af.fada311151.mailapi@mailapi06.secureserver.net> References: <20150818061847.e74b0ce91403bfe413f98785c6a226af.fada311151.mailapi@mailapi06.secureserver.net> Message-ID: http://www.acronymfinder.com/Information-Technology/PUA.html On Tue, Aug 18, 2015 at 3:18 PM, wrote: > "PUA"? From alexweiner at alexweiner.com Tue Aug 18 09:55:59 2015 From: alexweiner at alexweiner.com (alexweiner at alexweiner.com) Date: Tue, 18 Aug 2015 07:55:59 -0700 Subject: APL Under-bar Characters Message-ID: <20150818075559.e74b0ce91403bfe413f98785c6a226af.1587c769a1.mailapi@mailapi06.secureserver.net> ah yes. I believe the "private use area" was also suggested and may provide a route to take -Alex -------- Original Message -------- Subject: Re: APL Under-bar Characters From: Leo Broukhis Date: Aug 18, 2015 10:38 AM To: alexweiner at alexweiner.com CC: eik at iki.fi, charupdate at orange.fr, "unicode Unicode Discussion" http://www.acronymfinder.com/Information-Technology/PUA.html

On Tue, Aug 18, 2015 at 3:18 PM, wrote:
> "PUA"?
>
From doug at ewellic.org Tue Aug 18 10:11:33 2015 From: doug at ewellic.org (Doug Ewell) Date: Tue, 18 Aug 2015 08:11:33 -0700 Subject: APL Under-bar Characters Message-ID: <20150818081133.665a7a7059d7ee80bb4d670165c8327d.d3a7f92947.wbe@email03.secureserver.net> wrote: > Since it seems that all hope of adding characters is lost, I think the > next best goal would be to try and reach some sort of understanding between > the Unicode Consortium and a nebulous group of people (APLers) who > really believe that the uppercase under-bar letters are atomic and > different from underlined uppercase letters. This reminds me of an argument that has occasionally been made that Unicode should encode Latin "majuscules" separately from "capital letters" because the semantic functions of the two are different (typographical choice vs. orthographic rule). Unicode does not provide distinct encodings of "a" based on pronunciation (chaos, cat, star). It does not provide distinct encodings of "uppercase A" based on the reason for using the uppercase instead of the lowercase (HAPPY, Adam). And it also does not provide distinct encodings of "uppercase A with underline" based on the reason for underlining the letter. > Some sort of list, no matter how "unofficial", is better than no list > at all, right? Wouldn't the Unicode Consortium be the place for such a > list, such as in NamedSequences.txt ? It is NOT necessary for a combining sequence to be assigned a name, either by the Unicode Technical Committee or by anyone else, in order to use it. Note that of the two sequences: A̱ <0041, 0331> A̲ <0041, 0332> neither sequence is listed in NamedSequences.txt, yet I can use them without limitation in this email and in plain text generally. I'm not sure the general concept of combining sequences is well understood in this thread. -- Doug Ewell | http://ewellic.org | Thornton, CO ????
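Doug's point, that a combining sequence needs no registered name to be usable, can be checked mechanically. The following Python sketch (illustrative only, not part of the thread; variable names are invented) verifies that neither <0041, 0331> nor <0041, 0332> composes to a single precomposed code point under NFC, and that both are ordinary two-code-point plain-text strings:

```python
import unicodedata

# The two sequences cited above: neither appears in NamedSequences.txt,
# yet both are valid combining sequences usable in any plain text.
a_macron_below = "\u0041\u0331"  # A + COMBINING MACRON BELOW
a_low_line = "\u0041\u0332"      # A + COMBINING LOW LINE

for seq in (a_macron_below, a_low_line):
    # Two code points, and NFC leaves them alone: Unicode encodes no
    # precomposed "capital A with line below", so the sequence is
    # already its own canonical form.
    assert len(seq) == 2
    assert unicodedata.normalize("NFC", seq) == seq
    # The second code point is a nonspacing combining mark (gc=Mn).
    assert unicodedata.category(seq[1]) == "Mn"
```

No lookup in any registry is involved; the sequences work simply because the combining marks are defined characters.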
From asmus-inc at ix.netcom.com Tue Aug 18 10:34:52 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Tue, 18 Aug 2015 08:34:52 -0700 Subject: APL Under-bar Characters In-Reply-To: <20150818061847.e74b0ce91403bfe413f98785c6a226af.fada311151.mailapi@mailapi06.secureserver.net> References: <20150818061847.e74b0ce91403bfe413f98785c6a226af.fada311151.mailapi@mailapi06.secureserver.net> Message-ID: <55D3509C.3050404@ix.netcom.com> An HTML attachment was scrubbed... URL: From tom at bluesky.org Tue Aug 18 10:45:01 2015 From: tom at bluesky.org (Tom Gewecke) Date: Tue, 18 Aug 2015 11:45:01 -0400 Subject: APL Under-bar Characters In-Reply-To: <20150818081133.665a7a7059d7ee80bb4d670165c8327d.d3a7f92947.wbe@email03.secureserver.net> References: <20150818081133.665a7a7059d7ee80bb4d670165c8327d.d3a7f92947.wbe@email03.secureserver.net> Message-ID: <3A4F5254-7C02-4ACC-BF4C-E975A20BB45D@bluesky.org> On Aug 18, 2015, at 11:11 AM, Doug Ewell wrote: > > It is NOT necessary for a combining sequence to be assigned a name, > either by the Unicode Technical Committee or by anyone else, in order to > use it. Note that of the two sequences: > > A̱ <0041, 0331> > A̲ <0041, 0332> > > neither sequence is listed in NamedSequences.txt, yet I can use them > without limitation in this email and in plain text generally. I guess the question is whether having a named sequence would somehow make it easier for the gnu apl folks to add something to their system so that their string length function sees such a sequence as having a length of "1"?
From kenwhistler at att.net Tue Aug 18 11:22:53 2015 From: kenwhistler at att.net (Ken Whistler) Date: Tue, 18 Aug 2015 09:22:53 -0700 Subject: APL Under-bar Characters In-Reply-To: <20150817092315.665a7a7059d7ee80bb4d670165c8327d.91850dae39.wbe@email03.secureserver.net> References: <20150817092315.665a7a7059d7ee80bb4d670165c8327d.91850dae39.wbe@email03.secureserver.net> Message-ID: <55D35BDD.2020006@att.net> Returning to a historical note on the glyphic forms and the question of combining low lines or combining macrons below... admittedly a side note on this thread, the *original* identification of these APL uppercase Latin letters, at least in their IBM implementations, was clearly as uppercase (italicized) Latin letters with *underscores* -- not with macrons below. The identification of the entity we have been talking about, for example, in IBM documentation is *LA480000*, described as: "A Line Below Capital/A Underscore (APL)" It is shown in the documentation with an *underscore*, with the scoring reaching to match the outside serifs of the "A", and clearly not with a macron below. Furthermore, the glyph character identification system from that era (late 1980s) contained a value for a "Line below" diacritic (that's the "48" in the glyph identifier above), but no provision for macrons *below* a letter. Consider also that the keyboards and character sets in use at the time had underscores (low lines), but macrons below were rare diacritics, and not in anybody's character set at the time. The appearance of underscored characters in printed material at the time would typically involve a gap between the underscoring on adjacent characters, but that was typically a result of discrete type elements in the print trains. It wasn't because conceptually the underscores were being treated as deliberately short diacritics that should *not* connect.
The underscores were more likely to connect on screens, but that was typically the result of the very limited scale of the character generator pixel rasters for the characters. You just turned on all the pixels in the bottom row of the box -- and there you had your underscore! The documentation that Doug cites from Section 7.9 of TUS was written to clarify that the *general* intent, when people use underscoring or overscoring diacritics, is that they should connect laterally. That is to contrast with macron diacritics, above *or* below, which of course do not connect laterally to adjacent macrons. But without a very specialized font, it is very difficult to do lateral connection correctly for variable width fonts. See examples below for Helvetica, Times, and Courier (although your mileage may differ, depending on your email client fonts, as Doug noted): A?B?O?M?I?N?A?T?I?O?N? vs. _ABOMINATION_ A?B?O?M?I?N?A?T?I?O?N? vs. _ABOMINATION_ ?A?B?O?M?I?N?A?T?I?O?N vs. _ABOMINATION_ Sequences of combining low lines after letters on the left, styling with underscoring on the right. Only for fixed width fonts does this really work "as designed", so to speak. Hence, the general recommendation that if what you are trying to do is underscore (or overscore) a sequence of text, by all means do it with styling, and not with sequences of individual diacritics on letters. But the fact that underscores used as diacritics on letters are basically a 20th century typewriter hack that persisted into early computer character sets -- and the fact that they don't work very well, or look very elegant with most modern, digital, variable width fonts, interestingly has led to the rise of the macron below, very much along the lines of Doug's suggestion cited below. 
What used to be a rare diacritic is gaining in popularity in actual use, precisely because it "looks like" a diacritic on the letter in most fonts nowadays, and because it *doesn't* connect, more or less randomly, with neighboring diacritics on adjacent letters, the way the low line diacritic can. And while I am generally sympathetic to this changeover, and suspect it is probably the best outcome for cases like the use of line below diacritics on Latin letters in Semitic transliteration, I don't think it is the best recommendation for this particular case of a legacy usage for APL. The APL letters with underscores clearly *are* historically connected precisely to the underscore, and should probably stay represented accordingly. If APL aficionados don't prefer the underscores visually connecting between adjacent capital Latin letters in APL text material presented that way, then that can be addressed in the APL specialist fonts. After all, such fonts already exist, precisely to provide best display for all the other specialized symbols of APL. --Ken On 8/17/2015 9:23 AM, Doug Ewell wrote: > TUS 7.0, Section 7.9 does say: >> The characters U+0332 COMBINING LOW LINE, U+0333 COMBINING DOUBLE LOW >> LINE, U+0305 COMBINING OVERLINE, and U+033F COMBINING DOUBLE OVERLINE >> are intended to connect on the left and right. > In that case, despite the text in Section 22.7 that Ken quoted, it seems > that U+0331 COMBINING MACRON [BELOW] might be a better choice for APL > "underlined letters" than U+0332 COMBINING LOW LINE. Compare A̱ḆC̱ > with A̲B̲C̲, noting that your font and rendering engine mileage may > vary. > > > > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From doug at ewellic.org Tue Aug 18 11:23:53 2015 From: doug at ewellic.org (Doug Ewell) Date: Tue, 18 Aug 2015 09:23:53 -0700 Subject: APL Under-bar Characters Message-ID: <20150818092353.665a7a7059d7ee80bb4d670165c8327d.bc667e26d2.wbe@email03.secureserver.net> Tom Gewecke wrote: > I guess the question is whether having a named sequence would somehow > make it easier for the gnu apl folks to add something to their system > so that their string length function sees such a sequence as having a > length of "1"? I don't see why that would, or should, be the determining factor. A more robust approach for their purposes might be to teach ⍴ to exclude combining characters (gc=Mn) when counting the "size" of a string. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From doug at ewellic.org Tue Aug 18 11:29:13 2015 From: doug at ewellic.org (Doug Ewell) Date: Tue, 18 Aug 2015 09:29:13 -0700 Subject: APL Under-bar Characters Message-ID: <20150818092913.665a7a7059d7ee80bb4d670165c8327d.e2ba804e00.wbe@email03.secureserver.net> Ken Whistler wrote: > Returning to a historical note on the glyphic forms and the question > of combining low lines or combining macrons below... admittedly a > side note on this thread, the *original* identification of these APL > uppercase Latin letters, at least in their IBM implementations, was > clearly as uppercase (italicized) Latin letters with *underscores* -- > not with macrons below. > ... I absolutely stand corrected on this. U+0332 it is. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? 
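Doug's suggested approach -- counting the "size" of a string while skipping combining characters with General_Category Mn -- can be sketched in a few lines of Python. This is purely illustrative (it is not APL, and `visual_length` is a name invented here, not anyone's actual API):

```python
import unicodedata

def visual_length(s):
    """Count the "characters" of s, excluding nonspacing marks (gc=Mn),
    along the lines of the suggestion above.  Illustrative only."""
    return sum(1 for c in s if unicodedata.category(c) != "Mn")

# An APL-style under-bar name: each letter followed by
# U+0332 COMBINING LOW LINE.
name = "".join(c + "\u0332" for c in "ABC")
print(len(name))            # 6 code points
print(visual_length(name))  # 3 visible letters
```

A real APL length primitive, by contrast, counts array elements (code units), which is exactly why the replies that follow argue such a change is unlikely to happen.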
From kenwhistler at att.net Tue Aug 18 11:35:58 2015 From: kenwhistler at att.net (Ken Whistler) Date: Tue, 18 Aug 2015 09:35:58 -0700 Subject: APL Under-bar Characters In-Reply-To: <20150818092353.665a7a7059d7ee80bb4d670165c8327d.bc667e26d2.wbe@email03.secureserver.net> References: <20150818092353.665a7a7059d7ee80bb4d670165c8327d.bc667e26d2.wbe@email03.secureserver.net> Message-ID: <55D35EEE.70205@att.net> On 8/18/2015 9:23 AM, Doug Ewell wrote: > Tom Gewecke wrote: > >> I guess the question is whether having a named sequence would somehow >> make it easier for the gnu apl folks to add something to their system >> so that their string length function sees such a sequence as having a >> length of "1"? > I don't see why that would, or should, be the determining factor. A more > robust approach for their purposes might be to teach ⍴ to exclude > combining characters (gc=Mn) when counting the "size" of a string. > > And it seems to me that that is *very* unlikely to happen, precisely because ⍴ is so deeply embedded in the array and vector logic of APL. That is counting the data size of arrays of "characters" (i.e., code units). If somebody tried to somehow teach ⍴ to do something different about characters, changing the concept of array of code units into something more akin to what we think of as Unicode strings, that would end up being a *different* language -- not APL! --Ken From doug at ewellic.org Tue Aug 18 11:45:17 2015 From: doug at ewellic.org (Doug Ewell) Date: Tue, 18 Aug 2015 09:45:17 -0700 Subject: APL Under-bar Characters Message-ID: <20150818094517.665a7a7059d7ee80bb4d670165c8327d.2f37adb81c.wbe@email03.secureserver.net> Ken Whistler wrote: >> A more >> robust approach for their purposes might be to teach ⍴ to exclude >> combining characters (gc=Mn) when counting the "size" of a string. > > And it seems to me that that is *very* unlikely to happen, precisely > because ⍴ is so deeply embedded in the array and vector logic of APL. 
> > That is counting the data size of arrays of "characters" (i.e., code > units). If somebody tried to somehow teach ⍴ to do something different > about characters, changing the concept of array of code units into > something more akin to what we think of as Unicode strings, that > would end up being a *different* language -- not APL! Then we're back to the central point that Alex Weiner originally expressed, in arguing for the encoding of precomposed letters with underbar: > The string length functionality would view an 'A' code point combined > with an '_' code point as an item that has two elements, while > something that looks like 'A' Should be atomic, and return a length > of one. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From kenwhistler at att.net Tue Aug 18 12:13:15 2015 From: kenwhistler at att.net (Ken Whistler) Date: Tue, 18 Aug 2015 10:13:15 -0700 Subject: APL Under-bar Characters In-Reply-To: <20150818094517.665a7a7059d7ee80bb4d670165c8327d.2f37adb81c.wbe@email03.secureserver.net> References: <20150818094517.665a7a7059d7ee80bb4d670165c8327d.2f37adb81c.wbe@email03.secureserver.net> Message-ID: <55D367AB.40809@att.net> On 8/18/2015 9:45 AM, Doug Ewell wrote: > Ken Whistler wrote: > > Then we're back to the central point that Alex Weiner originally > expressed, in arguing for the encoding of precomposed letters with > underbar: >> The string length functionality would view an 'A' code point combined >> with an '_' code point as an item that has two elements, while >> something that looks like 'A' Should be atomic, and return a length >> of one. > Precisely. And instead of pushing for the impossible, the correct solution here involves dividing and conquering: 1. 
If the issue is just the *presentation* of legacy APL materials showing the traditional IBM uppercase italic letters with underscores, then fix some fonts, use the combining character sequences (or styling, makes no matter), and edit away with existing characters, and with no implications for APL implementations. 2. If the issue is *augmentation* of APL implementations to have an additional A-Z set of character symbols, beyond the upper- and lowercase ones apparently supported by most APL fonts and implementations, then pick one of the existing, encoded, mathematical alphabets and have done with it. There are 13 to choose from! The sans-serif italic set might make a nice choice. And for the cherry on top, in the APL fonts, draw a non-connecting underline beneath your 26 new letters to please traditionalists. The reason to do #2 is that the implementations of APL, because of the very nature of the language, need their "characters" to have a fixed size, so that each element of a data array of "characters" is exactly one "character". The oopsie for #2, of course, is that if your APL implementation is actually using 16-bit code *units* for your characters, it is still stuck in a UCS-2 world, and can't handle UTF-16, because that once again breaks the ironclad rule that 1 "character" equals one data element in the array. The fix for the oopsie is to upgrade the APL implementations to UTF-32. At that point, the supplementary character problem goes away, and APL could freely augment its sets of A-Z symbols with the mathematical alphanumeric symbols without further ado. What people should *not* be doing is insisting on being stuck in 1970, as if everybody were still doing APL with IBM Selectric typewriter terminals hooked up to IBM/360 mainframes using an EBCDIC APL character set, and that everything in the APL program text has to look precisely the way it did in 1970. 
--Ken From miszhan3ys at gmail.com Tue Aug 18 18:20:01 2015 From: miszhan3ys at gmail.com (Emma Haneys) Date: Wed, 19 Aug 2015 07:20:01 +0800 Subject: a suggestion new emoji . Message-ID: hello dear unicode , i just wondering if i can suggest a new emoji . hoppefully you can respone to me . i suggest one and only for fruit category . it is a durian . thanx From mark at kli.org Tue Aug 18 19:53:22 2015 From: mark at kli.org (Mark E. Shoulson) Date: Tue, 18 Aug 2015 20:53:22 -0400 Subject: a suggestion new emoji . In-Reply-To: References: Message-ID: <55D3D382.60501@kli.org> On 08/18/2015 07:20 PM, Emma Haneys wrote: > hello dear unicode , i just wondering if i can suggest a new emoji . > hoppefully you can respone to me . i suggest one and only for fruit > category . it is a durian . thanx Ah, durians. Kind of a cross between food and weaponry. ~mark From Shawn.Steele at microsoft.com Tue Aug 18 20:13:44 2015 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Wed, 19 Aug 2015 01:13:44 +0000 Subject: a suggestion new emoji . In-Reply-To: <55D3D382.60501@kli.org> References: <55D3D382.60501@kli.org> Message-ID: I'm sure Klingons love them! -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Mark E. Shoulson Sent: Tuesday, August 18, 2015 5:53 PM To: unicode at unicode.org Subject: Re: a suggestion new emoji . On 08/18/2015 07:20 PM, Emma Haneys wrote: > hello dear unicode , i just wondering if i can suggest a new emoji . > hoppefully you can respone to me . i suggest one and only for fruit > category . it is a durian . thanx Ah, durians. Kind of a cross between food and weaponry. ~mark From nikiselken at gmail.com Tue Aug 18 20:26:30 2015 From: nikiselken at gmail.com (Niki Selken) Date: Tue, 18 Aug 2015 18:26:30 -0700 Subject: a suggestion new emoji . In-Reply-To: References: <55D3D382.60501@kli.org> Message-ID: <9BA50979-8626-42C4-99FE-84720E257FB2@gmail.com> Touché! ?? 
Thanks, Niki Excuse my spelling, this is sent from my iPhone > On Aug 18, 2015, at 6:13 PM, Shawn Steele wrote: > > I'm sure Klingons love them! > > -----Original Message----- > From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Mark E. Shoulson > Sent: Tuesday, August 18, 2015 5:53 PM > To: unicode at unicode.org > Subject: Re: a suggestion new emoji . > >> On 08/18/2015 07:20 PM, Emma Haneys wrote: >> hello dear unicode , i just wondering if i can suggest a new emoji . >> hoppefully you can respone to me . i suggest one and only for fruit >> category . it is a durian . thanx > Ah, durians. Kind of a cross between food and weaponry. > > ~mark > From otto.stolz at uni-konstanz.de Wed Aug 19 06:36:46 2015 From: otto.stolz at uni-konstanz.de (Otto Stolz) Date: Wed, 19 Aug 2015 13:36:46 +0200 Subject: a suggestion new emoji . In-Reply-To: References: Message-ID: <55D46A4E.7090501@uni-konstanz.de> Hello Emma Haneys, On 19.08.2015 at 01:20, Emma Haneys wrote: > i just wondering if i can suggest a new emoji . > hoppefully you can respone to me . So far, you have only received derisive responses from the Unicode discussion list. This is because you have not understood how suggestions for Unicode characters work. Please read "http://www.unicode.org/faq/", in particular "http://www.unicode.org/faq/char_proposal.html". > i suggest one and only for fruit > category . it is a durian . You cannot suggest a new character just because it would be "nice to have". Rather, you have to supply evidence that an additional character really needs to be encoded, e. g. because it is already widely used in print and cannot be represented in Unicode. Best wishes, Otto Stolz From andrewcwest at gmail.com Wed Aug 19 07:23:22 2015 From: andrewcwest at gmail.com (Andrew West) Date: Wed, 19 Aug 2015 13:23:22 +0100 Subject: a suggestion new emoji . 
In-Reply-To: <55D46A4E.7090501@uni-konstanz.de> References: <55D46A4E.7090501@uni-konstanz.de> Message-ID: On 19 August 2015 at 12:36, Otto Stolz wrote: > > You cannot suggest a new character just because it would > be "nice to have". Rather, you have to supply evidence that > an additional character really needs to be encoded, e. g. > because it is already widely used in print and cannot be > represented in Unicode. Well that may once have been the case, but certainly isn't any longer with respect to emoji, especially emoji representing food and drink. I suggest Emma reads Unicode Technical Report 51 http://unicode.org/reports/tr51/ especially section 1.2 Encoding Considerations and Annex C Selection Factors, then start a petition to the Unicode Consortium on www.change.org, and when she has 10,000 signatures make a formal request to the UTC. Petitions don't guarantee acceptance, but widely-petitioned emoji such as taco, cheese wedge, paella and whisky tumbler have been successful. Andrew From mark at macchiato.com Wed Aug 19 09:19:27 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Wed, 19 Aug 2015 16:19:27 +0200 Subject: a suggestion new emoji . In-Reply-To: References: <55D46A4E.7090501@uni-konstanz.de> Message-ID: I'd agree about reading and following http://unicode.org/reports/tr51/#Selection_Factors. As far as petitions go, we take them with a sizable grain of salt. See http://unicode.org/reports/tr51/#Selection_Factors_Requested. In the particular cases you cite, we had sufficient evidence about prospective usage independent of petitions (which usually started after we had settled on the character anyway). Paella was a bit of an exception; I think the work that the petitioners did upfront helped to convince the subcommittee that there would be sufficient usage, and the main issues were around distinctiveness and generality. Mark *— Il meglio è 
l'inimico del bene —* On Wed, Aug 19, 2015 at 2:23 PM, Andrew West wrote: > On 19 August 2015 at 12:36, Otto Stolz wrote: > > > > You cannot suggest a new character just because it would > > be "nice to have". Rather, you have to supply evidence that > > an additional character really needs to be encoded, e. g. > > because it is already widely used in print and cannot be > > represented in Unicode. > > Well that may once have been the case, but certainly isn't any longer > with respect to emoji, especially emoji representing food and drink. > > I suggest Emma reads Unicode Technical Report 51 > http://unicode.org/reports/tr51/ especially section 1.2 Encoding > Considerations and Annex C Selection Factors, then start a petition to > the Unicode Consortium on www.change.org, and when she has 10,000 > signatures make a formal request to the UTC. Petitions don't > guarantee acceptance, but widely-petitioned emoji such as taco, cheese > wedge, paella and whisky tumbler have been successful. > > Andrew > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Wed Aug 19 07:55:23 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Wed, 19 Aug 2015 13:55:23 +0100 (BST) Subject: a suggestion new emoji . In-Reply-To: <55D46A4E.7090501@uni-konstanz.de> References: <55D46A4E.7090501@uni-konstanz.de> Message-ID: <3524460.27958.1439988923889.JavaMail.defaultUser@defaultHost> Otto Stolz wrote: > You cannot suggest a new character just because it would be "nice to have". Why not? The UNICORN FACE character has been encoded into regular Unicode following such a suggestion. I sent in a suggestion for an OKAPI to be encoded and was informed that my suggestion would be added to "the pile" of suggestions. It would be interesting to read what is listed in "the pile". It might be an interesting social history document of present times. 
Not just a list of items each suggested by many people, but a list of everything that has been suggested. Suppose that someone suggests encoding LOCOMOTIVE MALLARD WITH TENDER LIVERIED AS PRESERVED SEEN IN A DIRECT SIDEWAYS VIEW FROM HER LEFT with a glyph five times as wide as high. I remember that there was a trains font. Suppose that a mobile telephone manufacturer included in a mobile telephone a collection of characters one each for a number of specific steam locomotives. Returning to Emma's suggestion. I suggest to Emma the contacting of Unicode Inc. using the following form. http://www.unicode.org/reporting.html Hopefully Emma will receive an official reply and her suggestion will become added to "the pile". What being added to "the pile" presently means, or what it may become to mean, I do not know. Yet I suggest that making the suggestion on that form would be a potentially useful thing to do. I hope that Emma's suggestion is successful. William Overington 19 August 2015 From fantasai.lists at inkedblade.net Wed Aug 19 11:21:36 2015 From: fantasai.lists at inkedblade.net (fantasai) Date: Wed, 19 Aug 2015 09:21:36 -0700 Subject: Why doesn't Ideographic (ID) in UAX#14 have half-width katakana? In-Reply-To: <4A7F11C4-2F5B-4156-844C-C21E543FCFC7@apple.com> References: <553EEE6D.2020004@ga2.so-net.ne.jp> <553EFB2E.3010808@hiroshima-u.ac.jp> <55403267.9060202@att.net> <55438AFB.6020000@att.net> <55439AE4.4020109@hiroshima-u.ac.jp> <5543AE4B.5020904@att.net> <5543D03B.80603@ix.netcom.com> <55467CAF.4080401@ix.netcom.com> <4A7F11C4-2F5B-4156-844C-C21E543FCFC7@apple.com> Message-ID: <55D4AD10.1080104@inkedblade.net> On 05/04/2015 02:19 PM, Peter Edberg wrote: > I have been checking with various groups at Apple. The consensus here is that we would like to see the linebreak value for > halfwidth katakana changed to ID. Do we want all halfwidth kana changed to ID, or should there be some exception for the voicing marks (U+FF9E, U+FF9F) to forbid breaks before? 
~fantasai From kenwhistler at att.net Wed Aug 19 11:45:55 2015 From: kenwhistler at att.net (Ken Whistler) Date: Wed, 19 Aug 2015 09:45:55 -0700 Subject: Why doesn't Ideographic (ID) in UAX#14 have half-width katakana? In-Reply-To: <55D4AD10.1080104@inkedblade.net> References: <553EEE6D.2020004@ga2.so-net.ne.jp> <553EFB2E.3010808@hiroshima-u.ac.jp> <55403267.9060202@att.net> <55438AFB.6020000@att.net> <55439AE4.4020109@hiroshima-u.ac.jp> <5543AE4B.5020904@att.net> <5543D03B.80603@ix.netcom.com> <55467CAF.4080401@ix.netcom.com> <4A7F11C4-2F5B-4156-844C-C21E543FCFC7@apple.com> <55D4AD10.1080104@inkedblade.net> Message-ID: <55D4B2C3.2010000@att.net> I don't think that is the issue. U+FF9E/F are already lb=NS, which prevents line breaks before. The issue is instead loosening up the lb class for the halfwidth katakana syllables (from lb=AL to lb=ID), so that they *can* break the way the regular katakana syllables do. --Ken On 8/19/2015 9:21 AM, fantasai wrote: > On 05/04/2015 02:19 PM, Peter Edberg wrote: >> I have been checking with various groups at Apple. The consensus here >> is that we would like to see the linebreak value for >> halfwidth katakana changed to ID. > > Do we want all halfwidth kana changed to ID, or should there > be some exception for the voicing marks (U+FF9E, U+FF9F) to > forbid breaks before? > > ~fantasai > From wjgo_10009 at btinternet.com Wed Aug 19 11:38:56 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Wed, 19 Aug 2015 17:38:56 +0100 (BST) Subject: a suggestion new emoji . In-Reply-To: References: <55D46A4E.7090501@uni-konstanz.de> Message-ID: <4356528.46093.1440002336411.JavaMail.defaultUser@defaultHost> Andrew West wrote: > ..., and when she has 10,000 signatures make a formal request to the UTC. Where does the figure of 10,000 for the number of signatures come from please? That is a lot of people! What exactly are the rules? Does anybody really know these days! 
By comparison it needs ten signatures to stand in a United Kingdom Parliamentary Election. William Overington 19 August 2015 From wjgo_10009 at btinternet.com Wed Aug 19 12:10:40 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Wed, 19 Aug 2015 18:10:40 +0100 (BST) Subject: a suggestion new emoji . In-Reply-To: References: <55D46A4E.7090501@uni-konstanz.de> Message-ID: <8509498.48370.1440004240920.JavaMail.defaultUser@defaultHost> Mark Davis wrote: > As far as petitions go, we take them with a sizable grain of salt. Who, exactly, precisely, is "we" please? William Overington 19 August 2015 -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Wed Aug 19 13:22:59 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Wed, 19 Aug 2015 20:22:59 +0200 (CEST) Subject: a suggestion new emoji . In-Reply-To: References: Message-ID: <1743506480.18484.1440008579130.JavaMail.www@wwinf1j04> On 19 Aug 2015 at 01:45, Emma Haneys wrote: > i suggest one and only for fruit category . it is a durian . Emma, at small sizes, and especially in monochrome rendering, the glyph of a durian emoji would resemble closely to the glyph of an eventual lychee emoji. You should read also the detailed explanations from Dr.?Freytag on this mailing list, in the previous most recent emoji thread: http://www.unicode.org/mail-arch/unicode-ml/y2015-m08/0014.html All the best, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Wed Aug 19 13:40:12 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Wed, 19 Aug 2015 20:40:12 +0200 (CEST) Subject: a suggestion new emoji . 
In-Reply-To: <3524460.27958.1439988923889.JavaMail.defaultUser@defaultHost> References: <55D46A4E.7090501@uni-konstanz.de> <3524460.27958.1439988923889.JavaMail.defaultUser@defaultHost> Message-ID: <345852892.18984.1440009612606.JavaMail.www@wwinf1j04> On 19 Aug 2015 at 17:18, William_J_G Overington wrote: > Suppose that someone suggests encoding LOCOMOTIVE MALLARD WITH TENDER LIVERIED AS PRESERVED SEEN IN A DIRECT SIDEWAYS VIEW FROM HER LEFT with a glyph five times as wide as high. Do not forget that to be encoded in Unicode, an emoji must be highly iconic, Dr.?Freytag explained on 03?Aug?2015 at?12:38: http://www.unicode.org/mail-arch/unicode-ml/y2015-m08/0014.html So for *all* steam locomotives, we cannot have more emojis than the one we have at U+1F682 STEAM LOCOMOTIVE. This emoji's signification is polysemic and stays at least for all steam locomotives of any period and manufacturer, as well as for touristic oldtimer railway lines and stations. > I remember that there was a trains font. To implement the trains font in Unicode (or to implement Unicode in the trains font, I don't know well which way it goes round), the best would be to use the Private Use Area, as Mr. Kolehmainen recommended lastly for another purpose: http://www.unicode.org/mail-arch/unicode-ml/y2015-m08/0152.html Using the Contact form is a very good advice. This would have saved Emma from the sarcastic comments that came first in thread. All the best, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From andrewcwest at gmail.com Wed Aug 19 13:45:53 2015 From: andrewcwest at gmail.com (Andrew West) Date: Wed, 19 Aug 2015 19:45:53 +0100 Subject: a suggestion new emoji . 
In-Reply-To: <1743506480.18484.1440008579130.JavaMail.www@wwinf1j04> References: <1743506480.18484.1440008579130.JavaMail.www@wwinf1j04> Message-ID: On 19 August 2015 at 19:22, Marcel Schneider wrote: > > On 19 Aug 2015 at 01:45, Emma Haneys wrote: > > > i suggest one and only for fruit category . it is a durian . > > Emma, at small sizes, and especially in monochrome rendering, the glyph of a durian emoji would resemble closely to the glyph of an eventual lychee emoji. I don't know, I think durian emoji would be quite distinctive, as shown in the examples on this page (I am rather taken with the sad durian which gets no hugs). Andrew From andrewcwest at gmail.com Wed Aug 19 13:48:17 2015 From: andrewcwest at gmail.com (Andrew West) Date: Wed, 19 Aug 2015 19:48:17 +0100 Subject: a suggestion new emoji . In-Reply-To: References: <1743506480.18484.1440008579130.JavaMail.www@wwinf1j04> Message-ID: On 19 August 2015 at 19:45, Andrew West wrote: > On 19 August 2015 at 19:22, Marcel Schneider wrote: >> >> On 19 Aug 2015 at 01:45, Emma Haneys wrote: >> >> > i suggest one and only for fruit category . it is a durian . >> >> Emma, at small sizes, and especially in monochrome rendering, the glyph of a durian emoji would resemble closely to the glyph of an eventual lychee emoji. > > I don't know, I think durian emoji would be quite distinctive, as > shown in the examples on this page (I am rather taken with the sad > durian which gets no hugs). Sorry, this page: http://www.cafepress.co.uk/+durian+stickers Andrew From charupdate at orange.fr Wed Aug 19 13:49:39 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Wed, 19 Aug 2015 20:49:39 +0200 (CEST) Subject: a suggestion new emoji . 
In-Reply-To: References: <1743506480.18484.1440008579130.JavaMail.www@wwinf1j04> Message-ID: <62460733.18847.1440010179831.JavaMail.www@wwinf1e33> On 19 Aug 2015 at 20:45, Andrew West wrote: > I don't know, I think durian emoji would be quite distinctive, as > shown in the examples on this page (I am rather taken with the sad > durian which gets no hugs). What page do you refer to, the hyperlink has got lost, please. Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Wed Aug 19 13:59:03 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Wed, 19 Aug 2015 20:59:03 +0200 (CEST) Subject: a suggestion new emoji . Message-ID: <398300494.19072.1440010743779.JavaMail.www@wwinf1e33> On 19 Aug 2015 at 17:18, William_J_G Overington wrote: > I suggest to Emma the contacting of Unicode Inc. using the following form. > > http://www.unicode.org/reporting.html William is right. I strongly recommend you to first use the Contact form, as I did from the beginning on, long before e-mailing to the List. Using the Contact form you will always get a good answer (not always a *positive* response, but always a *good* answer). Just ignore the arrogant bullying that has come first in thread! Best wishes, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Wed Aug 19 14:06:58 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Wed, 19 Aug 2015 21:06:58 +0200 (CEST) Subject: a suggestion new emoji . In-Reply-To: References: <1743506480.18484.1440008579130.JavaMail.www@wwinf1j04> Message-ID: <1360399199.19324.1440011218382.JavaMail.www@wwinf1e33> On 19 Aug 2015 at 20:48, Andrew West wrote: > On 19 August 2015 at 19:45, Andrew West wrote: > > On 19 August 2015 at 19:22, Marcel Schneider wrote: > >> > >> On 19 Aug 2015 at 01:45, Emma Haneys wrote: > >> > >> > i suggest one and only for fruit category . it is a durian . 
> >> > >> Emma, at small sizes, and especially in monochrome rendering, the glyph of a durian emoji would resemble closely to the glyph of an eventual lychee emoji. > > > > I don't know, I think durian emoji would be quite distinctive, as > > shown in the examples on this page (I am rather taken with the sad > > durian which gets no hugs). > > Sorry, this page: > > http://www.cafepress.co.uk/+durian+stickers That's nice. I see, durians have sharp tips and are represented in a corresponding way, whereas lychees have round tips. Well seen! Marcel? -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Wed Aug 19 14:48:38 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Wed, 19 Aug 2015 21:48:38 +0200 (CEST) Subject: a suggestion new emoji . In-Reply-To: <1360399199.19324.1440011218382.JavaMail.www@wwinf1e33> References: <1743506480.18484.1440008579130.JavaMail.www@wwinf1j04> <1360399199.19324.1440011218382.JavaMail.www@wwinf1e33> Message-ID: <1456355655.15227.1440013718592.JavaMail.www@wwinf1d04> On 19 Aug 2015 at 20:48, Andrew West wrote: > On 19 August 2015 at 19:45, Andrew West wrote: > > On 19 August 2015 at 19:22, Marcel Schneider wrote: > >> > >> On 19 Aug 2015 at 01:45, Emma Haneys wrote: > >> > >> > i suggest one and only for fruit category . it is a durian . > >> > >> Emma, at small sizes, and especially in monochrome rendering, the glyph of a durian emoji would resemble closely to the glyph of an eventual lychee emoji. > > > > I don't know, I think durian emoji would be quite distinctive, as > > shown in the examples on this page (I am rather taken with the sad > > durian which gets no hugs). > > Sorry, this page: > > http://www.cafepress.co.uk/+durian+stickers To console the sad durian, Unicode should encode the durian! This could free the sad one from death thoughts (I've noticed the little skull). I hope that Emma's petition will become successful! Marcel? 
From richard.wordingham at ntlworld.com Wed Aug 19 20:19:39 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 20 Aug 2015 02:19:39 +0100 Subject: Custom keyboard source samples (was: Re: Windows keyboard restrictions) In-Reply-To: <749936537.8372.1439812286240.JavaMail.www@wwinf1j14> References: <749936537.8372.1439812286240.JavaMail.www@wwinf1j14> Message-ID: <20150820021939.64532cc7@JRWUBU2> On Mon, 17 Aug 2015 13:51:26 +0200 (CEST) Marcel Schneider wrote: > Since yesterday I know a very simple way to get the source code (in > C) of any MSKLC layout. Is this legal? To me it smacks of reverse engineering, which is prohibited under the MSKLC licence. Richard. From mark at kli.org Wed Aug 19 21:18:37 2015 From: mark at kli.org (Mark E. Shoulson) Date: Wed, 19 Aug 2015 22:18:37 -0400 Subject: a suggestion new emoji . In-Reply-To: <8509498.48370.1440004240920.JavaMail.defaultUser@defaultHost> References: <55D46A4E.7090501@uni-konstanz.de> <8509498.48370.1440004240920.JavaMail.defaultUser@defaultHost> Message-ID: <55D538FD.8010103@kli.org> And is there an emoji for GRAIN OF SALT? (Actually, that could almost be useful... or even just a geometric CUBE...) ~mark On 08/19/2015 01:10 PM, William_J_G Overington wrote: > > Mark Davis wrote: > > > As far as petitions go, we take them with a sizable grain of salt. > > Who, exactly, precisely, is "we" please? > > William Overington > > 19 August 2015 > > > From miszhan3ys at gmail.com Wed Aug 19 21:18:45 2015 From: miszhan3ys at gmail.com (Emma Haneys) Date: Thu, 20 Aug 2015 10:18:45 +0800 Subject: a suggestion new emoji . In-Reply-To: <3524460.27958.1439988923889.JavaMail.defaultUser@defaultHost> References: <55D46A4E.7090501@uni-konstanz.de> <3524460.27958.1439988923889.JavaMail.defaultUser@defaultHost> Message-ID: thanx all for responding . and i 'ved sent the suggestion to the right place . thanx for info ?? 
On Aug 19, 2015 8:55 PM, "William_J_G Overington" wrote: > Otto Stolz wrote: > > > You cannot suggest a new character just because it would be ?nice to > have?. > > Why not? > > The UNICORN FACE character has been encoded into regular Unicode following > such a suggestion. > > I sent in a suggestion for an OKAPI to be encoded and was informed that my > suggestion would be added to "the pile" of suggestions. > > It would be interesting to read what is listed in "the pile". > > It might be an interesting social history document of present times. > > Not just a list of items each suggested by many people, but a list of > everything that has been suggested. > > Suppose that someone suggests encoding LOCOMOTIVE MALLARD WITH TENDER > LIVERIED AS PRESERVED SEEN IN A DIRECT SIDEWAYS VIEW FROM HER LEFT with a > glyph five times as wide as high. > > I remember that there was a trains font. > > Suppose that a mobile telephone manufacturer included in a mobile > telephone a collection of characters one each for a number of specific > steam locomotives. > > Returning to Emma's suggestion. > > I suggest to Emma the contacting of Unicode Inc. using the following form. > > http://www.unicode.org/reporting.html > > Hopefully Emma will receive an official reply and her suggestion will > become added to "the pile". > > What being added to "the pile" presently means, or what it may become to > mean, I do not know. > > Yet I suggest that making the suggestion on that form would be a > potentially useful thing to do. > > I hope that Emma's suggestion is successful. > > William Overington > > 19 August 2015 > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kojiishi at gmail.com Thu Aug 20 01:18:06 2015 From: kojiishi at gmail.com (Koji Ishii) Date: Thu, 20 Aug 2015 15:18:06 +0900 Subject: Why doesn't Ideographic (ID) in UAX#14 have half-width katakana? 
In-Reply-To: <55D4B2C3.2010000@att.net> References: <553EEE6D.2020004@ga2.so-net.ne.jp> <553EFB2E.3010808@hiroshima-u.ac.jp> <55403267.9060202@att.net> <55438AFB.6020000@att.net> <55439AE4.4020109@hiroshima-u.ac.jp> <5543AE4B.5020904@att.net> <5543D03B.80603@ix.netcom.com> <55467CAF.4080401@ix.netcom.com> <4A7F11C4-2F5B-4156-844C-C21E543FCFC7@apple.com> <55D4AD10.1080104@inkedblade.net> <55D4B2C3.2010000@att.net> Message-ID: Right, this should be applied to only where currently AL. The basic idea is that, full width is a concept to use a character in an "imported" manner and thus different characteristics are applied, while half width is a concept of saving screen real estate and/or for legacy cultural usages so the characteristics should be the same as its full width counterpart, except the width. Roozbeh, thank you for the date, I'll work by then. /koji On Thu, Aug 20, 2015 at 1:45 AM, Ken Whistler wrote: > I don't think that is the issue. U+FF9E/F are already lb=NS, which prevents > line breaks before. The issue is instead loosening up the lb class for > the halfwidth katakana syllables (from lb=AL to lb=ID), so that they *can* > break the way the regular katakana syllables do. > > --Ken > > > On 8/19/2015 9:21 AM, fantasai wrote: > >> On 05/04/2015 02:19 PM, Peter Edberg wrote: >> >>> I have been checking with various groups at Apple. The consensus here is >>> that we would like to see the linebreak value for >>> halfwidth katakana changed to ID. >>> >> >> Do we want all halfwidth kana changed to ID, or should there >> be some exception for the voicing marks (U+FF9E, U+FF9F) to >> forbid breaks before? >> >> ~fantasai >> >> > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From charupdate at orange.fr Thu Aug 20 10:30:38 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Thu, 20 Aug 2015 17:30:38 +0200 (CEST) Subject: Custom keyboard source samples (was: Re: Windows keyboard restrictions) Message-ID: <1408769184.13369.1440084638814.JavaMail.www@wwinf1d11> On 20 Aug 2015 at 03:19, Richard Wordingham wrote: > On Mon, 17 Aug 2015 13:51:26 +0200 (CEST) > Marcel Schneider wrote: >> Since yesterday I know a very simple way to get the source code (in >> C) of any MSKLC layout. > Is this legal? To me it smacks of reverse engineering, which is > prohibited under the MSKLC licence. When I saw your question, I felt somebody at Microsoft would be most qualified to answer it (all the more as I'm not an addressee). But the point here is that we can answer it by ourselves, because the keyboard drivers are not covered by the MSKLC licence. The licensed software is the MSKLC folder. So it *is* legal. Let's look at the details, however: "You may not: work around any technical limitations in the software; reverse engineer, decompile or disassemble the software, except and only to the extent that applicable law expressly permits, despite this limitation; make more copies of the software than specified in this agreement or allowed by applicable law, despite this limitation; publish the software for others to copy; rent, lease or lend the software; transfer the software or this agreement to any third party; use the software for commercial software hosting services; or use the software for the sole purpose of repackaging a Microsoft provided keyboard layout to offer as a stand-alone commercial product for which you charge a fee." Do I "work around any technical limitations in the software" by picking up the source code of the drivers it generates? This is my main concern about this practice. Are we allowed to use files generated by MSKLC that are not expressly provided to the user? 
Further, are we allowed to use installation packages generated by MSKLC to install other keyboard drivers than those generated by MSKLC? To install keyboard drivers that exceed the limitations of MSKLC? The questioning becomes even more troublesome when we remember that the WDK is mentioned in the MSKLC Help, and ask: When we accept the invitation to switch towards the WDK, must we package the drivers with the resources the driver kit comes along with (while not knowing how to write an INF file!), or may we use the MSI and setup from MSKLC? BTW we may wonder why and how MSKLC compiles a Windows-On-Windows driver, while except for a few sparse mentions, nothing seems to be provided for WOW in the WDK. Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Thu Aug 20 10:33:30 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Thu, 20 Aug 2015 17:33:30 +0200 (CEST) Subject: Custom keyboard source samples (was: Re: Windows keyboard restrictions) Message-ID: <133672282.13416.1440084810705.JavaMail.www@wwinf1d11> On 18 Aug 2015 at 10:09, Philippe Verdy wrote: > i don't know why these c source files need to be deleted so fast when they could just remain in the same folder as the saved .klc file. I missed the point when I replied on 18 Aug 2015. In fact, there's no short answer (which BTW would be "There's no use of 'em"). The point is: Why (the heck) are the C source folders stored in the hidden AppData Temp directory instead of appearing in the most straightforward place? (Which is, as Philippe notes, the same folder as the saved .klc file.) We could even extend and ask: Why is there no option "Keep the C sources" / "Delete the C sources"? Why are there no menu items "Generate C source" and "Build from C source", or an option "Build from KLC source" / "Build from C source"? That's what I wished to find in MSKLC when I learned about it. 
Just imagine: before, I had not even imagined that such sources could ever exist. No, we must not disturb the author of MSKLC; we can answer for ourselves. And then we'll probably fall back on what I wrote the day before yesterday. Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Sat Aug 22 07:21:20 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Sat, 22 Aug 2015 14:21:20 +0200 (CEST) Subject: a suggestion new emoji . Message-ID: <323520393.11425.1440246080996.JavaMail.www@wwinf1f34> On 19 Aug 2015 at 20:59, I wrote: > On 19 Aug 2015 at 17:18, William_J_G Overington wrote: >> I suggest to Emma the contacting of Unicode Inc. using the following form. >> http://www.unicode.org/reporting.html > William is right. I strongly recommend you to first use the Contact form, as I did from the beginning on, long before e-mailing to the List. > Using the Contact form you will always get a good answer (not always a *positive* response, but always a *good* answer). I forgot to add that of course you are always welcome on the Mailing List, where you equally get good answers. But you need to be patient, as the best answers come naturally last in a thread, as occurred just six hours before you posted. On 19 Aug 2015 at 20:48, Andrew West wrote: > > > > I don't know, I think durian emoji would be quite distinctive, as > > shown in the examples on this page (I am rather taken with the sad > > durian which gets no hugs). > http://www.cafepress.co.uk/+durian+stickers For a fruit, a vegetable, a cereal, a plant, an animal, having its emoji encoded in Unicode is like a big hug! So we thank Mrs Haneys for having suggested the DURIAN emoji! Marcel -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From richard.wordingham at ntlworld.com Sat Aug 22 08:35:30 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 22 Aug 2015 14:35:30 +0100 Subject: Thai Word Breaking Message-ID: <20150822143530.29f1e883@JRWUBU2> I'm trying to work out the meaning of TUS 8.0 Section 23.2. To do Thai word breaking properly, one needs to do a semantic analysis of the text to do the equivalent of resolving the equivalent of 'humanevents' into 'human events' rather than 'humane vents'. One also needs to cope with unknown and misspelt words. (A lot of effort has been devoted to avoid going to the extreme of doing semantic analysis.) However, it is possible to read Section 23.2 as prohibiting the use of certain information, and I would like to check whether this is the intended meaning. The opening paragraph seems clear enough on first reading: "The effect of layout controls is specific to particular text processes. As much as possible, lay-out controls are transparent to those text processes for which they were not intended. In other words, their effects are mutually orthogonal." However, my first question is, "Are paragraph boundaries directly admissible as evidence for or against word boundaries not adjacent to them?". For example, most Thai word breakers would not regard a paragraph boundary as any more significant than a phrase-delimiting space. However, a paragraph boundary often indicates a change of topic. My second question is, "Are line breaks admissible as evidence for or against word boundaries not adjacent to them?" For example, if a phrase makes heavy use of U+200B ZERO WIDTH SPACE (ZWSP), one may deduce that it is likely that all word boundaries within it are marked explicitly. This example is more useful for Khmer than to Thai, for whereas Cambodians were once taught to mark word boundaries, Thais rarely use ZWSP to mark word boundaries. 
My third question is, "Is the absence of a line break opportunity admissible as evidence for or against a word boundary?". Here I see conflicting signals. There is a character U+2060 WORD JOINER (WJ) which *used* to be regarded as the counterpart of ZWSP. The understanding was that ZWSP marked a word boundary and provided a line-break opportunity, while WJ denied both. This, however, is no longer the case. To quote the TUS section about WJ: P1: (Ignored) P2S1: The word joiner must not be confused with the zero width joiner or the combining grapheme joiner, which have very different functions. P2S2: In particular, inserting a word joiner between two characters has no effect on their ligating and cursive joining behavior. P2S3: The word joiner should be ignored in contexts other than line breaking. P2S4: Note in particular that the word joiner is ignored for word segmentation. P2S5: (See Unicode Standard Annex #29, "Unicode Text Segmentation.") Paragraph 2 Sentence 3 (P2S3) appears to rule out its use in word-breaking, but perhaps it does not if line-breaking is being used as evidence for word boundaries. P2S4 has three very different interpretations: (i) This is an assertion of fact, and may therefore be incorrect. (ii) The word 'is' is sloppy wording for 'should be'. Section 23.2 contains much sloppier wording, as I have already advised members of the UTC (4 July 2015). (iii) This is a deduction from other parts of the specification. Now, if P2S4 said 'is normally ignored for word segmentation', that would have made sense, for that applies to the default word boundary specification in UAX#29. However, just before Section 4.1, UAX#29 explains that it does not specify what happens for word boundary determination in Thai! (It does constrain what happens, though.) At the end of UAX#29 Section 6.2, there is the provision, "The Ignore rules should not be overridden by tailorings, with the possible exception of remapping some of the Format characters to other classes." 
To accord with the user perceptions of Unicode-aware people who work with SE Asian scripts, I am tempted to ask for CLDR to tailor the word-breaking algorithms for the corresponding languages so that the word-breaking classes of WJ (and ZWNBSP) are changed from Format to MidLetter. That would match the widespread old *perception* that there should be no word break in a sequence <letter, WJ, letter>. However, there are several objections: (a) Perhaps P2S3 and P2S4 prohibit this. (b) If the word-break property of Thai letters falls back to Other, there would still be a word break between them. (c) If the word-break property of Thai letters fell back to ALetter, an old suggestion, WJ would have no effect on the presence of a word break. (d) If Thai word breaking assigns word-break classes to each letter (gc=Lo), then word boundaries can be suppressed by choosing the classes appropriately. Non-spacing Thai vowels are very relevant to Thai word-breaking, but formally are 'ignored'. WJ could be 'ignored' in exactly the same way. Richard. From nigel at nigelsmall.com Sat Aug 22 11:08:48 2015 From: nigel at nigelsmall.com (Nigel Small) Date: Sat, 22 Aug 2015 17:08:48 +0100 Subject: Square Brackets with Tick Message-ID: Hi all I am looking for clarification on an aspect of Unicode bracket pairing, specifically in relation to the following four characters: 298D; 2990; o # LEFT SQUARE BRACKET WITH TICK IN TOP CORNER 298E; 298F; c # RIGHT SQUARE BRACKET WITH TICK IN BOTTOM CORNER 298F; 298E; o # LEFT SQUARE BRACKET WITH TICK IN BOTTOM CORNER 2990; 298D; c # RIGHT SQUARE BRACKET WITH TICK IN TOP CORNER These stand out from all other brackets listed in *BidiBrackets.txt* due to an inconsistency in pairing. I have looked for references online on where these brackets are used in the wild as mathematical symbols but have been unable to find anything useful. All other bracket pairs are listed as opener followed by closer, sometimes with several code points in between. 
According to the code point pairs in the first and second columns of this file, these particular brackets should be paired as the *first and fourth* and the *third and second*. Intuitively however, these would actually be *first and second* and *third and fourth* if one is to expect consistency. My guess is that there are three possibilities here: 1. The current pairing information is correct and the sequence is irregular for some historical reason 2. The pairing information is wrong and the sequence is consistent with other brackets 3. Pairing can be mixed with either left bracket used as a valid opener and either right bracket used as a valid closer; in this case, the pairing information is incomplete I'd be very grateful if anyone could clarify the situation here or if anyone knows of a resource that describes where such brackets are used in practice. Many thanks Nigel Small -------------- next part -------------- An HTML attachment was scrubbed... URL: From eliz at gnu.org Sat Aug 22 11:26:25 2015 From: eliz at gnu.org (Eli Zaretskii) Date: Sat, 22 Aug 2015 19:26:25 +0300 Subject: Square Brackets with Tick In-Reply-To: References: Message-ID: <83bndzi91q.fsf@gnu.org> > From: Nigel Small > Date: Sat, 22 Aug 2015 17:08:48 +0100 > > I am looking for clarification on an aspect of Unicode bracket pairing, > specifically in relation to the following four characters: > > 298D; 2990; o # LEFT SQUARE BRACKET WITH TICK IN TOP CORNER > 298E; 298F; c # RIGHT SQUARE BRACKET WITH TICK IN BOTTOM CORNER > 298F; 298E; o # LEFT SQUARE BRACKET WITH TICK IN BOTTOM CORNER > 2990; 298D; c # RIGHT SQUARE BRACKET WITH TICK IN TOP CORNER > > These stand out from all other brackets listed in BidiBrackets.txt due to an > inconsistency in pairing. I have looked for references online on where these > brackets are used in the wild as mathematical symbols but have been unable to > find anything useful. 
> > All other bracket pairs are listed as opener followed by closer, sometimes with > several code points in between. I think the order in the file is by the codepoint in the leftmost column. All the rest is just a coincidence. But I don't speak for the Unicode Consortium, so please wait for a definitive reply. From jcb+unicode at inf.ed.ac.uk Sat Aug 22 11:35:19 2015 From: jcb+unicode at inf.ed.ac.uk (Julian Bradfield) Date: Sat, 22 Aug 2015 17:35:19 +0100 (BST) Subject: Square Brackets with Tick References: Message-ID: On 2015-08-22, Nigel Small wrote: > 298D; 2990; o # LEFT SQUARE BRACKET WITH TICK IN TOP CORNER > 298E; 298F; c # RIGHT SQUARE BRACKET WITH TICK IN BOTTOM CORNER > 298F; 298E; o # LEFT SQUARE BRACKET WITH TICK IN BOTTOM CORNER > 2990; 298D; c # RIGHT SQUARE BRACKET WITH TICK IN TOP CORNER > with several code points in between. According to the code point pairs in > the first and second columns of this file, these particular brackets should > be paired as the *first and fourth* and the *third and second*. Intuitively > however, these would actually be *first and second* and *third and fourth* > if one is to expect consistency. That's a strange intuition! Mathematical brackets are expected to pair with left-right symmetry, not rotational symmetry. As in, for example, floor and ceiling brackets. The pairing in the file is the natural one. > 1. The current pairing information is correct and the sequence is irregular > for some historical reason That will be the explanation. There is no inherent meaning to the order of codepoints, it's just convenience. One of the experts here can probably tell us why these four brackets happen to be coded in this order. -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. 
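Nigel's observation can be checked mechanically. The sketch below is a hypothetical minimal reader for the four lines he quotes (in the BidiBrackets.txt field layout), not the real Unicode data file; it tests two things: whether the pairing data is self-consistent, and whether each opener pairs with the immediately following code point.

```python
# The four entries Nigel quotes, in BidiBrackets.txt field layout:
# <code point>; <paired code point>; <o(pen)|c(lose)> # name
ENTRIES = """\
298D; 2990; o # LEFT SQUARE BRACKET WITH TICK IN TOP CORNER
298E; 298F; c # RIGHT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
298F; 298E; o # LEFT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
2990; 298D; c # RIGHT SQUARE BRACKET WITH TICK IN TOP CORNER
"""

pairs = {}
for line in ENTRIES.splitlines():
    cp, paired, kind = (f.strip() for f in line.split("#")[0].split(";"))
    pairs[int(cp, 16)] = (int(paired, 16), kind)

# The data is at least self-consistent: A pairs with B iff B pairs with A,
# and every opener pairs with a closer.
for cp, (paired, kind) in pairs.items():
    assert pairs[paired][0] == cp
    assert pairs[paired][1] != kind

# But no opener pairs with the very next code point, which is the
# "first and second / third and fourth" intuition Nigel describes.
adjacent = {hex(cp): paired == cp + 1
            for cp, (paired, kind) in pairs.items() if kind == "o"}
print(adjacent)  # {'0x298d': False, '0x298f': False}
```

So the quoted data pairs the first with the fourth and the third with the second, exactly as Nigel says, while remaining internally symmetric.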
From asmus-inc at ix.netcom.com Sat Aug 22 12:32:45 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Sat, 22 Aug 2015 10:32:45 -0700 Subject: Square Brackets with Tick In-Reply-To: References: Message-ID: <55D8B23D.6070405@ix.netcom.com> An HTML attachment was scrubbed... URL: From public at khwilliamson.com Sat Aug 22 15:08:14 2015 From: public at khwilliamson.com (Karl Williamson) Date: Sat, 22 Aug 2015 14:08:14 -0600 Subject: \b{wb} Message-ID: <55D8D6AE.8080008@khwilliamson.com> The concept of \b in a regular expression meaning to match the boundary between a word and non-word was invented by Larry Wall, for the Perl programming language. This was before Unicode, and a word was defined as alphanumerics plus the underscore, which fit well with how identifiers in that computer language (and many others) were defined. Essentially \b is defined to break between runs of word characters versus runs of non-word characters. The latest version of Perl 5 (recently released) has added \b{w} based on Unicode's definition. The typical expectation of its programmers is that it would be a drop-in replacement for the old \b, with much better results in parsing natural languages. But it isn't such a replacement, creating some consternation, and the main reason is that, unlike \b, it treats the boundary between white space characters as a breaking opportunity, so that it doesn't create runs of them. Thus if you have two spaces after a full stop, it treats each as an individual word. My question is "Was this intentional, and if so, Why?" TR18 says \b{w} is a "Zero-width match at a Unicode word boundary. Note that this is different than \b alone, which corresponds to \w and \W." And UAX29 says "adjacent spaces are collapsed to a single space" in intelligent cut and paste using the WB property. 
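Karl's point about runs can be seen with classic \b alone. A minimal sketch using Python's stdlib `re` module (which implements the old alphanumerics-plus-underscore definition, not the UAX#29 one; zero-width splitting requires Python 3.7+):

```python
import re

text = "End.  Start"  # two spaces after the full stop

# Classic \b matches only at \w/\W transitions, so splitting on it keeps
# the ".  " between the two words together as a single non-word run.
runs = re.split(r"\b", text)
print(runs)  # ['', 'End', '.  ', 'Start', '']

# A UAX#29-style word segmenter (Perl's new boundary assertion) would also
# report a boundary between the two spaces, so ".", " " and " " come back
# as three separate segments rather than one run.
```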
From richard.wordingham at ntlworld.com Sat Aug 22 16:46:08 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 22 Aug 2015 22:46:08 +0100 Subject: \b{wb} In-Reply-To: <55D8D6AE.8080008@khwilliamson.com> References: <55D8D6AE.8080008@khwilliamson.com> Message-ID: <20150822224608.384dacaf@JRWUBU2> On Sat, 22 Aug 2015 14:08:14 -0600 Karl Williamson wrote: > But it isn't such a replacement, creating some consternation, and the > main reason is that, unlike \b, it treats the boundary between white > space characters as a breaking opportunity, so that it doesn't create > runs of them. Thus if you have two spaces after a full stop, it > treats each as an individual word. > > My question is "Was this intentional, and if so, Why?" See below. > TR18 says \b{w} is a"Zero-width match at a Unicode word boundary. > Note that this is different than \b alone, which corresponds to \w > and \W." Unless I'm being stupid, \b and \b{w} are indeed very different. Consider a sequence of a space, a flag (a regional indicator pair) and "Ab". That has two internal word boundaries, splitting it into a space, a flag, and the word "Ab". Is this what you want? Worse, consider a short Thai sentence ???????????????????????. That gets split by ICU into |??|?????|???????????|???|??| - 5 words and 4 internal word boundaries. Note that there's a word or two between each boundary. Is this what you want? > My question is "Was this intentional, and if so, Why?" Take a look at the rules in UAX#29 Section 4.1.1. Apart from the first two and the last, they all identify where word boundaries aren't. This is tidy - the algorithm concentrates on working out where a word continues. In principle, you could, I believe, extend the rules so that characters outside words and regional indicator runs were not divided, but it would make for a more complicated algorithm with plenty of opportunities for error. I think the thought was that word-free runs did not need to be assembled into runs of non-word material. 
The short answer, of course, is that the regular expression engine could do this final step of post-processing itself. This may get tricky with customised word-breaking. Richard. From richard.wordingham at ntlworld.com Sat Aug 22 16:47:06 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 22 Aug 2015 22:47:06 +0100 Subject: Square Brackets with Tick In-Reply-To: <55D8B23D.6070405@ix.netcom.com> References: <55D8B23D.6070405@ix.netcom.com> Message-ID: <20150822224706.5680b7d3@JRWUBU2> On Sat, 22 Aug 2015 10:32:45 -0700 "Asmus Freytag (t)" wrote: > On 8/22/2015 9:35 AM, Julian Bradfield wrote: >> There is no inherent meaning to the >> order of codepoints, it's just convenience. > And for that reason, we have property files to explicitly give the > properties rather than asking the user to "glean" them from code > point order. But codepoints are normally orderly until they enter the ISO approval process. Thereafter, disorder creeps in, and becomes ever more likely as blocks fill up. The concern here is that the opening-closing pairing information, which used not to be a property, has been deduced wrongly. The code chart is prima facie evidence that whoever drew the order up conceived of U+298D and U+298E as a pair. I've traced the character as far back as http://www.unicode.org/L2/L1999/99159.pdf . Unfortunately, its meaning therein is implicitly described as unknown! It looks as though someone somewhere fashioned type for it - or perhaps another of the set of four - but no-one remembers what it was used for! Now, *if* no-one is using it, it doesn't really matter if the pair is wrong. Richard. 
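The post-processing step Richard mentions can be sketched directly: given UAX#29-style boundary offsets, drop every boundary that falls strictly inside a run of non-word characters, recovering \b-like runs. This is an illustrative sketch under simplifying assumptions, not ICU's or Perl's actual implementation; "word character" is approximated here by the regex class \w.

```python
import re

def collapse_nonword_runs(text, boundaries):
    """Keep only boundaries where at least one neighbour is a word
    character (plus the two ends), merging non-word segments into runs."""
    kept = []
    for b in boundaries:
        if b == 0 or b == len(text):
            kept.append(b)
        elif re.match(r"\w", text[b - 1]) or re.match(r"\w", text[b]):
            kept.append(b)
    return kept

text = "End.  Start"
# UAX#29-style offsets: ".", " " and " " are three separate segments.
uax29 = [0, 3, 4, 5, 6, 11]
print(collapse_nonword_runs(text, uax29))  # [0, 3, 6, 11], classic \b behaviour
```

Customised word-breaking complicates this, as Richard notes, because the post-processor's notion of "word character" then has to agree with whatever the tailored segmenter used.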
From asmusf at ix.netcom.com Sat Aug 22 19:53:14 2015 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sat, 22 Aug 2015 17:53:14 -0700 Subject: Square Brackets with Tick In-Reply-To: <20150822224706.5680b7d3@JRWUBU2> References: <55D8B23D.6070405@ix.netcom.com> <20150822224706.5680b7d3@JRWUBU2> Message-ID: <55D9197A.1040700@ix.netcom.com> An HTML attachment was scrubbed... URL: From nigel at nigelsmall.com Sun Aug 23 04:50:57 2015 From: nigel at nigelsmall.com (Nigel Small) Date: Sun, 23 Aug 2015 10:50:57 +0100 Subject: Square Brackets with Tick In-Reply-To: <55D9197A.1040700@ix.netcom.com> References: <55D8B23D.6070405@ix.netcom.com> <20150822224706.5680b7d3@JRWUBU2> <55D9197A.1040700@ix.netcom.com> Message-ID: Thanks to everyone for your responses so far. In terms of my comment on which brackets make intuitive pairs, I should perhaps have explained my thought process more clearly. If one is to consider the possible origins of these symbols, one likely idea is that they could be used to symbolise a bracketed expression that has been "slashed through". In that context, pairing the top left tick with the bottom right tick makes sense, as does the pairing of the other two. Then, the original code point order remains consistent (though I understand this need not have any relevance). This appears to mirror Asmus' observation. On 23 Aug 2015 1:58 am, "Asmus Freytag" wrote: > On 8/22/2015 2:47 PM, Richard Wordingham wrote: > > But codepoints are normally orderly until they enter the ISO approval > process. Thereafter, disorder creeps in, and becomes ever more likely > as blocks fill up > > > Haha, good one. > > . The concern here is that the opening-closing > pairing information, which used not to be a property, has been deduced > wrongly. The code chart is prima facie evidence that whoever drew the > order up conceived of U+298D and U+298E as a pair. > > > Not necessarily. Code charts are sometimes ordered in mysterious ways. > However, read on. 
> > > > I've traced the character as far back ashttp://www.unicode.org/L2/L1999/99159.pdf . Unfortunately, its meaning > therein is implicitly described as unknown! It looks as though someone > somewhere fashioned type for it - or perhaps another of the set of four > - but no-one remembers what it was used for! > > > This document doesn't tell you what the pairing is supposed to be, only > that which > ones are opening and closing (so we know that they are intended to be > arranged [ ] > and not ] [ (ticks omitted), but we don't know which of the two [[ go with > which of > the two ]], other than the - natural - assumptions that pairs are listed > adjacently). > > For the first document that gives the pairing information, see: > > http://www.unicode.org/L2/L2012/12173r-bidi-paren.pdf > > There is no note or other indication in this document that shows that any > thought > was put into the different ordering. > > However, it is notable that all other bracket pairings follow the bidi > mirroring glyph > relation, so I would put my money on that that file was used to create the > pairs using > a script, rather than manual editing. > > This is corroborated in section 3.2 of that document. > > Nigel was the first to notice that these were not encoded as left-right > glyph pairs, > but with the diagonal "tick" (originally called a solidus) having the same > orientation > in a pair (as if intended to bracket something in either diagonal or > anti-diagonal > direction). > > Given that L2/12-173 states that the property was derived via algorithm > that is based > on left-right mirroring and not via matching open/close pairs based on > other factors, > (including adjacency in the charts) I'm happy to join the growing chorus > that declares > this to be a bug. > > Luckily there seems to be no stability policy that would prevent fixing > this one. > > A./ > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From richard.wordingham at ntlworld.com Sun Aug 23 09:15:39 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 23 Aug 2015 15:15:39 +0100 Subject: UAX#29 Word-Breaking Interface for Complex Context Message-ID: <20150823151539.63a73b9a@JRWUBU2> The word-breaking algorithm defines an apparently innocuous interface for word breaking of 'complex context' scripts such as Thai, Lao and Myanmar. The complex context part, whose internals are deliberately and reasonably not defined by Unicode, assigns word break property values to the characters. Are there any implementations that work that way? Negative answers such as 'xxx does not work that way' would also be useful. For example, ICU does not work this way. Instead, the complex context parts deliver word boundaries rather than character properties to the part of the algorithm working in accordance with a tailoring of the algorithm in UAX#29. It seems that in general the assignments may be a little complicated. For example, in the usual case of interest, Thai script word characters delimited by white space, it seems to me that the characters of alternate words should be assigned to 'ALetter' and 'Katakana'. Have I missed a trick here? 'RI' is a new alternative to 'ALetter' and 'Katakana', but that seems even more bizarre, and I'd worry about its stability. I'm finding some interesting constraints arising from the interface. For example, *within* x?y (that's a Thai letter flanked by two English letters), there are either no or two word boundaries. By contrast, there may be no, one or two linebreak opportunities *within* the string. Richard. From jknappen at web.de Mon Aug 24 04:39:49 2015 From: jknappen at web.de ("Jörg Knappen") Date: Mon, 24 Aug 2015 11:39:49 +0200 Subject: Aw: Re: Square Brackets with Tick In-Reply-To: References: , Message-ID: An HTML attachment was scrubbed... 
URL: From wjgo_10009 at btinternet.com Mon Aug 24 05:00:32 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Mon, 24 Aug 2015 11:00:32 +0100 (BST) Subject: Square Brackets with Tick In-Reply-To: References: <55D8B23D.6070405@ix.netcom.com> <20150822224706.5680b7d3@JRWUBU2> <55D9197A.1040700@ix.netcom.com> Message-ID: <20720148.17379.1440410432467.JavaMail.defaultUser@defaultHost> Looking at the document http://www.unicode.org/L2/L1999/99159.pdf that has been mentioned, the four bracket characters are therein described as follows. 4X1F O LEFT BRACKET, REVERSE SOLIDUS TOP CORNER 4X20 C RIGHT BRACKET, REVERSE SOLIDUS BOTTOM CORNER 4X21 O LEFT BRACKET, SOLIDUS BOTTOM CORNER 4X22 C RIGHT BRACKET, SOLIDUS TOP CORNER So it looks like the pairings in Unicode today are as originally intended. May I suggest a (possibly new) use for the brackets please? When a person is transcribing a document into a computer, perhaps a historical document, the first pair could be used to indicate a transcriber's note that the text between the brackets was crossed out in the original document, and the second pair could be used to indicate a transcriber's note that the text between the brackets was crossed out in the original document yet had been reinstated in the original document, either by the word stet being placed next to the crossed-out text or otherwise. William Overington 24 August 2015 -------------- next part -------------- An HTML attachment was scrubbed... URL: From shawnlandden at tuta.io Mon Aug 24 14:03:54 2015 From: shawnlandden at tuta.io (Shawn Landden) Date: Mon, 24 Aug 2015 19:03:54 +0000 (UTC) Subject: Arabic ligatures Message-ID: From github. https://github.com/golang/go/issues/12298 Arabic ligatures have been deprecated[1], despite a need for both ligature and non-ligature versions of the same glyphs. Amiri uses contextual alternatives for ????.? 
These ligatures are used in religious documents[2] via pictures, which seems to be what the current Unicode standard recommends. Unlike the presentation forms, there is a case for these phrases and formulas to be available both in ligature and non-ligature form. These ligatures should be non-deprecated and subject to canonical decomposition, rather than compatibility decomposition. http://www.unicode.org/reports/tr15 [1] https://en.wikipedia.org/wiki/Arabic_script_in_Unicode#Word_ligatures [2] http://www.mujahideenryder.net/pdf/WhoAretheDisbelievers.pdf -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Mon Aug 24 14:35:14 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 24 Aug 2015 20:35:14 +0100 Subject: Square Brackets with Tick In-Reply-To: <20720148.17379.1440410432467.JavaMail.defaultUser@defaultHost> References: <55D8B23D.6070405@ix.netcom.com> <20150822224706.5680b7d3@JRWUBU2> <55D9197A.1040700@ix.netcom.com> <20720148.17379.1440410432467.JavaMail.defaultUser@defaultHost> Message-ID: <20150824203514.58ceb74d@JRWUBU2> On Mon, 24 Aug 2015 11:00:32 +0100 (BST) William_J_G Overington wrote: > Looking at the document > http://www.unicode.org/L2/L1999/99159.pdf > that has been mentioned, the four bracket characters are therein > described as follows. > 4X1F O LEFT BRACKET, REVERSE SOLIDUS TOP CORNER > 4X20 C RIGHT BRACKET, REVERSE SOLIDUS BOTTOM CORNER > 4X21 O LEFT BRACKET, SOLIDUS BOTTOM CORNER > 4X22 C RIGHT BRACKET, SOLIDUS TOP CORNER > So it looks like the pairings in Unicode today are as originally > intended. How so? There are two relevant pairings in Unicode - the Bidi_Mirroring_Glyph and Bidi_Paired_Bracket. Both pair the 1st and the 4th together and the 2nd and the 3rd together. Now, Bidi_Mirroring_Glyph is based mainly on appearance (or have I missed a caveat?), and that seems to be correct. 
Bidi_Paired_Bracket is based on semantics, which are difficult to be sure of when we have no examples of use. Indeed, some quote marks are notoriously inconsistent from language to language. I am assuming that it is better to render reversed ⊄ U+2284 NOT A SUBSET OF (= U+2282 SUBSET OF + U+0338 COMBINING LONG SOLIDUS OVERLAY), rather than the unreversed glyph of ⊅ U+2285 NOT A SUPERSET OF (= U+2283 SUPERSET OF + U+0338), despite U+2284 and U+2285 being a bidi-mirroring pair. If one took the view that a combining solidus didn't mirror (as indeed, it doesn't according to the UCD), and that the ticks are unhidden parts of solidi, then the Bidi_Mirroring_Glyph properties would be wrong! Good taste is probably the only way through the bidi mirroring maze. Richard. From fantasai.lists at inkedblade.net Sun Aug 23 11:13:44 2015 From: fantasai.lists at inkedblade.net (fantasai) Date: Sun, 23 Aug 2015 18:13:44 +0200 Subject: Why doesn't Ideographic (ID) in UAX#14 have half-width katakana? In-Reply-To: References: <553EEE6D.2020004@ga2.so-net.ne.jp> <553EFB2E.3010808@hiroshima-u.ac.jp> <55403267.9060202@att.net> <55438AFB.6020000@att.net> <55439AE4.4020109@hiroshima-u.ac.jp> <5543AE4B.5020904@att.net> <5543D03B.80603@ix.netcom.com> <55467CAF.4080401@ix.netcom.com> <4A7F11C4-2F5B-4156-844C-C21E543FCFC7@apple.com> <55D4AD10.1080104@inkedblade.net> <55D4B2C3.2010000@att.net> Message-ID: <55D9F138.4090201@inkedblade.net> On 08/20/2015 08:18 AM, Koji Ishii wrote: > Right, this should be applied to only where currently AL. > > The basic idea is that, full width is a concept to use a character in an "imported" manner and thus different characteristics > are applied, while half width is a concept of saving screen real estate and/or for legacy cultural usages so the > characteristics should be the same as its full width counterpart, except the width. This sounds good to me. Let me know if the UTC wants an official CSSWG resolution of support on this change, I can arrange for that if necessary. 
~fantasai From wjgo_10009 at btinternet.com Tue Aug 25 03:54:29 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Tue, 25 Aug 2015 09:54:29 +0100 (BST) Subject: Square Brackets with Tick In-Reply-To: <20150824203514.58ceb74d@JRWUBU2> References: <55D8B23D.6070405@ix.netcom.com> <20150822224706.5680b7d3@JRWUBU2> <55D9197A.1040700@ix.netcom.com> <20720148.17379.1440410432467.JavaMail.defaultUser@defaultHost> <20150824203514.58ceb74d@JRWUBU2> Message-ID: <26422784.15492.1440492869448.JavaMail.defaultUser@defaultHost> Richard Wordingham wrote: > On Mon, 24 Aug 2015 11:00:32 +0100 (BST) William_J_G Overington wrote: >> Looking at the document >> http://www.unicode.org/L2/L1999/99159.pdf >> that has been mentioned, the four bracket characters are therein described as follows. >> 4X1F O LEFT BRACKET, REVERSE SOLIDUS TOP CORNER >> 4X20 C RIGHT BRACKET, REVERSE SOLIDUS BOTTOM CORNER >> 4X21 O LEFT BRACKET, SOLIDUS BOTTOM CORNER >> 4X22 C RIGHT BRACKET, SOLIDUS TOP CORNER >> So it looks like the pairings in Unicode today are as originally intended. > How so? I was simply observing that the original pairings had the first-listed pair of brackets listed using REVERSE SOLIDUS and had the second-listed pair of brackets listed using SOLIDUS contrasting that clear pairing of the brackets with the use, in the encoding into Unicode, of TICK in the listing for each of the four of the bracket characters that are being discussed in this thread. William Overington 25 August 2015 From doug at ewellic.org Tue Aug 25 10:31:09 2015 From: doug at ewellic.org (Doug Ewell) Date: Tue, 25 Aug 2015 08:31:09 -0700 Subject: Arabic ligatures Message-ID: <20150825083109.665a7a7059d7ee80bb4d670165c8327d.f0b345d455.wbe@email03.secureserver.net> Shawn Landden wrote: > Arabic ligitures have been deprecated[1], despite a need for both > ligitures and non-ligature versions of the same glyphs. 
The only Arabic character that is deprecated in the standard is U+0673 ARABIC LETTER ALEF WITH WAVY HAMZA BELOW. The Wikipedia article cited as "[1]" does not claim otherwise. > Amiri uses contextual alternatives for ????. These ligatures are > used in religious documents[2] via pictures, which seems to be what > the current Unicode standard recommends. What is your source for this? > Unlike the presentation forms, there is case for these phrases and > formulas to be available both in ligature and non-ligature form. All Arabic letters and combinations can be rendered in ligated or non-ligated forms as needed using some combination of ZWJ and ZWNJ. See TUS 8.0, Section 9.2. > These ligatures should be non-deprecated and subject to canonical > decomposition, rather than compatibility decomposition. Section 9.2 (page 386 ff.) explains the Arabic Presentation Forms-A block (U+FB50?U+FDFF) in greater detail. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From richard.wordingham at ntlworld.com Tue Aug 25 14:07:35 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 25 Aug 2015 20:07:35 +0100 Subject: Square Brackets with Tick In-Reply-To: <26422784.15492.1440492869448.JavaMail.defaultUser@defaultHost> References: <55D8B23D.6070405@ix.netcom.com> <20150822224706.5680b7d3@JRWUBU2> <55D9197A.1040700@ix.netcom.com> <20720148.17379.1440410432467.JavaMail.defaultUser@defaultHost> <20150824203514.58ceb74d@JRWUBU2> <26422784.15492.1440492869448.JavaMail.defaultUser@defaultHost> Message-ID: <20150825200735.6c88cda8@JRWUBU2> On Tue, 25 Aug 2015 09:54:29 +0100 (BST) William_J_G Overington wrote: > Richard Wordingham wrote: > > > On Mon, 24 Aug 2015 11:00:32 +0100 (BST) > William_J_G Overington wrote: > > >> Looking at the document > >> http://www.unicode.org/L2/L1999/99159.pdf > >> that has been mentioned, the four bracket characters are therein > >> described as follows. 
> > >> 4X1F O LEFT BRACKET, REVERSE SOLIDUS TOP CORNER > >> 4X20 C RIGHT BRACKET, REVERSE SOLIDUS BOTTOM CORNER > >> 4X21 O LEFT BRACKET, SOLIDUS BOTTOM CORNER > >> 4X22 C RIGHT BRACKET, SOLIDUS TOP CORNER > >> So it looks like the pairings in Unicode today are as originally > >> intended. > > How so? > I was simply observing that the original pairings had the > first-listed pair of brackets listed using REVERSE SOLIDUS and had > the second-listed pair of brackets listed using SOLIDUS contrasting > that clear pairing of the brackets with the use, in the encoding into > Unicode, of TICK in the listing for each of the four of the bracket > characters that are being discussed in this thread. You said the 'pairings in Unicode'. With the exception of decimal digits, the scalar values of assigned characters have no *formal* relationship to their interpretation. The scalar values are about as significant as the difference between canonically equivalent non-Greek, non-Korean sequences. At best the different sequences give a hint of what the author thinks about the character. For example U+00E9 LATIN SMALL LETTER E WITH ACUTE suggests it may be thought of as a character, while the decomposed sequence <U+0065, U+0301> suggests that it may be two characters - the diacritic could be a length mark or a tone. The distinction is not to be relied upon - normalisation would obliterate it. Richard. 
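The obliteration Richard describes is easy to demonstrate with any normalisation API; for example, with Python's standard unicodedata module:

```python
import unicodedata

precomposed = "\u00E9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 + U+0301 COMBINING ACUTE ACCENT

# The two spellings are canonically equivalent but distinct code point
# sequences; NFC and NFD each collapse them to a single form, erasing
# whatever hint the author's original choice may have carried.
assert precomposed != decomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```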
From asmus-inc at ix.netcom.com Tue Aug 25 17:26:44 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Tue, 25 Aug 2015 15:26:44 -0700 Subject: Square Brackets with Tick In-Reply-To: <20150825200735.6c88cda8@JRWUBU2> References: <55D8B23D.6070405@ix.netcom.com> <20150822224706.5680b7d3@JRWUBU2> <55D9197A.1040700@ix.netcom.com> <20720148.17379.1440410432467.JavaMail.defaultUser@defaultHost> <20150824203514.58ceb74d@JRWUBU2> <26422784.15492.1440492869448.JavaMail.defaultUser@defaultHost> <20150825200735.6c88cda8@JRWUBU2> Message-ID: <55DCEBA4.6020703@ix.netcom.com> On 8/25/2015 12:07 PM, Richard Wordingham wrote: > On Tue, 25 Aug 2015 09:54:29 +0100 (BST) > William_J_G Overington wrote: > >> Richard Wordingham wrote: >> >>> On Mon, 24 Aug 2015 11:00:32 +0100 (BST) >> William_J_G Overington wrote: >> >>>> Looking at the document >>>> http://www.unicode.org/L2/L1999/99159.pdf >>>> that has been mentioned, the four bracket characters are therein >>>> described as follows. >>>> 4X1F O LEFT BRACKET, REVERSE SOLIDUS TOP CORNER >>>> 4X20 C RIGHT BRACKET, REVERSE SOLIDUS BOTTOM CORNER >>>> 4X21 O LEFT BRACKET, SOLIDUS BOTTOM CORNER >>>> 4X22 C RIGHT BRACKET, SOLIDUS TOP CORNER >>>> So it looks like the pairings in Unicode today are as originally >>>> intended. >>> How so? >> I was simply observing that the original pairings had the >> first-listed pair of brackets listed using REVERSE SOLIDUS and had >> the second-listed pair of brackets listed using SOLIDUS contrasting >> that clear pairing of the brackets with the use, in the encoding into >> Unicode, of TICK in the listing for each of the four of the bracket >> characters that are being discussed in this thread. > You said the 'pairings in Unicode'. With the exception of decimal > digits, the scalar values of assigned characters have no *formal* > relationship to their interpretation. 
The scalar values are about as > significant as the difference between canonically equivalent > non-Greek, non-Korean sequences. At best the different sequences give a > hint of what the author thinks about the character. For example U+00E9 > LATIN SMALL LETTER E WITH ACUTE suggests it may be thought of as a > character, while the decomposed sequence <U+0065, U+0301> suggests that it may be two > characters - the diacritic could be a length mark or a tone. The > distinction is not to be relied upon - normalisation would obliterate > it. > I think William makes a reasonable point that conceiving of the "ticks" as angled lines and then naming their direction in pairs potentially reinforces the notion that the sets with matching naming were intended as pairs. While this is being bandied about here on the list, an offline effort is underway to see whether it's possible to find out more about the origin and potential use of these marks - prior to their encoding in Unicode. We know they came from some SGML entity sets, but how and why they got into those is still a bit of a mystery; locked away in the head of the original creator of these sets. We may never get a more definite answer, unless someone here is conversant with whatever field of mathematics uses these brackets, or knows someone who is. A./ From duerst at it.aoyama.ac.jp Thu Aug 27 02:39:31 2015 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Thu, 27 Aug 2015 16:39:31 +0900 Subject: Why doesn't Ideographic (ID) in UAX#14 have half-width katakana? In-Reply-To: References: <553EEE6D.2020004@ga2.so-net.ne.jp> <553EFB2E.3010808@hiroshima-u.ac.jp> Message-ID: <55DEBEB3.5090201@it.aoyama.ac.jp> Sorry to be late. Just some background information. On 2015/04/28 14:57, Makoto Kato wrote: > Although I read JIS X 4051, it doesn't define that half-width katakana > and full-width katakana are differently. I was on the committee that updated JIS X 4051 (mostly liaison/observer role). The chair of that committee was Prof. Shibano (?? 
??), who was also chair of the committee responsible for the Japanese character standards as well as chair of ISO/IEC JTC1 SC2. I very well remember that he explained at one point that as far as the standards were concerned, full-width and half-width versions were considered one and the same character. In modern terms, the standards' view was that single-byte and double-byte encodings of these characters were just different "encoding forms" of one and the same abstract character. This view is confirmed e.g. by the character names used in the 1997 version (confirmed 2002) of JIS X 0201, which are just "KATAKANA LETTER A",... Anybody interested can dig deeper, JIS X 0201 was just what was most easily accessible to me now. The justification behind this is that they are linguistically not different at all, and that they were intended just as a fallback due to technology (memory, display resolution) limitations. In practice, technical restrictions in early limitations (one byte == one (half-width) character cell) led to a typographic distinction. The fact that half-width Kana used less space was exploited in fixed-pitch screen design. That led to a desire to keep the distinction when round-tripping via Unicode, and thus to different character names. Regards, Martin. From duerst at it.aoyama.ac.jp Thu Aug 27 03:04:16 2015 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Thu, 27 Aug 2015 17:04:16 +0900 Subject: Why doesn't Ideographic (ID) in UAX#14 have half-width katakana? In-Reply-To: <55DEBEB3.5090201@it.aoyama.ac.jp> References: <553EEE6D.2020004@ga2.so-net.ne.jp> <553EFB2E.3010808@hiroshima-u.ac.jp> <55DEBEB3.5090201@it.aoyama.ac.jp> Message-ID: <55DEC480.20705@it.aoyama.ac.jp> Sorry, one correction: On 2015/08/27 16:39, Martin J. Dürst wrote: > In practice, technical restrictions in early limitations (one byte == > one (half-width) character cell) led to a typographic distinction. 
The > fact that half-width Kana used less space was exploited in fixed-pitch > screen design. That lead to a desire to keep the distinction when > round-tripping via Unicode, and thus to different character names. "early limitations" -> "early technologies. Regards, Martin. From charupdate at orange.fr Thu Aug 27 14:49:45 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Thu, 27 Aug 2015 21:49:45 +0200 (CEST) Subject: Thai Word Breaking In-Reply-To: <20150822143530.29f1e883@JRWUBU2> References: <20150822143530.29f1e883@JRWUBU2> Message-ID: <2062952758.26825.1440704985138.JavaMail.www@wwinf1g16> On 22 Aug 2015 at 15:47, Richard Wordingham wrote: > I'm trying to work out the meaning of TUS 8.0 Section 23.2. > > To do Thai word breaking properly, one needs to do a semantic analysis > of the text to do the equivalent of resolving the equivalent of > 'humanevents' into 'human events' rather than 'humane vents'. One also > needs to cope with unknown and misspelt words. (A lot of effort has > been devoted to avoid going to the extreme of doing semantic analysis.) > However, it is possible to read Section 23.2 as prohibiting the use of > certain information, and I would like to check whether this is the > intended meaning. > > The opening paragraph seems clear enough on first reading: > > "The effect of layout controls is specific to particular text processes. > As much as possible, lay-out controls are transparent to those text > processes for which they were not intended. In other words, their > effects are mutually orthogonal." > > However, my first question is, "Are paragraph boundaries > directly admissible as evidence for or against word boundaries not > adjacent to them?". For example, most Thai word breakers would not > regard a paragraph boundary as any more significant than a > phrase-delimiting space. However, a paragraph boundary often indicates > a change of topic. 
> > My second question is, "Are line breaks admissible as evidence for > or against word boundaries not adjacent to them?" For example, if a > phrase makes heavy use of U+200B ZERO WIDTH SPACE (ZWSP), one may deduce > that it is likely that all word boundaries within it are marked > explicitly. This example is more useful for Khmer than for Thai, for > whereas Cambodians were once taught to mark word boundaries, Thais > rarely use ZWSP to mark word boundaries. > > My third question is, "Is the absence of a line break opportunity > admissible as evidence for or against a word boundary?". Here I > see conflicting signals. > > There is a character U+2060 WORD JOINER (WJ) which *used* to be regarded > as the counterpart of ZWSP. The understanding was that ZWSP marked a > word boundary and provided a line-break opportunity, while WJ denied > both. This, however, is no longer the case. To quote the TUS section > about WJ: > > P1: (Ignored) > > P2S1: The word joiner must not be confused with the zero width joiner > or the combining grapheme joiner, which have very different functions. > > P2S2: In particular, inserting a word joiner between two characters has > no effect on their ligating and cursive joining behavior. > > P2S3: The word joiner should be ignored in contexts other than line > breaking. > > P2S4: Note in particular that the word joiner is ignored for word > segmentation. > > P2S5: (See Unicode Standard Annex #29, "Unicode Text Segmentation.") > > Paragraph 2 Sentence 3 (P2S3) appears to rule out its use in > word-breaking, but perhaps it does not if line-breaking is being used > as evidence for word boundaries. > > P2S4 has three very different interpretations: > > (i) This is an assertion of fact, and may therefore be incorrect. > > (ii) The word 'is' is sloppy wording for 'should be'. Section 23.2 > contains much sloppier wording, as I have already advised members of > the UTC (4 July 2015). 
> > (iii) This is a deduction from other parts of the specification. Now, > if P2S4 said 'is normally ignored for word segmentation', that would > have made sense, for that applies to the default word boundary > specification in UAX#29. However, just before Section 4.1, UAX#29 > explains that it does not specify what happens for word boundary > determination in Thai! (It does constrain what happens, though.) > > At the end of UAX#29 Section 6.2, there is the provision, "The Ignore > rules should not be overridden by tailorings, with the possible > exception of remapping some of the Format characters to other > classes." To accord with the user perceptions of Unicode-aware > people who work with SE Asian scripts, I am tempted to ask for CLDR > to tailor the word-breaking algorithms for the corresponding languages > so that the word-breaking classes of WJ (and ZWNBSP) are changed from > Format to MidLetter. That would match the widespread old *perception* > that there should be no word break in a sequence <Thai letter, (non-spacing > mark,)* WJ, Thai letter>. However, there are several objections: > > (a) Perhaps P2S3 and P2S4 prohibit this. > > (b) If the word-break property of Thai letters falls back to Other, > there would still be a word break between them. > > (c) If the word-break property of Thai letters fell back to ALetter, > an old suggestion, WJ would have no effect on the presence of a word > break. > > (d) If Thai word breaking assigns word-break classes to each letter > (gc=Lo), then word boundaries can be suppressed by choosing the classes > appropriately. Non-spacing Thai vowels are very relevant to Thai > word-breaking, but formally are 'ignored'. WJ could be 'ignored' in > exactly the same way. Still nobody answered the questions Richard Wordingham raised five days ago. I'm very busy and can hardly channel off any time for concerns not related so far, except when I believe there's some need, as this is a discussion list. 
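One practical footnote to the quoted questions: neither WJ nor ZWNBSP is a White_Space character, so naive whitespace tokenisers already keep both word-internal; a quick Python check (far short of a real UAX #29 segmenter):

```python
# U+2060 WORD JOINER and U+FEFF ZWNBSP are format (Cf) characters,
# not White_Space, so whitespace-based splitting leaves them inside
# a "word" rather than breaking on them.
text = "fix\u2060ture and fix\uFEFFture"
tokens = text.split()
print(tokens)  # -> ['fix\u2060ture', 'and', 'fix\ufeffture']
```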
However the Word Joiner topic made me launch a thread too, which has been thankfully answered. Now I feel that even if the WJ is apparently tailored to delimit words in mainstream word processors, the Standard denies this property, and Richard agrees if I've well understood. The criticism he works out should IMHO be fed into the 9.0 workflow. Any comments from Thai users, implementers, and scientists? Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Thu Aug 27 15:59:14 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Thu, 27 Aug 2015 22:59:14 +0200 (CEST) Subject: Thai Word Breaking Message-ID: <628722595.28314.1440709154153.JavaMail.www@wwinf1h13> On 27 Aug 2015 at 21:49, I wrote: > However the Word Joiner topic made me launch a thread too, which has been thankfully answered. Please read: [...] which has been answered and I'm thankful. Apart, I've an off-topic: Keyboard source files can be converted and compiled with the Kbdutool.exe of MSKLC even when they have not entirely been generated by the software. In other words, we are invited to add chained dead keys directly in the .klc file, because they are supported by Kbdutool, and run this tool thanks to its command line UI. Among the switches, we find also one to get the C sources only. I would have e-mailed this the day the whole process is working (to date, I can just include my custom header, via an #include at the end of the kbd.h in the \inc\ directory), as there is no such switch to get Kbdutool.exe compile from the C sources. 
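A side note on entering supplementary-plane characters in such source files: .klc tables and the generated C sources work in UTF-16 code units, so anything beyond the BMP must be written as a surrogate pair. The standard split, sketched in Python (the helper name is illustrative, not an MSKLC API):

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary-plane code point into UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

# U+1D11E MUSICAL SYMBOL G CLEF -> D834 DD1E
print([f"{u:04X}" for u in to_surrogate_pair(0x1D11E)])
```

The right shift by 10 is indeed just an integer division by 1024, which is why the same computation can be written as plain arithmetic in a spreadsheet formula.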
IMHO what we must not do, is to insist to have graphic UIs for the whole keyboard layout creation, because experience shows that keyboard editing, especially dead key repertories, as well as the allocation table and ligature table, are best done in spreadsheets (where we can also have the diagrams), with the whole NamesList (or the part containing identifiers and heads/subheads), and the surrogate pairs beside in two formula-generated columns (using little hex conversion tables because Excel can AFAIK not handle the >> and << operators (this is >>, << in the case it disappears). Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Thu Aug 27 18:09:52 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 28 Aug 2015 00:09:52 +0100 Subject: Thai Word Breaking In-Reply-To: <2062952758.26825.1440704985138.JavaMail.www@wwinf1g16> References: <20150822143530.29f1e883@JRWUBU2> <2062952758.26825.1440704985138.JavaMail.www@wwinf1g16> Message-ID: <20150828000952.14f6ca50@JRWUBU2> On Thu, 27 Aug 2015 21:49:45 +0200 (CEST) Marcel Schneider wrote: > On 22 Aug 2015 at 15:47, Richard Wordingham wrote: > Still nobody answered the questions Richard Wordingham raised five > days ago. There are not many people who are in a position to say what unclear sections of TUS are intended to mean. I may have scared them into silence by noting that people changing code because of one particular *new* sentence in Section 23.2, namely: > > P2S4: Note in particular that the word joiner is ignored for word > > segmentation. are at risk (but see below) of putting themselves in breach of the UK's 'Equality Act 2010'; more generally, they may be in breach of transpositions of the EU Racial Equality Directive (2000/43/EC). You don't need to have racialist intentions to be in breach. > > (ii) The word 'is' is sloppy wording for 'should be'. 
Section 23.2 > > contains much sloppier wording, as I have already advised members of > > the UTC (4 July 2015). This comment applies to the part of Section 23.2 referring to U+FEFF ZERO WIDTH NO-BREAK SPACE (ZWNBSP). UTC members were advised that to be consistent, it should have changes corresponding to those made for WJ. Such changes weren't made to the section on ZWNBSP, and so I can read Section 23.2 as saying that ZWNBSP can be used to mark word boundaries whereas WJ cannot. Reading the standard this way would probably protect the writers of text editors (including word processors) from the European legislation against indirect discrimination. It's still a shame about the degradation of old text that uses WJ instead of ZWNBSP, but it should still render fine if one switches spell-checking off. Word counts will change, though. Richard. From charupdate at orange.fr Sat Aug 29 15:33:57 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Sat, 29 Aug 2015 22:33:57 +0200 (CEST) Subject: Thai Word Breaking In-Reply-To: <20150828000952.14f6ca50@JRWUBU2> References: <20150822143530.29f1e883@JRWUBU2> <2062952758.26825.1440704985138.JavaMail.www@wwinf1g16> <20150828000952.14f6ca50@JRWUBU2> Message-ID: <1275427180.13085.1440880437214.JavaMail.www@wwinf1n31> On 28 Aug 2015 at 01:19, Richard Wordingham wrote: > I may have scared them into > silence by noting that people changing code because of one particular > *new* sentence in Section 23.2, namely: > > > > P2S4: Note in particular that the word joiner is ignored for word > > > segmentation. > > are at risk (but see below) of putting themselves in breach of the UK's > 'Equality Act 2010'; more generally, they may be in breach of > transpositions of the EU Racial Equality Directive (2000/43/EC). You > don't need to have racialist intentions to be in breach. [?] > so I can read > Section 23.2 as saying that ZWNBSP can be used to mark word boundaries > whereas WJ cannot. 
Reading the standard this way would probably protect > the writers of text editors (including word processors) from the > European legislation against indirect discrimination. That?s awesome! So when I have the ordinal indicators both on *one* key because I?need the A and O for German precomposed, and have the ? in the Base shift state and the ? in the Shift shift state (because the primary locale is French, which does use ? but not ?, and BTW the ?? is on N, too), may I be accused of discrimination? If so, I must remove the ordinal indicators from the [I] key and have them in Compose only (Compose, a, _; Compose, o, _). Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Sat Aug 29 15:45:50 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Sat, 29 Aug 2015 22:45:50 +0200 (CEST) Subject: Streamline keyboard programming (was: Re: Thai Word Breaking) In-Reply-To: <628722595.28314.1440709154153.JavaMail.www@wwinf1h13> References: <628722595.28314.1440709154153.JavaMail.www@wwinf1h13> Message-ID: <1323552145.13187.1440881150607.JavaMail.www@wwinf1n31> On 27 Aug 2015 at 23:09, I wrote: > Keyboard source files can be converted and compiled with the Kbdutool.exe of MSKLC even when they have not entirely been generated by the software. In other words, we are invited to add chained dead keys directly in the .klc file, because they are supported by Kbdutool, and run this tool thanks to its command line UI. Among the switches, we find also one to get the C sources only. > I would have e-mailed this the day the whole process is working (to date, I can just include my custom header, via an #include at the end of the kbd.h in the \inc\ directory), as there is no such switch to get Kbdtool.exe compile from the C sources. 
> IMHO what we must not do, is to insist to have graphic UIs for the whole keyboard layout creation, because experience shows that keyboard editing, especially dead key repertories, as well as the allocation table and ligature table, are best done in spreadsheets (where we can also have the diagrams), with the whole NamesList (or the part containing identifiers and heads/subheads), and the surrogate pairs beside in two formula-generated columns (using little hex conversion tables because Excel can AFAIK not handle the >> and << operators (this is >>, << in the case it disappears). Philippe Verdy kindly made me aware that the binary right shift is an integer division. Yep, I didn't notice, and clumsily removed two hex digits and figured out how to get the next one... Thanks to Philippe's advice, the C formulas from the Unicode Frequently Asked Question stand now as short Excel formulas in my NamesList spreadsheet. Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Sat Aug 29 15:55:31 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Sat, 29 Aug 2015 22:55:31 +0200 (CEST) Subject: Thai Word Breaking Message-ID: <424376774.13294.1440881731423.JavaMail.www@wwinf1n31> On 29 Aug 2015 at 22:33 (twenty minutes ago), I wrote: > So when I have the ordinal indicators both on *one* key because I?need the A and O for German precomposed, and have the ? in the Base shift state and the ? in the Shift shift state (because the primary locale is French, which does use ? but not ?, and BTW the ?? is on N, too), may I be accused of discrimination? If so, I must remove the ordinal indicators from the [I] key and have them in Compose only (Compose, a, _; Compose, o, _). Arghhh. On the [I] key, I've ??? on Kana, and ??? on Shift+Kana. Sorry. ? I've some other news but they must wait till tomorrow. Seemingly I'm too tired this evening. ? 
Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Sat Aug 29 18:09:00 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 30 Aug 2015 00:09:00 +0100 Subject: Thai Word Breaking In-Reply-To: <1275427180.13085.1440880437214.JavaMail.www@wwinf1n31> References: <20150822143530.29f1e883@JRWUBU2> <2062952758.26825.1440704985138.JavaMail.www@wwinf1g16> <20150828000952.14f6ca50@JRWUBU2> <1275427180.13085.1440880437214.JavaMail.www@wwinf1n31> Message-ID: <20150830000900.4e2339ad@JRWUBU2> On Sat, 29 Aug 2015 22:33:57 +0200 (CEST) Marcel Schneider wrote: > So when I have the ordinal indicators both on *one* key because > I?need the A and O for German precomposed, and have the ? in the Base > shift state and the ? in the Shift shift state (because the primary > locale is French, which does use ? but not ?, and BTW the ?? is on N, > too), may I be accused of discrimination? Your defence would be that that "practice is objectively justified by a legitimate aim and the means of achieving that aim are appropriate and necessary" - 2000/43/EC Article 1 Paragraph 2(b). Mock not. In the UK, needlessly requiring that a job applicant have a driving licence is unlawful discrimination against women. Not making provision for the hard of hearing at a query desk can be unlawful discrimination - I don't remember whether it was by disability or simply on the basis of age. I'm not sure to what extent these are common EU law and to what extent these are just British law. I've got some web pages where colour-coding is used. It looks as though I've now supposed to find a way of switching the colours to help those with impaired colour vision. Perhaps I'll just have to withdraw the pages. Richard. From tgwizard at gmail.com Sat Aug 29 17:47:12 2015 From: tgwizard at gmail.com (Adam Renberg) Date: Sat, 29 Aug 2015 22:47:12 +0000 Subject: Wrong character code for HELM SYMBOL in TR 51 Unicode Emoji? 
Message-ID: Hi, I've just read through the Unicode Technical Report #51 Unicode Emoji [1], and I have a question. In section 3.3 Methodology [2], third paragraph, it says: "This document takes a functional view regarding the identification of emoji: pictographs are categorized as emoji when it is reasonable to give them an emoji presentation, and where they are sufficiently distinct from other emoji characters. Symbols with a graphical form that people may treat as pictographs, such as U+2615 HELM SYMBOL (introduced in Unicode 3.0) may be included." However, when I look up the HELM SYMBOL, it seems to have code U+2388 [3][4][5]. The character with code U+2615 is HOT BEVERAGE [6][7]. Is this a mistake in the technical report? Best regards, Adam Renberg [1]: http://www.unicode.org/reports/tr51/index.html [2]: http://www.unicode.org/reports/tr51/index.html#Methodology [3]: http://www.unicode.org/charts/PDF/U2300.pdf [4]: http://unicode-table.com/en/2388/ [5]: http://www.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.txt [6]: http://www.unicode.org/charts/PDF/U2600.pdf [7]: http://unicode-table.com/en/2615/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From gwalla at gmail.com Sat Aug 29 19:33:48 2015 From: gwalla at gmail.com (Garth Wallace) Date: Sat, 29 Aug 2015 17:33:48 -0700 Subject: Wrong character code for HELM SYMBOL in TR 51 Unicode Emoji? In-Reply-To: References: Message-ID: It certainly looks that way. In just the next paragraph it mentions "U+2615 HOT BEVERAGE (introduced in Unicode 4.0)" On Sat, Aug 29, 2015 at 3:47 PM, Adam Renberg wrote: > Hi, > > I've just read through the Unicode Technical Report #51 Unicode Emoji [1], > and I have a question. 
In section 3.3 Methodology [2], third paragraph, it > says: > > "This document takes a functional view regarding the identification of > emoji: pictographs are categorized as emoji when it is reasonable to give > them an emoji presentation, and where they are sufficiently distinct from > other emoji characters. Symbols with a graphical form that people may treat > as pictographs, such as U+2615 HELM SYMBOL (introduced in Unicode 3.0) may > be included." > > However, when I look up the HELM SYMBOL, it seems to have code U+2388 > [3][4][5]. The character with code U+2615 is HOT BEVERAGE [6][7]. > > Is this a mistake in the technical report? > > Best regards, > Adam Renberg > > [1]: http://www.unicode.org/reports/tr51/index.html > [2]: http://www.unicode.org/reports/tr51/index.html#Methodology > [3]: http://www.unicode.org/charts/PDF/U2300.pdf > [4]: http://unicode-table.com/en/2388/ > [5]: http://www.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.txt > [6]: http://www.unicode.org/charts/PDF/U2600.pdf > [7]: http://unicode-table.com/en/2615/ From mark at macchiato.com Sun Aug 30 05:21:14 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Sun, 30 Aug 2015 12:21:14 +0200 Subject: Wrong character code for HELM SYMBOL in TR 51 Unicode Emoji? In-Reply-To: References: Message-ID: Thanks, that's a mis-edit. The following text should have been removed: ". Symbols with a graphical form that people may treat as pictographs, ... are categorized as emoji" Mark *? Il meglio ? l?inimico del bene ?* On Sun, Aug 30, 2015 at 2:33 AM, Garth Wallace wrote: > It certainly looks that way. In just the next paragraph it mentions > "U+2615 HOT BEVERAGE (introduced in Unicode 4.0)" > > On Sat, Aug 29, 2015 at 3:47 PM, Adam Renberg wrote: > > Hi, > > > > I've just read through the Unicode Technical Report #51 Unicode Emoji > [1], > > and I have a question. 
In section 3.3 Methodology [2], third paragraph, > it > > says: > > > > "This document takes a functional view regarding the identification of > > emoji: pictographs are categorized as emoji when it is reasonable to give > > them an emoji presentation, and where they are sufficiently distinct from > > other emoji characters. Symbols with a graphical form that people may > treat > > as pictographs, such as U+2615 HELM SYMBOL (introduced in Unicode 3.0) > may > > be included." > > > > However, when I look up the HELM SYMBOL, it seems to have code U+2388 > > [3][4][5]. The character with code U+2615 is HOT BEVERAGE [6][7]. > > > > Is this a mistake in the technical report? > > > > Best regards, > > Adam Renberg > > > > [1]: http://www.unicode.org/reports/tr51/index.html > > [2]: http://www.unicode.org/reports/tr51/index.html#Methodology > > [3]: http://www.unicode.org/charts/PDF/U2300.pdf > > [4]: http://unicode-table.com/en/2388/ > > [5]: http://www.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.txt > > [6]: http://www.unicode.org/charts/PDF/U2600.pdf > > [7]: http://unicode-table.com/en/2615/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Mon Aug 31 08:27:17 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 31 Aug 2015 15:27:17 +0200 (CEST) Subject: Thai Word Breaking In-Reply-To: <20150830000900.4e2339ad@JRWUBU2> References: <20150822143530.29f1e883@JRWUBU2> <2062952758.26825.1440704985138.JavaMail.www@wwinf1g16> <20150828000952.14f6ca50@JRWUBU2> <1275427180.13085.1440880437214.JavaMail.www@wwinf1n31> <20150830000900.4e2339ad@JRWUBU2> Message-ID: <1216480868.9953.1441027637361.JavaMail.www@wwinf1f13> On 30 Aug 2015 at 01:17, Richard Wordingham wrote: > On Sat, 29 Aug 2015 22:33:57 +0200 (CEST) > Marcel Schneider wrote: > > > So when I have the ordinal indicators both on *one* key because > > I?need the A and O for German precomposed, and have the ? in the Base > > shift state and the ? 
in the Shift shift state [sorry: º in Kana=i, ª in Shift+Kana+i] > > (because the primary > > locale is French, which does use º but not ª, and BTW the ?? is on N, > > too), may I be accused of discrimination? > > Your defence would be that that "practice is objectively justified by a > legitimate aim and the means of achieving that aim are appropriate and > necessary" - 2000/43/EC Article 1 Paragraph 2(b). Mock not. This is IMHO a very good defence. It exactly matches the situation on the quoted keyboard layout. However, if there is any risk that anybody could take offence at finding the feminine ordinal indicator less easily accessed than the masculine one, be it in an unlucky moment of fatigue or personal disappointment, it will be wiser to take them away from that spot. That's what I've done in the wake of this discussion, and I thank you for having made us aware of the existence of these problems. (To lessen the damage, I've added a couple of additional Compose sequences: Compose, i, a → ª; Compose, i, o → º.) > In the UK, > needlessly requiring that a job applicant have a driving licence is > unlawful discrimination against women. Not making provision for the > hard of hearing at a query desk can be unlawful discrimination - I don't > remember whether it was by disability or simply on the basis of age. These legal provisions have considerable merit. IMHO one could even sum them up with respect to technology: nobody may either require needless technological skill from others, or neglect to provide the technological devices needed to relieve those suffering from age and/or disability. > I'm not sure to what extent these are common EU law and to what extent > these are just British law. I hope they are common EU law; otherwise they'll have to be implemented.
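The two Compose sequences mentioned above could be written, for illustration only, in the XCompose notation used on X11 systems (this is an assumption for the sketch: MSKLC-built Windows drivers implement such sequences through their own dead-key/chained-key tables, not through an XCompose file):

```
# Hypothetical XCompose entries for the two sequences described in the mail:
<Multi_key> <i> <a> : "ª"   # U+00AA FEMININE ORDINAL INDICATOR
<Multi_key> <i> <o> : "º"   # U+00BA MASCULINE ORDINAL INDICATOR
```

The point of a Compose fallback is exactly the one argued in the thread: a character demoted from a prime key position stays reachable through a mnemonic sequence.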
In the last food allergen emoji thread, William Overington already reported some British legal provisions that I found to be superior to those applicable in other G8 countries: http://www.unicode.org/mail-arch/unicode-ml/y2015-m07/0227.html > > I've got some web pages where colour-coding is used. It looks as > though I'm now supposed to find a way of switching the colours to help > those with impaired colour vision. Perhaps I'll just have to withdraw > the pages. Yet another point I must monitor, as I too use colour-coding in the layout overview, where some formatting styles are defined in Excel: one for CapsLock-sensitive key positions, one for KanaLock-sensitive key positions, one for dead keys, and so on. It's hard for me to work out which colours I can combine with which others to accommodate impaired colour vision. Perhaps there should be several colour schemes, along with one in black and white with grey tones. For PDF, this should be feasible in Excel by setting the styles' background and foreground colours. (The layout not being finished, it's still offline.) I hope you will find a solution that lets you keep your pages up. (BTW I'm quite curious, but that doesn't matter.) Marcel From charupdate at orange.fr Mon Aug 31 09:12:08 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 31 Aug 2015 16:12:08 +0200 (CEST) Subject: Custom keyboard source samples In-Reply-To: <1408769184.13369.1440084638814.JavaMail.www@wwinf1d11> References: <1408769184.13369.1440084638814.JavaMail.www@wwinf1d11> Message-ID: <1409194975.10963.1441030328840.JavaMail.www@wwinf1f13> > On 20 Aug 2015 at 03:19, Richard Wordingham wrote: > On Mon, 17 Aug 2015 13:51:26 +0200 (CEST) > Marcel Schneider wrote: >> Since yesterday I know a very simple way to get the source code (in >> C) of any MSKLC layout. > Is this legal? To me it smacks of reverse engineering, which is > prohibited under the MSKLC licence.
On 20 Aug 2015 at 17:41 I'd already discarded the licence as not applying, but found another prohibited practice, the unlawful use of the software, which may be used "only in certain ways", with no workarounds allowed: > Do I "work around any technical limitations in the software" by picking up the source code of the drivers it generates? This is my main concern about this practice. Are we allowed to use files generated by MSKLC that are not expressly provided to the user? > Further, are we allowed to use installation packages generated by MSKLC to install keyboard drivers other than those generated by MSKLC? To install keyboard drivers that exceed the limitations of MSKLC? > The questioning becomes even more troublesome when we remember that the WDK is mentioned in the MSKLC Help, and ask: > When we accept the invitation to switch to the WDK, must we package the drivers with the resources the driver kit comes with (while not knowing how to write an INF file!), or may we use the MSI and setup from MSKLC? > BTW we may wonder why and how MSKLC compiles a Windows-on-Windows driver, while, except for a few sparse mentions, nothing seems to be provided for WOW in the WDK. These questions are now about to be answered: http://www.siao2.com/2011/04/09/10151666.aspx "I believe it is even technically a EULA violation, though I can't imagine it ever being enforced in this case, I mean unless someone tried to sue Microsoft for negative consequences of using a keyboard created by this means. In which case the use of the hack would be a pretty reasonable indemnification of Microsoft here, since someone delved into the land of the specifically unsupported..." We note that Michael Kaplan does not give legal advice, as he *believes* things are so. But I'm quite inclined to take it for reality, as it makes Microsoft a more user-friendly company, one that cares about everybody being at ease with the computer, whatever the keyboard they prefer may look like.
Technically, there is a very straightforward way to get the source code of any keyboard from its .klc file. At http://www.siao2.com/2011/04/16/10154700.aspx we are shown how to edit the klc file and run Kbdutool on it to get the drivers, put these enhanced drivers into the package, and install. These include the WoW driver, which we can't compile in the WinDDK of the era. (Windows-on-Windows is the 32-bit subsystem running on 64-bit machines to support 32-bit applications. This requires a second keyboard driver, the one that is in the wow64 folder. If this DLL is missing when required, keyboard layout installation fails.) When we ask Kbdutool for its switches, it also shows us the -s switch for generating sources without compiling. So to get the C sources, we need to use the two switches -u and -s. Marcel From charupdate at orange.fr Mon Aug 31 13:52:53 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 31 Aug 2015 20:52:53 +0200 (CEST) Subject: Custom keyboard source samples Message-ID: <112795365.21952.1441047173312.JavaMail.www@wwinf1c24> > On 18 Aug 2015 at 10:09, Philippe Verdy wrote: > I don't know why these C source files need to be deleted so fast when they could just remain in the same folder as the saved .klc file. On 20 Aug 2015 at 17:42, I still didn't understand (my soothing answer of 18 Aug 2015 at 11:42 consisted of mere suppositions), and, angry at the idea of being outlawed when trying to fix some issues in keyboard customization practice, I even threw a bunch of subsidiary questions into the air: > Why is there no option "Keep the C sources / Delete the C sources"? > Why are there no menu items "Generate C source" and "Build from C source", or an option "Build from KLC source / Build from C source"? > That's what I wished to find in MSKLC when I learned about it.
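The Kbdutool steps described above can be sketched as a short command sequence. This is a sketch only: "Layout.klc" is a placeholder file name, and the exact list of emitted files is as reported in the cited blog posts rather than verified here.

```shell
# Sketch, assuming MSKLC's kbdutool.exe is on the PATH and Layout.klc
# is a layout saved from MSKLC ("Layout.klc" is a placeholder name).

# Running Kbdutool without arguments prints its usage and switches,
# among which the -s switch mentioned above:
kbdutool

# -u selects the Unicode (.klc) source format; -s generates the C
# sources without compiling them:
kbdutool -u -s Layout.klc

# The C sources (reportedly files such as Layout.C, Layout.H, Layout.RC
# and Layout.DEF) are then left in the working directory instead of
# being deleted after the build.
```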
> Figure that before, I had not even imagined that such sources could ever exist. Let's face it: Kbdutool stores the intermediate files, among which the much-coveted C sources, right in the working directory. You see them appearing and being deleted. And, as I wrote a short while ago, if we wish to take a glance, we're welcome: there's an extra switch for that. Only when Kbdutool is driven by MSKLC are all those files kept out of sight. To work out why this is so, the key is IMHO found in these two blog posts: http://www.siao2.com/2013/04/19/10409187.aspx http://www.siao2.com/2013/04/23/10413216.aspx It's as if everybody at Microsoft were traumatized by seeing one nation prefer a keyboard different from the usual one. Here we must instantly deliver the hidden part of the story, I mean, the part that is not taken into consideration: The Canadian Multilingual Standard keyboard is a genuine implementation of all parts of the ISO 9995 keyboard standard of the era. And say it loud: Nobody can be blamed for following an ISO standard. Today, the willingness of Canada to implement ISO 9995 has proved to be a very good idea. It's not only hard-wired logic: Apple Inc. offers its products with a physical Canadian Multilingual Standard keyboard. And they're gaining market share! All people using the Canadian Standard keyboard are full of praise, even if Canadians themselves find some details to improve. As for the cited utterances, they are an utter shame (not for the targeted standards body!). Never look for the Right Ctrl key elsewhere than where it's expected... (That's nothing about the layout, and all about ISO keyboard symbols.) Now back to topic. "People ask me all the time how they can add shift states like this in MSKLC, but I always refuse to answer. I don't want to encourage anyone else to author such layouts!" It's this "strange keyboard" trauma, following the kbd.h authors'
terminology, which for me explains the great fear of handing everybody the keys to keyboard programming. They don't miss the point: Once you get the C sources and have the means to compile them to drivers, you can do what you want, or nearly what you want. You have no limits other than Windows'. And these allow for pretty much anything. Consequently, the reason why so much is done to keep the C sources out of reach is care for the user. Users must be guided; they must be prevented from following their folly (please don't misunderstand: I'm talking of their *supposed* folly). Well, that's, to date, my *very long* answer. I don't quite know what is better to hope for this time: to be right, or to be wrong :-| Marcel From charupdate at orange.fr Mon Aug 31 14:00:31 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 31 Aug 2015 21:00:31 +0200 (CEST) Subject: Custom keyboard source samples Message-ID: <854246856.22207.1441047631099.JavaMail.www@wwinf1c24> [Note: I didn't see the announcement when I sent my e-mail. This cost me a lot of time and searching, which I accept, given the topic I'm working on and my wish to deliver a useful answer. I never wanted to interfere with ongoing threads or announcements.] Marcel From tgwizard at gmail.com Mon Aug 31 17:34:17 2015 From: tgwizard at gmail.com (Adam Renberg) Date: Mon, 31 Aug 2015 22:34:17 +0000 Subject: Wrong character code for HELM SYMBOL in TR 51 Unicode Emoji? In-Reply-To: References: Message-ID: Thank you for the clarification. Should the text be updated? On Sun, Aug 30, 2015 at 12:26 PM Mark Davis ?? wrote: > Thanks, that's a mis-edit. The following text should have been removed: > ". Symbols with a graphical form that people may treat as pictographs, > ... are categorized as emoji" > > > Mark > > *« Il meglio è l'inimico del bene »* > > On Sun, Aug 30, 2015 at 2:33 AM, Garth Wallace wrote: >> It certainly looks that way.
In just the next paragraph it mentions >> "U+2615 HOT BEVERAGE (introduced in Unicode 4.0)" >> >> On Sat, Aug 29, 2015 at 3:47 PM, Adam Renberg wrote: >> > Hi, >> > >> > I've just read through the Unicode Technical Report #51 Unicode Emoji >> [1], >> > and I have a question. In section 3.3 Methodology [2], third paragraph, >> it >> > says: >> > >> > "This document takes a functional view regarding the identification of >> > emoji: pictographs are categorized as emoji when it is reasonable to >> give >> > them an emoji presentation, and where they are sufficiently distinct >> from >> > other emoji characters. Symbols with a graphical form that people may >> treat >> > as pictographs, such as U+2615 HELM SYMBOL (introduced in Unicode 3.0) >> may >> > be included." >> > >> > However, when I look up the HELM SYMBOL, it seems to have code U+2388 >> > [3][4][5]. The character with code U+2615 is HOT BEVERAGE [6][7]. >> > >> > Is this a mistake in the technical report? >> > >> > Best regards, >> > Adam Renberg >> > >> > [1]: http://www.unicode.org/reports/tr51/index.html >> > [2]: http://www.unicode.org/reports/tr51/index.html#Methodology >> > [3]: http://www.unicode.org/charts/PDF/U2300.pdf >> > [4]: http://unicode-table.com/en/2388/ >> > [5]: http://www.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.txt >> > [6]: http://www.unicode.org/charts/PDF/U2600.pdf >> > [7]: http://unicode-table.com/en/2615/ >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL:
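The code point mix-up settled in this thread is easy to check directly against the Unicode Character Database, for instance with Python's unicodedata module (any UCD-backed tool would do; this is just an illustration):

```python
import unicodedata

# TR51's mis-edited sentence paired the name HELM SYMBOL with the code
# point U+2615; the UCD shows they belong to two different characters.
print(unicodedata.name("\u2615"))  # HOT BEVERAGE (Miscellaneous Symbols block)
print(unicodedata.name("\u2388"))  # HELM SYMBOL (Miscellaneous Technical block)
```

Running this confirms the correction made in the thread: U+2615 is HOT BEVERAGE, and HELM SYMBOL lives at U+2388.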