From ismeta.wikt at gmail.com Tue Nov 1 08:51:53 2016
From: ismeta.wikt at gmail.com (IS META)
Date: Tue, 1 Nov 2016 13:51:53 +0000
Subject: U+1FBD GREEK KORONIS: ᾽
Message-ID:

Dear subscribers to the Unicode public general mail list,

Can anyone tell me what the intended use(s) of the character ᾽ (U+1FBD GREEK KORONIS) is/are, please? Or, failing that, where I can find out?

Many thanks in advance for any help you can provide. Apologies if this is not the right forum in which to ask this question.

Yours faithfully,
I.S.M.E.T.A.

From doug at ewellic.org Tue Nov 1 09:46:15 2016
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 01 Nov 2016 07:46:15 -0700
Subject: U+1FBD GREEK KORONIS: ᾽
Message-ID: <20161101074615.665a7a7059d7ee80bb4d670165c8327d.08018b0db2.wbe@email03.godaddy.com>

IS META wrote:

> Can anyone tell me what the intended use(s) of the character ᾽ (U+1FBD
> GREEK KORONIS) is/are, please? Or, failing that, where I can find out?

Section 7.2, "Greek" in TUS 9.0 says:

> Greek Extended: U+1F00–U+1FFF
> [...]
> Spacing Diacritics. Sixteen additional spacing diacritical marks are
> provided in this character block for use in the representation of
> polytonic Greek texts. Each has an alternative representation for use
> with systems that support nonspacing marks. The nonspacing
> alternatives appear in Table 7-3. The spacing forms are meant for
> keyboards and pedagogical use and are not to be used in the
> representation of titlecase words. The compatibility decompositions of
> these spacing forms consist of the sequence U+0020 SPACE followed by
> the nonspacing form equivalents shown in Table 7-3.
Source: http://www.unicode.org/versions/Unicode9.0.0/ch07.pdf

--
Doug Ewell | Thornton, CO, US | ewellic.org

From jtauber at jtauber.com Tue Nov 1 09:51:07 2016
From: jtauber at jtauber.com (James Tauber)
Date: Tue, 1 Nov 2016 10:51:07 -0400
Subject: Re: U+1FBD GREEK KORONIS: ᾽
In-Reply-To:
References:
Message-ID:

The koronis (often latinized as coronis) is a diacritic used in Ancient Greek texts (although later, not at the time they were written).

It's written over a vowel to indicate contraction by crasis.

See https://en.wikipedia.org/wiki/Crasis#Greek

James

--
James Tauber
http://jtauber.com/
@jtauber on Twitter

From doug at ewellic.org Tue Nov 1 11:05:49 2016
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 01 Nov 2016 09:05:49 -0700
Subject: U+1FBD GREEK KORONIS: ᾽
Message-ID: <20161101090549.665a7a7059d7ee80bb4d670165c8327d.50b0e7457d.wbe@email03.godaddy.com>

James Tauber wrote:

> The koronis (often latinized as coronis) is a diacritic used in
> Ancient Greek texts (although later, not at the time they were
> written).
>
> It's written over a vowel to indicate contraction by crasis.

Sorry, I thought the OP was asking about the use of the specific character at U+1FBD, not about the koronis generally (which should normally be coded as U+0343). You are correct about the function of the koronis in Greek. Apologies if my answer was misleading.
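For readers who want to check how the two koronis code points mentioned in this thread relate to each other, here is a minimal Python sketch using the standard `unicodedata` module. The constant and function names are illustrative assumptions, not anything taken from the messages. It shows the compatibility decomposition of the spacing form that the TUS excerpt describes (SPACE followed by a nonspacing mark), and what normalization does with U+0343.

```python
import unicodedata

SPACING_KORONIS = "\u1fbd"    # U+1FBD GREEK KORONIS (spacing form)
COMBINING_KORONIS = "\u0343"  # U+0343 COMBINING GREEK KORONIS

def describe(text):
    """Return a list of 'U+XXXX NAME' strings, one per code point."""
    return ["U+%04X %s" % (ord(ch), unicodedata.name(ch)) for ch in text]

# Compatibility decomposition of the spacing form: SPACE + nonspacing mark.
compat = unicodedata.normalize("NFKD", SPACING_KORONIS)

# Canonical normalization leaves the spacing form alone ...
canon_spacing = unicodedata.normalize("NFC", SPACING_KORONIS)

# ... but replaces U+0343 with its canonical equivalent, U+0313.
canon_combining = unicodedata.normalize("NFC", COMBINING_KORONIS)

for label, value in [("NFKD(U+1FBD)", compat),
                     ("NFC(U+1FBD)", canon_spacing),
                     ("NFC(U+0343)", canon_combining)]:
    print(label, describe(value))
```

One consequence worth noting: text that has been through NFC or NFD will carry U+0313 COMBINING COMMA ABOVE rather than U+0343, since the two are canonically equivalent.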
--
Doug Ewell | Thornton, CO, US | ewellic.org

From jtauber at jtauber.com Tue Nov 1 11:13:20 2016
From: jtauber at jtauber.com (James Tauber)
Date: Tue, 1 Nov 2016 12:13:20 -0400
Subject: Re: U+1FBD GREEK KORONIS: ᾽
In-Reply-To: <20161101090549.665a7a7059d7ee80bb4d670165c8327d.50b0e7457d.wbe@email03.godaddy.com>
References: <20161101090549.665a7a7059d7ee80bb4d670165c8327d.50b0e7457d.wbe@email03.godaddy.com>
Message-ID:

On Tue, Nov 1, 2016 at 12:05 PM, Doug Ewell wrote:

> James Tauber wrote:
>
> > The koronis (often latinized as coronis) is a diacritic used in
> > Ancient Greek texts (although later, not at the time they were
> > written).
> >
> > It's written over a vowel to indicate contraction by crasis.
>
> Sorry, I thought the OP was asking about the use of the specific
> character at U+1FBD, not about the koronis generally (which should
> normally be coded as U+0343). You are correct about the function of the
> koronis in Greek. Apologies if my answer was misleading.

I wasn't sure so I thought I'd complement your answer to cover all bases :-)

From mats.gbproject at gmail.com Wed Nov 2 19:05:13 2016
From: mats.gbproject at gmail.com (Mats Blakstad)
Date: Thu, 3 Nov 2016 01:05:13 +0100
Subject: Possible to add new precomposed characters for local language in Togo?
In-Reply-To: <1368697159.3867.1456389325215.JavaMail.www@wwinf1p04>
References: <20160223102509.665a7a7059d7ee80bb4d670165c8327d.2a091675e5.wbe@email03.secureserver.net> <1368697159.3867.1456389325215.JavaMail.www@wwinf1p04>
Message-ID:

After managing to add the keyboard to XKB, I started on a new venture of trying to make a Windows version of the keyboard using this:
https://msdn.microsoft.com/en-us/globalization/keyboardlayouts.aspx

It is nearly impossible to replicate, as it seems like you can only add dead keys if they have a precomposed character.
Also, in Togo double tone marks like these are used:

"Ɛ̃́" LATIN CAPITAL LETTER EPSILON WITH TILDE AND ACUTE
"Ɛ̃̀" LATIN CAPITAL LETTER EPSILON WITH TILDE AND GRAVE

And Windows does not even allow dead keys with double symbols...

So I wonder if a precomposed double tone could be a solution?
So one Unicode character for tilde+acute and another for tilde+grave?

The only way we have managed to make the keyboard so far is to add all the tones after the letters instead of before the letters.
I think in fact it seems easier than on the French keyboard, but it will also break the French keyboard when it comes to the order in which you press keys to add tones.
I also think it would be a benefit to have the keyboard on Windows and Ubuntu work mostly the same.

Not sure if there are any other good ideas for how to solve it?

On 25 February 2016 at 09:35, Marcel Schneider wrote:

> On Tue, 23 Feb 2016 12:10:51 +0100, Philippe Verdy wrote:
>
> > 2016-02-23 11:21 GMT+01:00 Marcel Schneider :
> >
> > > I feel that people coming from, or studying languages of, countries and
> > > communities on other continents should become able to type their language
> > > in that script on any computer in France as well as in any other Latin
> > > script using countries,
> […]
> > > The only difference
> > > between keyboard layouts of Latin script using countries should be varying
> > > accessibility depending on frequencies of use.
> >
> > There will remain a resistance for the base layout of letters (basically
> > QWERTY vs.
> > AZERTY vs QWERTZ) and basic punctuation
> > For all other characters (including shifted or non-shifted digits, because
> > this is only an issue on mechanical keyboards, not touch-on-screen
> > keyboards, and mechanical keyboards almost always have a numeric keypad
> > anyway), people can adapt easily, provided that the less frequent but
> > essential punctuation (parentheses, apostrophe, hyphen) can be found on the
> > key labels, as well as the location of dead keys for all the essential
> > diacritics.
> >
> > Indeed, if there's a new standard for French, there will be new physical
> > keyboards placing the labels correctly for the essential punctuation, plus
> > the essential letters combined with diacritics with a single keystroke:
> > but the latter letters are language-dependent and not script-dependent, so
> > people writing in other languages for the same script may not find them
> > useful, but should be able to locate the dead keys to get the full coverage
> > they need. If a standard is adopted, the set of essential letters combined
> > with diacritics should be located on a small part of the keyboard that is
> > the same across all languages of the script, but tuned specifically for a
> > language (or a few languages of one country).
> > There will remain keyboard layouts per country differing only on those
> > locations in this small part, probably reduced to only 5 language-dependent
> > keys (only designed for ease of access, e.g. "?????" in French are very
> > frequent and will be located in that part, but Italians would like to have
> > all vowels with acute, Spanish will want to have the "?" in this part).
>
> On Tue, 23 Feb 2016 10:25:09 -0700, Doug Ewell replied:
>
> > Philippe Verdy wrote:
> >
> > > There will remain a resistance for the base layout of letters
> > > (basically QWERTY vs. AZERTY vs QWERTZ) and basic punctuation
> >
> > Philippe is absolutely right here. Most of us on this list are
> > character-set and i18n wonks, and some of us have customized our own
> > keyboard layouts, but we should not delude ourselves into thinking we
> > represent ordinary users. Many people are emotionally tied to a
> > particular keyboard layout and become very confused when faced with
> > something different. Trying to persuade them to adopt a "universal"
> > keyboard, so they can type characters in a language they may not know,
> > is an exercise in social frustration.
>
> On Wed, 24 Feb 2016 01:38:59 +0100, Philippe Verdy replied:
>
> > And this has long been demonstrated by the experience of alternate
> > "ergonomic" layouts, used by very few people.
> >
> […]
> >
> > We'll continue to live for long with the 3 basic layouts for Latin (QWERTY,
> > AZERTY, QWERTZ). And nothing will really change without a strong national
> > standard that will convince manufacturers to propose it at normal prices,
> > and force software vendors to include it in the builtin layouts for their
> > OSes.
>
> When I wrote: "The only difference […] should be […]", I swapped over into
> an ideal world, let alone that the historic swap from QWERTY to AZERTY was
> triggered by an "accessibility" issue based "on frequencies of use". My
> purpose being not to *enforce* ergonomics as about the alphabetical layout,
> I fully agree with Mats Blakstad, whose "method of extending the main
> layout is likely to be the only useful one" as I wrote in the same
> e-mail, and with Doug Ewell and Philippe Verdy, whose valuable contributions
> came on to sustain.
>
> All parts of the Latin script as provided by Unicode, that are not used to
> write local and national languages e.g. of Togo, or of France, may be
> hidden as on keytops, but accessible on software side, i.e. in the layout
> driver or in the configuration files. One other challenge in Togo would be
> how to give easy access to the seven supplemental letters ?, ?, ?, ?, ?, ?
> and ?, while the five French precomposed letters are to be maintained, let
> alone ? and ? (the latter being rather seldom in French however) that are
> part of the new governmental requirements in France, among other characters
> like the angle quotation marks, called guillemets-chevrons[1].
>
> Generally speaking, I can't help believing that providing the ability to type
> any Latin script using language on any Latin keyboard would be a good idea.
> Again, that is feasible without overloading the keyboard with dead keys,
> just providing the most frequently used ones, six in Togo as I can see.
>
> Marcel
>
> [1] Vers une norme française pour les claviers informatiques - Langue
> française et langues de France - Ministère de la Culture et de la
> Communication. (2016, January 15). Retrieved January 19, 2016, from
> http://www.culturecommunication.gouv.fr/Politiques-ministerielles/Langue-francaise-et-langues-de-France/Politiques-de-la-langue/Langues-et-numerique/Les-technologies-de-la-langue-et-la-normalisation/Vers-une-norme-francaise-pour-les-claviers-informatiques

From verdy_p at wanadoo.fr Wed Nov 2 19:27:32 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 3 Nov 2016 01:27:32 +0100
Subject: Possible to add new precomposed characters for local language in Togo?
In-Reply-To:
References: <20160223102509.665a7a7059d7ee80bb4d670165c8327d.2a091675e5.wbe@email03.secureserver.net> <1368697159.3867.1456389325215.JavaMail.www@wwinf1p04>
Message-ID:

My opinion is that MSKLC should be updated to support chained dead keys (internally they are supported by the OS), using more keyboard maps. This way we could enter diacritics in any order and still, when typing the base letter, the result would be the whole combination of characters in NFC form...
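The idea of chained dead keys producing NFC output can be illustrated with a small Python sketch. This is only a model of the behaviour described above, under stated assumptions: the key labels in `DEAD_KEYS` and the function name `type_keys` are hypothetical, and nothing here reflects how Windows or MSKLC actually implement dead keys. Dead keys are collected until a base letter arrives, and the base letter plus the pending marks is then normalized to NFC.

```python
import unicodedata

# Hypothetical dead-key labels mapped to combining marks (an assumption
# made for illustration, not a real layout definition).
DEAD_KEYS = {
    "~": "\u0303",  # COMBINING TILDE
    "'": "\u0301",  # COMBINING ACUTE ACCENT
    "`": "\u0300",  # COMBINING GRAVE ACCENT
}

def type_keys(keys):
    """Simulate typing: dead keys accumulate, a base letter flushes them."""
    output = []
    pending = []  # combining marks waiting for a base letter
    for key in keys:
        if key in DEAD_KEYS:
            pending.append(DEAD_KEYS[key])
        else:
            # Base letter first, then the marks in the order they were typed,
            # then compose whatever Unicode has a precomposed form for.
            output.append(unicodedata.normalize("NFC", key + "".join(pending)))
            pending = []
    return "".join(output)

single = type_keys(["'", "e"])             # one dead key, precomposed result
chained = type_keys(["~", "'", "o"])       # two dead keys, precomposed result
epsilon = type_keys(["~", "'", "\u0190"])  # two dead keys, no precomposed form
print([hex(ord(c)) for c in single])
print([hex(ord(c)) for c in chained])
print([hex(ord(c)) for c in epsilon])
```

In this sketch the first two inputs collapse to a single code point, while the epsilon case stays as three code points because Unicode has no precomposed character for it.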
I don't think it is a good reason for encoding the double diacritic itself, only because of a limitation of MSKLC (but the Windows keymap compiler supports chained dead keys, it's only the visual editor that does not allow it)!

From moyogo at gmail.com Thu Nov 3 01:36:26 2016
From: moyogo at gmail.com (Denis Jacquerye)
Date: Thu, 03 Nov 2016 06:36:26 +0000
Subject: Possible to add new precomposed characters for local language in Togo?
In-Reply-To:
References: <20160223102509.665a7a7059d7ee80bb4d670165c8327d.2a091675e5.wbe@email03.secureserver.net> <1368697159.3867.1456389325215.JavaMail.www@wwinf1p04>
Message-ID:

2016-11-03 1:05 GMT+01:00 Mats Blakstad :

> So I wonder if it could be a solution for a precomposed double tone?
> So one unicode for tilde+acute and another for tilde+grave?
>
> The only way we manage to make the keyboard now is to add all the tones
> behind the letters instead of before the letters.
> I think in fact it seems easier than on French keyboard, but it will also
> break the French keyboard when it comes to what order you click buttons to
> add tones.
> I also think it would be a benefit to have the keyboard on windows and
> Ubuntu work mostly the same.
>
> Not sure if there are any other good ideas for how to solve it?

Don't use dead keys on the keyboard layout, then you can have the same keyboard on Windows and Ubuntu.
Even if MSKLC could handle outputting multiple characters, why are dead keys a requirement?

Shouldn't you already have broken the French layout by reassigning keys to Togo language letters ?, ?, ?, ?, ?, ?, ??
If not, it sounds like it will slow down typing in those languages.

From moyogo at gmail.com Thu Nov 3 01:45:51 2016
From: moyogo at gmail.com (Denis Jacquerye)
Date: Thu, 03 Nov 2016 06:45:51 +0000
Subject: Possible to add new precomposed characters for local language in Togo?
In-Reply-To:
References: <20160223102509.665a7a7059d7ee80bb4d670165c8327d.2a091675e5.wbe@email03.secureserver.net> <1368697159.3867.1456389325215.JavaMail.www@wwinf1p04>
Message-ID:

You can also do dead keys in reverse where, instead of having the diacritic key as a dead key that one presses before a letter key, you have the letter key as a dead key that you press before the diacritic key.
That way, your key order is the same whether a system handles outputting multiple characters or not, and you can use precomposed characters when available if that is a requirement.

From charupdate at orange.fr Thu Nov 3 02:56:39 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Thu, 3 Nov 2016 08:56:39 +0100 (CET)
Subject: Possible to add new precomposed characters for local language in Togo?
In-Reply-To:
References: <20160223102509.665a7a7059d7ee80bb4d670165c8327d.2a091675e5.wbe@email03.secureserver.net> <1368697159.3867.1456389325215.JavaMail.www@wwinf1p04>
Message-ID: <935090593.1415.1478159799533.JavaMail.www@wwinf1j20>

On Thu, 3 Nov 2016 01:05:13 +0100, Mats Blakstad wrote:

> After managing to add the keyboard to XKB I started on a new venture of
> trying to make a windows version of the keyboard using this:
> https://msdn.microsoft.com/en-us/globalization/keyboardlayouts.aspx
>
> It is nearly impossible to replicate as it seems like you can only add dead
> keys if they have a precomposed character.

This Windows limitation is indeed a significant drawback. You may wish to browse the archive back and forth starting from here:
http://www.unicode.org/mail-arch/unicode-ml/y2010-m01/0040.html

> Also, in Togo it is used double tones like these:
>
> "Ɛ̃́" LATIN CAPITAL LETTER EPSILON WITH TILDE AND ACUTE
> "Ɛ̃̀" LATIN CAPITAL LETTER EPSILON WITH TILDE AND GRAVE
>
> And windows do not even allow dead keys with double symbols...

I second Philippe Verdy's reply. Serial dead keys are a Windows feature, and implementing them is feasible around MSKLC although not in the GUI, as its developer Michael Kaplan explained in a blog post that Doug Ewell shared in:
http://www.unicode.org/mail-arch/unicode-ml/y2016-m10/0214.html

I am currently localizing into English an interactive, self-explanatory batch script to facilitate generating the sources and layout drivers. It will soon be available as a free download here:
http://charupdate.info#drivers

Even the EULA issue is settled, as you may read there.

Further, I recommend programming the deadtrans list in C, because this has the advantage of working on a flat list, while in the .klc source it is grouped.

> So I wonder if it could be a solution for a precomposed double tone?
> So one unicode for tilde+acute and another for tilde+grave?
>
> The only way we manage to make the keyboard now is to add all the tones
> behind the letters instead of before the letters.
> I think in fact it seems easier than on French keyboard, but it will also
> break the French keyboard when it comes to what order you click buttons to
> add tones.
> I also think it would be a benefit to have the keyboard on windows and
> Ubuntu work mostly the same.
>
> Not sure if there are any other good ideas for how to solve it?

In addition to Denis Jacquerye's replies, I would mention again a piece of software that I believe is the best fit to get what you need on Windows: Keyman.
Keyman is now a part of SIL and is being made available for free.
http://keyman.com/

Best regards,

Marcel

> On 25 February 2016 at 09:35, Marcel Schneider wrote:
> […]

From mats.gbproject at gmail.com Thu Nov 3 10:01:56 2016
From: mats.gbproject at gmail.com (Mats Blakstad)
Date: Thu, 3 Nov 2016 16:01:56 +0100
Subject: Possible to add new precomposed characters for local language in Togo?
In-Reply-To: <935090593.1415.1478159799533.JavaMail.www@wwinf1j20>
References: <20160223102509.665a7a7059d7ee80bb4d670165c8327d.2a091675e5.wbe@email03.secureserver.net> <1368697159.3867.1456389325215.JavaMail.www@wwinf1p04> <935090593.1415.1478159799533.JavaMail.www@wwinf1j20>
Message-ID:

> Don't use dead keys on the keyboard layout, then you can have the same
> keyboard on Windows and Ubuntu.

As we are trying to keep the French keyboard 1:1 and only extend it with extra functionality, I guess we need to keep the dead keys already present there?

> Shouldn't you already have broken the French layout by reassigning keys to
> Togo language letters ?, ?, ?, ?, ?, ?, ??
> If not, it sounds like it will slow down typing in those languages.

No, in XKB we managed to keep the French keyboard 1:1 and only extend it with extra symbols. We can't reassign keys, as the local languages in Togo also use all the letters of the French alphabet.
Besides, they mostly use the French keyboard; it will be a lot easier and faster if they can just get extended keys on a keyboard they already know.

> You can also do dead keys in reverse where, instead of having the diacritic
> key as a dead key that one pressed before a letter key, you have the letter
> key as a dead key that you press before the diacritic key.

I managed to make such a solution, but then the keyboard is no longer 1:1 with the French keyboard, because users cannot use it exactly as they are used to using the French keyboard.

What I am trying to achieve is to keep the French keyboard unchanged, extend it with symbols for Togolese local languages, and keep the assignment of diacritics consistent with that of the French keyboard.

> Windows keymap compiler supports chained dead keys, it's only the visual
> editor that does not allow it
> Serial dead keys are a Windows feature, and implementing them is feasible
> around MSKLC although not in the GUI

Is there any framework other than MSKLC that is simple and easy to use? Or do we need to build from scratch?

> http://charupdate.info#drivers
> Further I recommend to program the deadtrans list in C because this has the
> advantage of working on a flat list, while in the .klc source it is grouped.
> http://keyman.com/

Thanks for these great leads! I guess Keyman would require the user to install extra software? And the charupdate download is not available. To me it now seems like the best approach is to do it in C; I will try to investigate this further.

Thanks for all the helpful feedback!
From doug at ewellic.org Thu Nov 3 15:44:12 2016
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 03 Nov 2016 13:44:12 -0700
Subject: Possible to add new precomposed characters for local language in Togo?
Message-ID: <20161103134412.665a7a7059d7ee80bb4d670165c8327d.0e1cab67b3.wbe@email03.godaddy.com>

I think we are talking about two different issues here. It's important to keep these separate, to avoid talking past each other.

Mats Blakstad wrote:

> After managing to add the keyboard to XKB I started on a new venture
> of trying to make a windows version of the keyboard using this:
>
> [link to Microsoft Keyboard Layout Creator]
>
> It is nearly impossible to replicate as it seems like you can only add
> dead keys if they have a precomposed character.

Mats is talking about the fact that a dead key combination (of any length) under Windows can generate only a single UTF-16 code unit. This is a Windows architectural limitation, and cannot be fixed by updating MSKLC. It can only be circumvented by using Keyman or another third-party solution that runs at a layer above the Windows architecture.
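To make the limit described above concrete, here is a minimal Python sketch. It only models the "single UTF-16 code unit" rule as stated in this message; the function names are illustrative assumptions, and nothing here calls into or reproduces actual Windows internals. For a base letter and a list of combining marks, it computes the NFC form and counts how many UTF-16 code units that form needs.

```python
import unicodedata

TILDE = "\u0303"  # COMBINING TILDE
ACUTE = "\u0301"  # COMBINING ACUTE ACCENT

def utf16_units(text):
    """Number of UTF-16 code units needed to encode text."""
    return len(text.encode("utf-16-le")) // 2

def dead_key_result(base, *marks):
    """Return (NFC string, True if it fits in one UTF-16 code unit)."""
    composed = unicodedata.normalize("NFC", base + "".join(marks))
    return composed, utf16_units(composed) == 1

# e + acute has a precomposed form, so it fits the single-unit model.
e_acute, e_fits = dead_key_result("e", ACUTE)

# Capital epsilon + tilde + acute has no precomposed form.
eps, eps_fits = dead_key_result("\u0190", TILDE, ACUTE)

print(utf16_units(e_acute), e_fits)
print(utf16_units(eps), eps_fits)
```

Under this model the first combination is representable by a dead key and the second is not, which is the situation Mats describes for the Togolese double tone marks.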
Philippe Verdy wrote: > My opinion is that MSKLC should be updated to support chained dead > keys (internally they are supported by the OS), using more keyboard > maps. The fact that MSKLC does not support chained dead keys is perhaps related to the problem Mats is experiencing, but it is a different issue. Even if MSKLC were updated to allow chaining of dead keys, Mats still could not use this capability to type a TILDE dead key, then an ACUTE dead key, and then an EPSILON key and get "Ɛ̃́" LATIN CAPITAL LETTER EPSILON WITH TILDE AND ACUTE. The reason, as Mats said, is that the NFC form of this double-accented letter is still 3 code units in length, 2 more than the Windows architecture supports. Furthermore, even though many of us would like for MSKLC to be updated, the reality is that its developer (Michael Kaplan) is no longer with us, and Microsoft had already terminated MSKLC development (a source of frequent frustration to Michael). We can all wish that Microsoft would reverse itself and start devoting resources to this project of Michael's, but it's probably not going to happen. A more realistic course of action might be for someone outside of Microsoft, maybe someone on this list, to create their own GUI wrapper around the Microsoft engine, a "new MSKLC" so to speak. That new project could remove the MSKLC limitation, but not the Windows one. Mats wrote: > So I wonder if it could be a solution for a precomposed double tone? > So one unicode for tilde+acute and another for tilde+grave? If Unicode policy is what it used to be, then Philippe is correct: vendor limits are not an adequate justification for encoding double diacritics. Doing so would introduce new ambiguities, just like encoding new precomposed versions of characters that already have decomposed representations. Denis Jacquerye suggested using the letter as the dead key instead of the diacritic.
Perhaps a more straightforward approach would be to give the diacritical marks their own normal keys, so the user could type EPSILON, (combining) TILDE, (combining) ACUTE. Marcel Schneider's suggestion of using Keyman instead might be the best, if it is mandatory for the Windows version of this layout to be identical to the Ubuntu version, for reasons I don't understand (many keyboard layouts are already not constant across platforms). -- Doug Ewell | Thornton, CO, US | ewellic.org From charupdate at orange.fr Thu Nov 3 15:44:56 2016 From: charupdate at orange.fr (Marcel Schneider) Date: Thu, 3 Nov 2016 21:44:56 +0100 (CET) Subject: Possible to add new precomposed characters for local language in Togo? In-Reply-To: References: <20160223102509.665a7a7059d7ee80bb4d670165c8327d.2a091675e5.wbe@email03.secureserver.net> <1368697159.3867.1456389325215.JavaMail.www@wwinf1p04> <935090593.1415.1478159799533.JavaMail.www@wwinf1j20> Message-ID: <2090428085.16301.1478205896676.JavaMail.www@wwinf1e27> On 3 Nov 2016 16:01:56 +0100, Mats Blakstad wrote: > > Don't use dead keys on the keyboard layout, then you can have the same > > keyboard on Windows and Ubuntu. > > As we try to keep the French keyboard 1:1 There are many. A standard is now being written that will be subject to public enquiry from December through February 2017. Base letters remain unchanged. > and only extend it with extra functionalities, > I guess we need to keep the dead keys already present there? E.g. with combining diacritics by [dead key] followed by [space bar], as on: http://uscustom.sourceforge.net/ […] > > > Windows keymap compiler supports chained dead keys, it's only the visual > > editor that does not allow it > > Serial dead keys are a Windows feature, and implementing them is feasible > > around MSKLC although not in the GUI > > Are there any other framework than MSKLC that is simple and easy to use? I know people who use KbdEdit and like it, but it still has extra limitations.
http://www.kbdedit.com > Or do we need to build from scratch? No, KbdUTool generates the C sources from any KLC file, that MSKLC generates from any keyboard layout that ships with Windows except the Canadian Standard Keyboard, because this uses a modifier (0x08) that is unsupported in MSKLC. > > > http://charupdate.info#drivers > > Further I recommend to program the deadtrans list in C because this has > > the advantage of working on a flat list, while in the .klc source it is > > grouped. > > http://keyman.com/ > > Thanks for these great leads! I guess keyman will make it dependent for the > user to install extra softwares? Yes, but IMHO installing custom keyboard layout drivers on Windows is not essentially different from installing extra software. However if it is to be shipped with Windows and distributed through Windows Update, Windows limitations apply, i.e. only one code unit by dead keys. In this case, even high surrogates must be entered separately (a not very intuitive workaround). > And the charupdate is not available. Now it is, though a huge part is still in French. My apologies. > To me now it seems like the best approach to do it in C, I will try > investigate more on this. > > Thanks for all the helpful feedbacks! You are welcome. > > On 3 November 2016 at 08:56, Marcel Schneider wrote: > [?] From verdy_p at wanadoo.fr Thu Nov 3 17:56:13 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 3 Nov 2016 23:56:13 +0100 Subject: Possible to add new precomposed characters for local language in Togo? In-Reply-To: <20161103134412.665a7a7059d7ee80bb4d670165c8327d.0e1cab67b3.wbe@email03.godaddy.com> References: <20161103134412.665a7a7059d7ee80bb4d670165c8327d.0e1cab67b3.wbe@email03.godaddy.com> Message-ID: 2016-11-03 21:44 GMT+01:00 Doug Ewell : > I think we are talking about two different issues here. It's important > to keep these separate, to avoid talking past each other. 
> > Mats Blakstad wrote: > > > After managing to add the keyboard to XKB I started on a new venture > > of trying to make a windows version of the keyboard using this: > > > > [link to Microsoft Keyboard Layout Creator] > > > > It is nearly impossible to replicate as it seems like you can only add > > dead keys if they have a precomposed character. > > Mats is talking about the fact that a dead key combination (of any > length) under Windows can generate only a single UTF-16 code unit. > > That's wrong. Windows can perfectly generate multiple code units (in fact it does it for non BMP characters, including in MSKLC!) from its KLC tables using the default system driver. Only the GUI editor MSKLC cannot use this possibility and it does not understand chained tables (note: you can perfectly assign another table index instead of a character to the combination of a dead key state and another dead key, so that you can type another key which will be mapped in the combined state; the combined state can then accept the space bar to force the output of the NFC form for SPACE+diacritic1+diacritic2, which should be, if possible, a spacing-diacritic1 followed by a combining-diacritic2, or the reverse if both diacritics have a non-zero combining class but the second one has a lower combining class than the first one). In summary MSKLC is unable to edit **visually** the combined state produced by typing two dead keys. But the .klc file is compilable and works. It is trivial to make such a transform to generate the C source of the tables and compile it to a driver; you should not need to know C/C++ to do that, and the .klc source containing the chained keys should be enough.
From doug at ewellic.org Thu Nov 3 18:24:57 2016 From: doug at ewellic.org (Doug Ewell) Date: Thu, 03 Nov 2016 16:24:57 -0700 Subject: Possible to add new precomposed characters for local language in Togo? Message-ID: <20161103162457.665a7a7059d7ee80bb4d670165c8327d.c28f73703f.wbe@email03.godaddy.com> Philippe Verdy wrote: >> Mats is talking about the fact that a dead key combination (of any >> length) under Windows can generate only a single UTF-16 code unit. > > That's wrong. Windows can perfectly generate multiple code units (in > fact it does it for non BMP characters, including in MSKLC!) from its > KLC tables using the default system driver. From a dead key combination? Can you provide an example? > Only the GUI editor MSKLC cannot use this possibility and it does not > understand chained tables (note: you can perfectly assign another > table index instead of a character to the combination of a dead key > state and another dead key, so that you can type another key which > will be mapped in the combined state; the combined state can then > accept the space bar to force the output of the NFC form for > SPACE+diacritic1+diacritic2, which should be, if possible, a > spacing-diacritic1 followed by a combining-diacritic2, or the reverse > if both diacritics have a non-zero combining class but the second one > has a lower combining class than the first one). Even if true -- and I doubt that the Windows keyboard engine knows anything about Unicode combining classes -- it doesn't solve Mats's problem. He doesn't want to generate the two diacritical marks in isolation. He could do that without dead keys. If a user types a dead key, followed by a character not listed in the dead key table, Windows gives up and outputs the characters associated with the two keys. That's not at all the same thing as what Mats wants. What Mats wants is to enter , , and have the keyboard generate .
That is the sequence of 3 output code units that the Windows architecture -- not just MSKLC -- does not support. If you disagree, please provide an example. -- Doug Ewell | Thornton, CO, US | ewellic.org From mark at kli.org Thu Nov 3 18:43:43 2016 From: mark at kli.org (Mark Shoulson) Date: Thu, 3 Nov 2016 19:43:43 -0400 Subject: The (Klingon) Empire Strikes Back Message-ID: <01275881-d53b-269d-fde9-330e7d94be37@kli.org> At the time of writing this letter it has not yet hit the UTC Document Register, but I have recently submitted a document revisiting the ever-popular issue of the encoding of Klingon "pIqaD". The reason always given why it could not be encoded was that it did not enjoy enough usage, and so I've collected a bunch of examples to demonstrate that this is not true (scans and also web pages, etc.) So the issue comes back up, and time to talk about it again. Michael Everson: I basically copied your 1997 proposal into the document, with some minor changes. I hope you don't mind. And if you don't want to be on the hook for providing the glyphs to UTC, I can do that. I think that proposal should serve as a starting-point for discussion anyway. There are some things that maybe should be different: 1. the "SYMBOL FOR EMPIRE" also known as the "MUMMIFICATION GLYPH". I don't know where the second name comes from, I don't know how important it is to encode it, and I don't know how much of a trademark headache it will cause with Paramount, as it is used pretty heavily in their imagery. Something we'll have to talk about. 2. I put in the COMMA and FULL STOP, which were not in the original proposal but were in the ConScript registry entry. The examples I have show them clearly being used. UTC may decide to unify them with existing triangular shapes, which may or may not be a good idea. 3. 
For my part, I've invented a pair of ampersands for Klingon (Klingon has two words for "and": one for joining verbs/sentences and one for joining nouns (the former goes between its "conjunctands", the latter after them)), from ligatures of the letters in question. They pretty much have NO usage, of course (and are not in the proposal), but maybe they should be presented to the community. Document is available at http://web.meson.org/downloads/pIqaDReturns.pdf Let the bickering begin! ~mark From verdy_p at wanadoo.fr Thu Nov 3 18:53:57 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 4 Nov 2016 00:53:57 +0100 Subject: Possible to add new precomposed characters for local language in Togo? In-Reply-To: <20161103162457.665a7a7059d7ee80bb4d670165c8327d.c28f73703f.wbe@email03.godaddy.com> References: <20161103162457.665a7a7059d7ee80bb4d670165c8327d.c28f73703f.wbe@email03.godaddy.com> Message-ID: 2016-11-04 0:24 GMT+01:00 Doug Ewell: > Philippe Verdy wrote: > > >> Mats is talking about the fact that a dead key combination (of any > >> length) under Windows can generate only a single UTF-16 code unit. > > > > That's wrong. Windows can perfectly generate multiple code units (in > > fact it does it for non BMP characters, including in MSKLC!) from its > > KLC tables using the default system driver. > > From a dead key combination? Can you provide an example?
> > > Only the GUI editor MSKLC cannot use this possibility and it does not > > understand chained tables (note: you can perfectly assign another > > table index instead of a character to the combination of a dead key > > state and another dead key, so that you can type another key which > > will be mapped in the combined state; the combined state can then > > accept the space bar to force the output of the NFC form for > > SPACE+diacritic1+diacritic2, which should be, if possible, a > > spacing-diacritic1 followed by a combining-diacritic2, or the reverse > > if both diacritics have a non-zero combining class but the second one > > has a lower combining class than the first one). > > Even if true -- and I doubt that the Windows keyboard engine knows > anything about Unicode combining classes -- it doesn't solve Mats's > problem. He doesn't want to generate the two diacritical marks in > isolation. He could do that without dead keys. > Windows does not have to know that: the order will be the one you have used in your keymap tables. If a user types a dead key, followed by a character not listed in the > dead key table, Windows gives up and outputs the characters associated > with the two keys. That's not at all the same thing as what Mats wants. > Windows does not do that magically: for characters missing in a table, it uses by default the position assigned to the space bar, which must be mapped in all keymaps to generate a sequence for the "isolated" dead keys, then it will reset the state to initial, and then will try to find a mapping for that character from the table for the initial state. > > What Mats wants is to enter , , and > have the keyboard generate . That is > the sequence of 3 output code units that the Windows architecture -- not > just MSKLC -- does not support. If you disagree, please provide an > example. I had perfectly understood that!
And my response was in line for this need:
Pseudo-code:
Table[Initialstate] [,] = StateDeadKey1
Table[StateDeadKey1] [,] = StateDeadKey1And2
Table[StateDeadKey1And2] [,] = NFC()
Each table entry can contain either a special value for a table index (representing the current state), or a sequence of UTF-16 code units (the number of code units depends on the table format, whose header indicates how many code units are stored, and how many modifiers are mapped or masked), or a null entry for unmapped keys. The maximum number of UTF-16 code units depends on the OS version which supports more formats (I think it is now up to 6 code units; in past versions it was 4), but there's an extra format where table entries are in fact positions in a string table, where strings have variable lengths: the string table just follows the tables of keymaps. There's actually no code at all in most keyboard drivers that don't need a special UI. Newer drivers for Windows however contain additional data with a geometric layout for touch screens. Some drivers will contain code (notably for CJK keyboards that need a UI for their IME, and for typing emojis, or to use assistive technologies based on linguistic dictionary lookups, such as "T9" input methods on smartphones/tablets/remote controls). From verdy_p at wanadoo.fr Thu Nov 3 19:06:49 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 4 Nov 2016 01:06:49 +0100 Subject: The (Klingon) Empire Strikes Back In-Reply-To: <01275881-d53b-269d-fde9-330e7d94be37@kli.org> References: <01275881-d53b-269d-fde9-330e7d94be37@kli.org> Message-ID: 2016-11-04 0:43 GMT+01:00 Mark Shoulson : > 3.
For my part, I've invented a pair of ampersands for Klingon (Klingon > has two words for "and": one for joining verbs/sentences and one for > joining nouns (the former goes between its "conjunctands", the latter after > them)), from ligatures of the letters in question. > That is not new to Klingon, and it exists also in Classical Latin: - the coordinator "et" between words, for simple cases; this translates as "and" in English... - the "-que" suffix at end of the second word which may be far after the first one (which could be in another prior sentence, or implied by the context and not given explicitly); this translates as the adverb "also" in English... I've seen that suffix abbreviated as a "q" with a tilde above, or a slanted tilde mark attached above, or a horizontal tilde crossing the leg of the q below... Sorry I can't remember the name of these abbreviation marks. From mark at kli.org Thu Nov 3 19:51:11 2016 From: mark at kli.org (Mark E. Shoulson) Date: Thu, 3 Nov 2016 20:51:11 -0400 Subject: The (Klingon) Empire Strikes Back In-Reply-To: References: <01275881-d53b-269d-fde9-330e7d94be37@kli.org> Message-ID: <43b93f7c-dcf2-9315-5c2d-cde9896ca931@kli.org> Yes, it isn't unique to Klingon, I never said it was, and who cares that Latin also has it?? We weren't talking about Latin! ~mark On 11/03/2016 08:06 PM, Philippe Verdy wrote: > 2016-11-04 0:43 GMT+01:00 Mark Shoulson: > > 3. For my part, I've invented a pair of ampersands for Klingon > (Klingon has two words for "and": one for joining verbs/sentences > and one for joining nouns (the former goes between its > "conjunctands", the latter after them)), from ligatures of the > letters in question. > > That is not new to Klingon, and it exists also in Classical Latin: > > - the coordinator "et" between words, for simple cases; this > translates as "and" in English...
> - the "-que" suffix at end of the second word which may be far after > the first one (which could be in another prior sentence, or implied by > the context and not given explicitly); this translates as the adverb > "also" in English... I've seen that suffix abbreviated as a "q" with a > tilde above, or a slanted tilde mark attached above, or a horizontal > tilde crossing the leg of the q below... Sorry I can't remember the > name of these abbreviation marks. > > From verdy_p at wanadoo.fr Thu Nov 3 22:29:40 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 4 Nov 2016 04:29:40 +0100 Subject: The (Klingon) Empire Strikes Back In-Reply-To: <43b93f7c-dcf2-9315-5c2d-cde9896ca931@kli.org> References: <01275881-d53b-269d-fde9-330e7d94be37@kli.org> <43b93f7c-dcf2-9315-5c2d-cde9896ca931@kli.org> Message-ID: Maybe, but it is still relevant: what is the purpose of these invented Klingon ampersands? Aren't they ligatures or abbreviation marks like the "-que", different from the "et" (&) ligature in Latin? We have "&" encoded only because it exists in ASCII and it is used as a distinctive isolated symbol. But why wouldn't we have the "-que" ligature encoded in Latin, but we would have two invented ligatures for Klingon? 2016-11-04 1:51 GMT+01:00 Mark E. Shoulson: > Yes, it isn't unique to Klingon, I never said it was, and who cares that > Latin also has it?? We weren't talking about Latin! > > ~mark > > > On 11/03/2016 08:06 PM, Philippe Verdy wrote: > > 2016-11-04 0:43 GMT+01:00 Mark Shoulson: > >> 3. For my part, I've invented a pair of ampersands for Klingon (Klingon >> has two words for "and": one for joining verbs/sentences and one for >> joining nouns (the former goes between its "conjunctands", the latter after >> them)), from ligatures of the letters in question.
>> > That is not new to Klingon, and it exists also in Classical Latin: > > - the coordinator "et" between words, for simple cases; this translates as > "and" in English... > - the "-que" suffix at end of the second word which may be far after the > first one (which could be in another prior sentence, or implied by the > context and not given explicitly); this translates as the adverb "also" in > English... I've seen that suffix abbreviated as a "q" with a tilde above, > or a slanted tilde mark attached above, or a horizontal tilde crossing the > leg of the q below... Sorry I can't remember the name of these abbreviation > marks. > > > > From charupdate at orange.fr Fri Nov 4 11:47:16 2016 From: charupdate at orange.fr (Marcel Schneider) Date: Fri, 4 Nov 2016 17:47:16 +0100 (CET) Subject: Possible to add new precomposed characters for local language in Togo? In-Reply-To: <20161103162457.665a7a7059d7ee80bb4d670165c8327d.c28f73703f.wbe@email03.godaddy.com> References: <20161103162457.665a7a7059d7ee80bb4d670165c8327d.c28f73703f.wbe@email03.godaddy.com> Message-ID: <130831862.11144.1478278036475.JavaMail.www@wwinf1j20> On Thu, 03 Nov 2016 13:44:12 -0700, Doug Ewell wrote: > I think we are talking about two different issues here. It's important > to keep these separate, to avoid talking past each other. Thank you for the clarification. > A more realistic course of action might be for someone outside of > Microsoft, maybe someone on this list, to create their own GUI wrapper > around the Microsoft engine, a "new MSKLC" so to speak. That new project > could remove the MSKLC limitation, but not the Windows one. From my own point of view, I can tell that creating big keyboard layouts (above 500 characters) in a GUI is really inefficient, hence the demand for an "import table" feature as expressed in the cited 2010 thread: http://www.unicode.org/mail-arch/unicode-ml/y2010-m01/0020.html # 6.
The MSKLC is OK, it provides all that is needed, gives a detailed insight into the first few shift states, and is well documented. What we can do: 1) Follow Michael's invitation to automate with a batch script, as seems to intend his cited blog post, see link on bottom of: http://www.unicode.org/mail-arch/unicode-ml/y2016-m10/0213.html 2) Share source templates (as I'm doing on http://dispoclavier.com already commented in English, but still under development); 3) Share spreadsheet folders that are automated for efficient layout table editing (allocation table, deadtrans list, ligatures table, NamesList.txt or UnicodeData.txt in a spreadsheet, for multiple purpose). > Denis Jacquerye suggested using the letter as the dead key instead of > the diacritic. Perhaps a more straightforward approach would be to give > the diacritical marks their own normal keys, so the user could type > EPSILON, (combining) TILDE, (combining) ACUTE. This is found also for Bambara on a French-layout-based Malian layout: http://www.mali-pense.net/IMG/pdf/le-clavier_francais-bambara.pdf Linked on: http://www.mali-pense.net/Ressources-pour-la-pratique-du.html On this layout, the grave and circumflex accents are duplicated as combining diacritics to be used throughout as tone marks for consistency, because rendering differences were experienced between composed and precomposed. > Marcel Schneider's suggestion of using Keyman instead might be the best, > if it is mandatory for the Windows version of this layout to be > identical to the Ubuntu version, for reasons I don't understand (many > keyboard layouts are already not constant across platforms). Yes, e.g. Apple does provide a French (France) layout that allows to write French, while Microsoft does not, although the charsets had been completed. As soon as a standard layout does exist, it should be cross-platform. So Mats Blakstad scarcely would be willing to maintain the two diverging implementations when standardization is on.
Marcel From charupdate at orange.fr Fri Nov 4 11:56:00 2016 From: charupdate at orange.fr (Marcel Schneider) Date: Fri, 4 Nov 2016 17:56:00 +0100 (CET) Subject: Possible to add new precomposed characters for local language in Togo? In-Reply-To: References: <20161103162457.665a7a7059d7ee80bb4d670165c8327d.c28f73703f.wbe@email03.godaddy.com> Message-ID: <583905832.11286.1478278560491.JavaMail.www@wwinf1j20> On Fri, 4 Nov 2016 00:53:57 +0100, Philippe Verdy wrote: > > What Mats wants is to enter , , and > > have the keyboard generate . That is > > the sequence of 3 output code units that the Windows architecture -- not > > just MSKLC -- does not support. If you disagree, please provide an > > example. > > I had perfectly understood that ! And my response was in line for this need: > > Pseudo-code: > > Table[Initialstate] [,] = StateDeadKey1 > Table[StateDeadKey1] [,] = StateDeadKey1And2 > Table[StateDeadKey1And2] [,] = NFC( deadkey1; deadkey2>) > > Each table entry can contain either a special value for a table index > (representing the current state), or a sequence of UTF-16 code units (the > number of code units depends on the table format, whose header indicates > how many code units are stored, and how many modifiers are mapped or > masked), or a null entry for unmapped keys). The maximum number of UTF-16 > code units depends on the OS version which supports more formats (I think > it is now up to 6 code units in past versions it was 4, but there's an > extra format where table entries are in fact positions in a string table, > where strings have variable lengths: the string table just follows the > tables of keymaps, there's actually no code at all in most keyboard drivers > that don't need a special UI. > [?] Does this work on Windows? Being not a programmer, I mainly ape and edit existing code, so to test this I need the exact spelling of the header and one complete line of the DEADTRANS function. 
Would you please provide a link to a source file or to a How-to page? BTW when reading your comment, I suspect there is a mix of several sections. Michael Kaplan knew that what you are claiming does not work: "Every sequence of chained dead keys must end up pointing to a single UTF-16 code point; no sequence can be created;" http://archives.miloush.net/michkap/archive/2011/04/16/10154700.html (Michael's blog post about chained dead keys, again.) Having said that, your announcement (if true) shortcuts an enormous battle and greatly improves Microsoft's relationship to Unicode support and i18n. I'm getting puzzled that this feature is being hidden instead of promoted. Finally however I'd be less surprised given these two precedents: 1) When based on MSKLC's GUI I was in the same position of being unaware of Windows support for serial dead keys, I vainly posted requests on Microsoft fora… http://answers.microsoft.com/en-us/insider/forum/insider_wintp-insider_devices/how-to-implement-multiple-deadkey-strokes/4ff38c09-b58c-490a-963e-3cc745dfb396 https://social.technet.microsoft.com/Forums/windows/en-US/e61dad3a-dbe5-4c5e-88af-7fc33cbb2e6a/multiple-deadkey-strokes-still-not-implemented-on-windows?forum=w7itproappcompat …until I found full explanations on the keyboarding page of MNA's website: http://accentuez.mon.nom.free.fr/Clavier-CreationClavier.php 2) The issue about the maximum number of code units input by a single key press. So we look forward to any supplemental information, hopefully that Windows will end up having a keyboard input framework with exactly the same performances as its challengers.
Marcel From doug at ewellic.org Fri Nov 4 12:03:42 2016 From: doug at ewellic.org (Doug Ewell) Date: Fri, 04 Nov 2016 10:03:42 -0700 Subject: Possible to add new precomposed characters for local language in Togo? Message-ID: <20161104100342.665a7a7059d7ee80bb4d670165c8327d.70e1439568.wbe@email03.godaddy.com> Philippe Verdy wrote: >>> the combined state can then >>> accept the space bar to force the output of the NFC form for >>> SPACE+diacritic1+diacritic2, which should be, if possible, a >>> spacing-diacritic1 followed by a combining-diacritic2, or the >>> reverse if both diacritics have a non-zero combining class but the >>> second one has a lower combining class than the first one). >> >> Even if true -- and I doubt that the Windows keyboard engine knows >> anything about Unicode combining classes -- it doesn't solve Mats's >> problem. He doesn't want to generate the two diacritical marks in >> isolation. He could do that without dead keys. > > Windows does not have to know that: the order will be the one you have > used in your keymap tables. Then combining classes have nothing to do with this after all, and it was misleading to mention them. >> If a user types a dead key, followed by a character not listed in the >> dead key table, Windows gives up and outputs the characters >> associated with the two keys. That's not at all the same thing as >> what Mats wants. > > Windows does not do that magically: for characters missing in a table, > it uses by default the position assigned to the space bar, which must > be mapped in all keymaps to generate a sequence for the "isolated" dead > keys, then it will reset the state to initial, and then will try to > find a mapping for that character from the table for the initial > state. Nope. Try typing , on any Windows keyboard you like. You will get 'b' followed by whatever base character is associated with the dead key. This is often apostrophe or U+00B4, but the space bar has *nothing to do with this*.
It is the code point that has the @ sign after it in the main LAYOUT table. Here is a snippet you can actually copy and paste into a KLC file to illustrate this:

LAYOUT        ;an extra '@' at the end is a dead key
//SC  VK_      Cap  0      1     2
//--  ----     ---- ----   ----  ----
28    OEM_7    0    0027@  -1    -1    // APOSTROPHE, ,
30    B        0    b      -1    -1    // LATIN SMALL LETTER B, ,
39    SPACE    0    0020   0020  -1    // SPACE, SPACE,
53    DECIMAL  0    -1     -1    -1    //

DEADKEY 0027
0061  00e1  // a -> á

> Pseudo-code:
>
> Table[Initialstate] [,] = StateDeadKey1
> Table[StateDeadKey1] [,] = StateDeadKey1And2
> Table[StateDeadKey1And2] [,] =
> NFC()

This is not an example of how it actually works, which someone else can duplicate. It is a description of how you imagine it works. The chained dead key part is fine, as I said before, but the part where NFC() adds up to two or more code units is NOT fine. You can't do that. It won't compile the way you expect, if at all. Try it and see, and send or post the *actual code* if you get it to work.

> Each table entry can contain either a special value for a table index
> (representing the current state), or a sequence of UTF-16 code units
> (the number of code units depends on the table format, whose header
> indicates how many code units are stored, and how many modifiers are
> mapped or masked), or a null entry for unmapped keys). The maximum
> number of UTF-16 code units depends on the OS version which supports
> more formats (I think it is now up to 6 code units in past versions it
> was 4, but there's an extra format where table entries are in fact
> positions in a string table, where strings have variable lengths: the
> string table just follows the tables of keymaps, there's actually no
> code at all in most keyboard drivers that don't need a special UI.

Very little of this is demonstrably true. One example is the claim that the limit of 4 UTF-16 code units was somehow increased to 6, even though Kaplan often said this had not happened.
And dead key mappings don't follow this at all; they are limited to ONE code unit. Again, if you can't demonstrate otherwise, but can only assert it, you may as well assert that the sun revolves around the earth. -- Doug Ewell | Thornton, CO, US | ewellic.org From doug at ewellic.org Fri Nov 4 12:09:54 2016 From: doug at ewellic.org (Doug Ewell) Date: Fri, 04 Nov 2016 10:09:54 -0700 Subject: Possible to add new precomposed characters for local language in Togo? Message-ID: <20161104100954.665a7a7059d7ee80bb4d670165c8327d.ca380c1ebd.wbe@email03.godaddy.com> I wrote: > You will get 'b' followed by whatever base character is associated > with the dead key. Sorry, should be "preceded by". -- Doug Ewell | Thornton, CO, US | ewellic.org From davidj_faulks at yahoo.ca Fri Nov 4 12:41:44 2016 From: davidj_faulks at yahoo.ca (David Faulks) Date: Fri, 4 Nov 2016 17:41:44 +0000 (UTC) Subject: The (Klingon) Empire Strikes Back References: <42101413.334282.1478281304520.ref@mail.yahoo.com> Message-ID: <42101413.334282.1478281304520@mail.yahoo.com> > On Thu, 11/3/16, Mark Shoulson wrote: > Subject: The (Klingon) Empire Strikes Back > At the time of writing this letter it has not yet hit the UTC > Document Register, but I have recently submitted a document > revisiting the ever-popular issue of the encoding of Klingon > "pIqaD". The reason always given why it could not be > encoded was that it did not enjoy enough usage, and so I've > collected a bunch of examples to demonstrate that this is not > true (scans and also web pages, etc.) So the issue comes > back up, and time to talk about it again. There is another issue of course, which I think could be a huge obstacle: the Trademark/Copyright issue. Paramount claims copyright over the entire Klingon language (presumably including the script). The issue has recently gone to court.
Encoding criteria for symbols (and this likely extends to letters) is against encoding them without the permission of the Copyright/Trademark holder. Is Paramount endorsing your proposal? > ~mark David Faulks From verdy_p at wanadoo.fr Fri Nov 4 13:06:07 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 4 Nov 2016 19:06:07 +0100 Subject: Possible to add new precomposed characters for local language in Togo? In-Reply-To: <20161104100342.665a7a7059d7ee80bb4d670165c8327d.70e1439568.wbe@email03.godaddy.com> References: <20161104100342.665a7a7059d7ee80bb4d670165c8327d.70e1439568.wbe@email03.godaddy.com> Message-ID: 2016-11-04 18:03 GMT+01:00 Doug Ewell : > Philippe Verdy wrote: > > >>> the combined state can then > >>> accept the space bar to force the output of the NFC form for > >>> SPACE+diacritic1+diacritic2, which should be, if possible, a > >>> spacing-diacritic1 followed by a combining-diacritic2, or the > >>> reverse if both diacritics have a non-zero combining class but the > >>> second one has a lower combining clas than the second one). > >> > >> Even if true -- and I doubt that the Windows keyboard engine knows > >> anything about Unicode combining classes -- it doesn't solve Mats's > >> problem. He doesn't want to generate the two diacritical marks in > >> isolation. He could do that without dead keys. > > > > Windows does not have to know that: the order will be the one you have > > used in your keymap tables. > > Then combining classes have nothing to do with this after all, and it > was misleading to mention them. > > >> If a user types a dead key, followed by a character not listed in the > >> dead key table, Windows gives up and outputs the characters > >> associated with the two keys. That's not at all the same thing as > >> what Mats wants. 
> >
> > Windows does not do that magically: for characters missing in a table,
> > it uses by default the position assigned to the space bar, which must
> > be mapped in all keymaps to generate a sequence for the "isolated" dead
> > keys, then it will reset the state to initial, and then will try to
> > find a mapping for that character from the table for the initial
> > state.
>
> Nope. Try typing , on any Windows keyboard you like.
> You will get 'b' followed by whatever base character is associated with
> the dead key. This is often apostrophe or U+00B4, but the
> space bar has *nothing to do with this*. It is the code point that has
> the @ sign after it in the main LAYOUT table.
>
> Here is a snippet you can actually copy and paste into a KLC file to
> illustrate this:
>
> LAYOUT       ;an extra '@' at the end is a dead key
> //SC  VK_      Cap  0      1     2
> //--  ----     ---- ----   ----  ----
> 28    OEM_7    0    0027@  -1    -1    // APOSTROPHE, ,
> 30    B        0    b      -1    -1    // LATIN SMALL LETTER B, ,
> 39    SPACE    0    0020   0020  -1    // SPACE, SPACE,
> 53    DECIMAL  0    -1     -1    -1    //
>
> DEADKEY 0027
> 0061  00e1   // a -> á
>
> > Pseudo-code:
> >
> > Table[Initialstate] [,] = StateDeadKey1
> > Table[StateDeadKey1] [,] = StateDeadKey1And2
> > Table[StateDeadKey1And2] [,] =
> > NFC()
>
> This is not an example of how it actually works, which someone else can
> duplicate. It is a description of how you imagine it works.
>

That is how MSDN documents the formats of keymap tables. Note that there are several table formats, each allowing a different number of code units. You seem to see only the basic historic format (the one used in Win16/Win9x), which stores a single code unit. There are others, and they are documented, including the fact that table entry values are of two kinds: either code units or specific values for chaining to a dead key table, plus the special NULL value that fills gaps, because table entries have a static length.
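For readers following the disagreement, the KLC snippet quoted above can be read as a small lookup. The C sketch below is only an illustration under stated assumptions, not Windows code: the row layout, the table contents and the name compose_dead_key are hypothetical. The fallback follows Doug's description as corrected earlier in the thread ('b' preceded by the dead key's own character), and each listed pair yields exactly one code unit.

```c
#include <stddef.h>
#include <stdint.h>

/* One row of a dead-key table: dead key + base character -> ONE code unit.
   Hypothetical layout, for illustration only. */
typedef struct {
    uint16_t dead;      /* code unit of the dead key, e.g. 0x0027 */
    uint16_t base;      /* code unit typed after the dead key */
    uint16_t composed;  /* the single resulting code unit */
} DeadKeyRow;

/* Mirrors "DEADKEY 0027 / 0061 00e1" from the KLC snippet. */
static const DeadKeyRow kRows[] = {
    { 0x0027, 0x0061, 0x00E1 },  /* apostrophe + a -> a with acute */
};

/* Writes the resulting code units to out (room for 2), returns the count. */
size_t compose_dead_key(uint16_t dead, uint16_t base, uint16_t out[2])
{
    for (size_t i = 0; i < sizeof kRows / sizeof kRows[0]; i++) {
        if (kRows[i].dead == dead && kRows[i].base == base) {
            out[0] = kRows[i].composed;
            return 1;
        }
    }
    /* Pair not listed: the dead key's own character, then the typed one. */
    out[0] = dead;
    out[1] = base;
    return 2;
}
```

With this table, apostrophe followed by 'a' yields the single unit U+00E1, while apostrophe followed by 'b' yields the two units U+0027 and 'b'. The space bar plays no part in either path.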
-------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Fri Nov 4 13:52:24 2016 From: doug at ewellic.org (Doug Ewell) Date: Fri, 04 Nov 2016 11:52:24 -0700 Subject: Possible to add new precomposed characters for local language in =?UTF-8?Q?Togo=3F?= Message-ID: <20161104115224.665a7a7059d7ee80bb4d670165c8327d.30cff46423.wbe@email03.godaddy.com> Philippe Verdy wrote: >> This is not an example of how it actually works, which someone else >> can duplicate. It is a description of how you imagine it works. > > It is the way it is documented in MSDN that explains the formats fo > keymap tables (you have to notice that there are several table > formats, each format allowing more or less code units. Well, gee, I'd like to look that up and see how to apply it, but you didn't supply a link. Does one exist? > You seem to only see the basic historic format (the one used in Win16/ > Win9x) that only stores a single code unit, there are others, and they > are documented, includeing the fact that the values of table entries > are two kinds: either code units, or specific values for chaining to a > dead key table, and the spacial NULL value to fill gaps, because table > entries have a static length. Where is the reference to these new formats? Where are the guidelines and specifications on how to build a Windows keyboard layout, or even a "new MSKLC," taking these new formats and tables into account? Are they available anywhere? (Don't just say "MSDN," which is big. Be specific.) -- Doug Ewell | Thornton, CO, US | ewellic.org From verdy_p at wanadoo.fr Fri Nov 4 14:44:55 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 4 Nov 2016 20:44:55 +0100 Subject: Possible to add new precomposed characters for local language in Togo? 
In-Reply-To: <20161104115224.665a7a7059d7ee80bb4d670165c8327d.30cff46423.wbe@email03.godaddy.com>
References: <20161104115224.665a7a7059d7ee80bb4d670165c8327d.30cff46423.wbe@email03.godaddy.com>
Message-ID:

Consider this source code (based on Microsoft "kbd.h", even if it is ported to ReactOS):

https://doxygen.reactos.org/d7/df4/kbd_8h_source.html

Look for the structures named with "LIGATURE".

And now look at the special entry value "WCH_LGTR"=0xF002 (i.e. a PUA code unit), which indicates that these keys are mapped using those "LIGATUREn" structures (which have arbitrary lengths in WCHAR/UTF-16 code units), instead of storing a 16-bit code unit directly.

kbd.h predefines LIGATURE1 to LIGATURE5, but longer lengths are possible (see the cbLgEntry and nLgMaxd members in the KBDTABLES structure).

The table of ligatures is linked from the pLigature member of the KBDTABLES structure, which points to the first set of LIGATURE1 mappings.

Now study more precisely how _KBDTABLES is defined and documented in MSDN...

2016-11-04 19:52 GMT+01:00 Doug Ewell :

>
> Philippe Verdy wrote:
>
> >> This is not an example of how it actually works, which someone else
> >> can duplicate. It is a description of how you imagine it works.
> >
> > It is the way it is documented in MSDN that explains the formats of
> > keymap tables (you have to notice that there are several table
> > formats, each format allowing more or less code units.
>
> Well, gee, I'd like to look that up and see how to apply it, but you
> didn't supply a link. Does one exist?

From mark at kli.org Fri Nov 4 15:17:37 2016
From: mark at kli.org (Mark E.
Shoulson) Date: Fri, 4 Nov 2016 16:17:37 -0400 Subject: The (Klingon) Empire Strikes Back In-Reply-To: <42101413.334282.1478281304520@mail.yahoo.com> References: <42101413.334282.1478281304520.ref@mail.yahoo.com> <42101413.334282.1478281304520@mail.yahoo.com> Message-ID: <27f48fcf-9ebc-8363-3b27-6540a242d375@kli.org> I know of the Axanar flap. I'm not sure that Paramount was *seriously* saying "we own everything anyone ever says or will say in this language." What they said was more "you used Klingon in your story, and Klingon is our language, therefore your story is infringing on our stuff." So while it's true they *might* make that claim, I don't know that they *have*. All of which is neither here nor there; it's something they could say. The LCS wrote an amicus brief, which is linked to from my document, by the way, arguing that very point, which the judge dismissed without prejudice on the grounds that he wasn't going to be addressing that issue (so he may not have seen it as critical to Paramount's case either). A claim as bald and universal as the way I worded it above is practically indefensible logically, intuitively, and legally (Sun invented Java, but can they claim every Java program???) At any rate, this isn't Unicode's problem. Unicode would not be creating anything in Klingon anyway! Just encoding letters used to write it. Now, those letter-shapes might (for all I know) have legal strings attached, and what's more, the word "Klingon" is definitely owned and claimed by Paramount, which might cause problems with naming the block. Really, though, that isn't what UTC should be deciding. The question is whether or not to encode pIqaD: is it a writing system that people use or have used in the past to communicate (that's the main criterion, right? Unicode is supposed to contain "all" alphabets). If there are additional issues outside of UTC's purview that raise difficulties, those will have to be heard and addressed. 
But decide to act first, *then* see what obstacles need to be overcome. ~mark On 11/04/2016 01:41 PM, David Faulks wrote: >> On Thu, 11/3/16, Mark Shoulson wrote: >> Subject: The (Klingon) Empire Strikes Back > >> At the time of writing this letter it has not yet hit the UTC >> Document Register, but I have recently submitted a document >> revisiting the ever-popular issue of the encoding of Klingon >> "pIqaD". The reason always given why it could not be >> encoded was that it did not enjoy enough usage, and so I've >> collected a bunch of examples to demonstrate that this is not >> true (scans and also web pages, etc.) So the issue comes >> back up, and time to talk about it again. > There is another issue of course, which I think could be a huge obstacle: the Trademark/Copyright issue. Paramount claims copyright over the entire Klingon language (presumably including the script). The issue has recently gone to court. Encoding criteria for symbols (and this likely extends to letters) is against encoding them without the permission of the Copyright/Trademark holder. > > Is Paramount endorsing your proposal? > > > >> ~mark > David Faulks > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Fri Nov 4 15:52:52 2016 From: doug at ewellic.org (Doug Ewell) Date: Fri, 04 Nov 2016 13:52:52 -0700 Subject: Possible to add new precomposed characters for local language in =?UTF-8?Q?Togo=3F?= Message-ID: <20161104135252.665a7a7059d7ee80bb4d670165c8327d.812c5c1fe0.wbe@email03.godaddy.com> Philippe Verdy wrote: > Consider this source code (based on Microsfot "kbd.h", even if it is > ported to ReadOS) > > https://doxygen.reactos.org/d7/df4/kbd_8h_source.html > > Look for the structures named with "LIGATURE" > > And now look at the special entry value "WCH_LGTR"=0xF002 (i.e. 
a > PUA), which indicate these keys are mapped using those "LIGATUREn" > structures (which have arbitrary lengths in WCHAR/UTF-16 code units), > instead of storing a 16-bit code unit directly. > > predefines LIGATURE1 to LIGATURE5 but longer lengths are > possible (see cbLgEntry and nLgMaxd members in the KBDTABLE structure) > > The table of ligatures in linked from the pLigature member of the > KBDTABLES structure, which points to the first set of LIGATURE1 > mappings. OK, I understand now. We are rehashing the discussion on this list from August 2015, in which Marcel claimed that the presence of these lines in kbd.h: #define TYPEDEF_LIGATURE(i) \ typedef struct _LIGATURE ## i { \ BYTE VirtualKey; \ WORD ModificationNumber; \ WCHAR wch[i]; \ } LIGATURE ## i, *PLIGATURE ## i; TYPEDEF_LIGATURE(1) TYPEDEF_LIGATURE(2) TYPEDEF_LIGATURE(3) TYPEDEF_LIGATURE(4) TYPEDEF_LIGATURE(5) was proof that some version of Windows actually supported ligatures longer than 4 code units (WCHARs). But no such proof ever materialized. There is still no documentation and no examples of any native Windows keyboard that generates more than 4 code units from one keystroke. kbd.h could declare: TYPEDEF_LIGATURE(8192) and a user could compile it, and that would have nothing to do with whether the Windows runtime could actually handle a LIGATURE structure of that size. Going beyond 4 seems like such a useful and intriguing enhancement, for some folks anyway, that if it were possible, it should be easy to find at least one example where some DDK developer has utilized it. And once again, that is not what Mats was talking about. He was talking about dead-key combinations not being able to generate more than ONE code unit. And if you go back and look at kbd.h, you will see this: typedef struct _DEADKEY { DWORD dwBoth; WCHAR wchComposed; USHORT uFlags; } DEADKEY, *PDEADKEY; typedef WCHAR *DEADKEY_LPWSTR; Notice the absence of any array of 4, 6, or 8192 WCHARs. 
Only one WCHAR can be composed from a dead-key sequence. This is why Mats was unable to create a keyboard for double-accented letters that don't map to a single BMP code point using dead keys. (Correct, Mats?) A clarification: When I said "send or post the *actual code*", I assumed you were creating KLC files and running them through kbdutool (bypassing MSKLC), as you implied yesterday, not examining C++ code from the DDK. I apologize for this unstated assumption and the confusion it caused, but I still don't see any facts to support either the claim that a single keystroke can generate more than 4 code units, or the claim that a dead key combination can generate more than 1. I'm currently trying to see if there is a Microsoft employee or business unit that can resolve these questions for us once and for all. -- Doug Ewell | Thornton, CO, US | ewellic.org From doug at ewellic.org Fri Nov 4 16:02:36 2016 From: doug at ewellic.org (Doug Ewell) Date: Fri, 04 Nov 2016 14:02:36 -0700 Subject: The (Klingon) Empire Strikes Back Message-ID: <20161104140236.665a7a7059d7ee80bb4d670165c8327d.ef5253d96e.wbe@email03.godaddy.com> Mark E. Shoulson wrote: > At any rate, this isn't Unicode's problem. Unicode would not be > creating anything in Klingon anyway! Well, to be fair, I thought IPR was the primary reason Unicode had never encoded the Apple logo either. I doubt that whether Unicode intended to use such a character themselves was a factor. (Of course, users who really wanted that character encoded are probably using ?? or ?? now.) -- Doug Ewell | Thornton, CO, US | ewellic.org From verdy_p at wanadoo.fr Fri Nov 4 17:16:30 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 4 Nov 2016 23:16:30 +0100 Subject: Possible to add new precomposed characters for local language in Togo? 
In-Reply-To: <20161104135252.665a7a7059d7ee80bb4d670165c8327d.812c5c1fe0.wbe@email03.godaddy.com>
References: <20161104135252.665a7a7059d7ee80bb4d670165c8327d.812c5c1fe0.wbe@email03.godaddy.com>
Message-ID:

2016-11-04 21:52 GMT+01:00 Doug Ewell :

> OK, I understand now. We are rehashing the discussion on this list from
> August 2015, in which Marcel claimed that the presence of these lines in
> kbd.h:
>
> #define TYPEDEF_LIGATURE(i) \
> typedef struct _LIGATURE ## i { \
>         BYTE VirtualKey; \
>         WORD ModificationNumber; \
>         WCHAR wch[i]; \
> } LIGATURE ## i, *PLIGATURE ## i;
>
>         TYPEDEF_LIGATURE(1)
>         TYPEDEF_LIGATURE(2)
>         TYPEDEF_LIGATURE(3)
>         TYPEDEF_LIGATURE(4)
>         TYPEDEF_LIGATURE(5)
>
> was proof that some version of Windows actually supported ligatures
> longer than 4 code units (WCHARs).

Why, then, does the SDK predefine a structure with 5 code units?

> But no such proof ever materialized.
>

You'll find examples in the ReactOS sources (the link I gave), which provide drivers for many more languages than the two example drivers provided with the SDK.

> And once again, that is not what Mats was talking about. He was talking
> about dead-key combinations not being able to generate more than ONE
> code unit. And if you go back and look at kbd.h, you will see this:
>
> typedef struct _DEADKEY {
>         DWORD dwBoth;
>         WCHAR wchComposed;
>         USHORT uFlags;
> } DEADKEY, *PDEADKEY;
>
> typedef WCHAR *DEADKEY_LPWSTR;
>

Here again, the support of 4 code points in structures allows binding "ligatures" in keymaps, even if their entries contain a single WCHAR, using the special value for "ligatures" (which are looked up in a separate table).

>
> Notice the absence of any array of 4, 6, or 8192 WCHARs.

You don't need one! You assign the value WCH_LGTR=0xF002 (the PUA code unit), which triggers a lookup in the "LIGATUREn" tables.

> Only one WCHAR
> can be composed from a dead-key sequence.
Wrong, you assign a WCH_LGTR and then ligature tables are used, they are not limited to just one code unit. -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Fri Nov 4 17:22:42 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 4 Nov 2016 23:22:42 +0100 Subject: Possible to add new precomposed characters for local language in Togo? In-Reply-To: References: <20161104135252.665a7a7059d7ee80bb4d670165c8327d.812c5c1fe0.wbe@email03.godaddy.com> Message-ID: Look at this example using LIGATURE3 (kbdinasa.dll : "ASSAMESE - INSCRIPT"): https://doxygen.reactos.org/da/dc5/kbdinasa_8c_source.html 2016-11-04 23:16 GMT+01:00 Philippe Verdy : > 2016-11-04 21:52 GMT+01:00 Doug Ewell : > >> OK, I understand now. We are rehashing the discussion on this list from >> August 2015, in which Marcel claimed that the presence of these lines in >> kbd.h: >> >> #define TYPEDEF_LIGATURE(i) \ >> typedef struct _LIGATURE ## i { \ >> BYTE VirtualKey; \ >> WORD ModificationNumber; \ >> WCHAR wch[i]; \ >> } LIGATURE ## i, *PLIGATURE ## i; >> >> TYPEDEF_LIGATURE(1) >> TYPEDEF_LIGATURE(2) >> TYPEDEF_LIGATURE(3) >> TYPEDEF_LIGATURE(4) >> TYPEDEF_LIGATURE(5) >> >> was proof that some version of Windows actually supported ligatures >> longer than 4 code units (WCHARs). > > > Why then the SDK predefines a structure with 5 code units ??? > > >> But no such proof ever materialized. >> > > You'll find examples in the ReactOS sources (the link I gave) that > provides drivers for many more languages than the two example drivers > provided with the SDK. > > >> And once again, that is not what Mats was talking about. He was talking >> about dead-key combinations not being able to generate more than ONE >> code unit. 
And if you go back and look at kbd.h, you will see this: >> >> typedef struct _DEADKEY { >> DWORD dwBoth; >> WCHAR wchComposed; >> USHORT uFlags; >> } DEADKEY, *PDEADKEY; >> >> typedef WCHAR *DEADKEY_LPWSTR; >> > > Here again, the support of 4 code points in structures allows binding > "ligatures" in keymaps, even if their entries contain a single WCHAR, using > the special value for "ligatures" (which are looked up in a separate table. > >> >> Notice the absence of any array of 4, 6, or 8192 WCHARs. > > > You don't need to ! you assign a value WCH_LGTR=0xF002 (the PUA code > unit), which triggers a lookup in the "LIGATUREn" tables. > > >> Only one WCHAR >> can be composed from a dead-key sequence. > > > Wrong, you assign a WCH_LGTR and then ligature tables are used, they are > not limited to just one code unit. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Fri Nov 4 17:30:48 2016 From: doug at ewellic.org (Doug Ewell) Date: Fri, 04 Nov 2016 15:30:48 -0700 Subject: Possible to add new precomposed characters for local language in =?UTF-8?Q?Togo=3F?= Message-ID: <20161104153048.665a7a7059d7ee80bb4d670165c8327d.4f17bdb7bd.wbe@email03.godaddy.com> I am seeking technical information from a Microsoft team member. Hopefully we will soon have definitive answers to replace all the controversy. -- Doug Ewell | Thornton, CO, US | ewellic.org From lang.support at gmail.com Fri Nov 4 18:17:30 2016 From: lang.support at gmail.com (Andrew Cunningham) Date: Sat, 5 Nov 2016 10:17:30 +1100 Subject: Possible to add new precomposed characters for local language in Togo? In-Reply-To: <20161104153048.665a7a7059d7ee80bb4d670165c8327d.4f17bdb7bd.wbe@email03.godaddy.com> References: <20161104153048.665a7a7059d7ee80bb4d670165c8327d.4f17bdb7bd.wbe@email03.godaddy.com> Message-ID: Thanks Doug, That would be welcome. On Saturday, 5 November 2016, Doug Ewell wrote: > I am seeking technical information from a Microsoft team member. 
> Hopefully we will soon have definitive answers to replace all the
> controversy.
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>
>

--
Andrew Cunningham
lang.support at gmail.com

From charupdate at orange.fr Fri Nov 4 22:33:00 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Sat, 5 Nov 2016 04:33:00 +0100 (CET)
Subject: Possible to add new precomposed characters for local language in Togo?
In-Reply-To: <20161104135252.665a7a7059d7ee80bb4d670165c8327d.812c5c1fe0.wbe@email03.godaddy.com>
References: <20161104135252.665a7a7059d7ee80bb4d670165c8327d.812c5c1fe0.wbe@email03.godaddy.com>
Message-ID: <64711057.25674.1478316780673.JavaMail.www@wwinf1d31>

Sorry, while trying to look up MSDN, I lost touch with the discussion and didn't notice that my information about "more than 4 code units", more precisely "16 code units", by a live key press has been questioned again. Even if primarily off-topic, it is a rather useful subject, along with the input of several code units by dead keys (which admittedly is more important).

To see the requested proof materialize, please follow these steps:

1) Open http://dispoclavier.com
2) Click the download button [Télécharger]
3) Unzip the folder
4) Browse to "DTM_Dispoclavier_v0.9.0.44\DTMD_v0.9.0.44_(installation)\
kbdfrf81 azerty déployé capitales et chiffres v0.9.0.44 installation"
5) Read the "Note"
6) Run the "setup.exe" (noticing that it has been provided by MSKLC)
7) Click the Language button in the Language bar and select "French (France)"
8) If necessary, click the Keyboard button and select "DTMD France azerty déployé capitales et chiffres"
9) Make sure to use an ISO keyboard with a key for VK_OEM_105;
or remap the left Windows key to it: if no key is already remapped, merge this:

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Keyboard Layout]
"Scancode Map"=hex:00,00,00,00,00,00,00,00,04,00,00,00,56,00,5b,e0,5b,e0,56,00,\
00,00,00,00

(When this is used with an ISO keyboard, the two keys are swapped.)

10) Press the following three keys together: Left Shift; the ISO key or, if remapped, the Left Windows key; Q (on AZERTY) or A (on QWERTY).

The expected keyboard input is: " ↑q_n’existe_pas" ["superscript small q does not exist"], preceded by a white space for quick erase by Ctrl+Backspace.

Please note that in the next version, "↑" will be replaced by "^", since "^" will be the character of the superscript dead key, while the character of the circumflex dead key is ???.

Along with this test, you may wish to look up the sources in the other main folder.

16 is the empirically determined maximum number of inserted code units.

Best regards,
Marcel

> Message of 04/11/16 21:58
> From: "Doug Ewell"
> To: verdy_p at wanadoo.fr
> Cc: "Marcel Schneider" , "Denis Jacquerye" , "Mats Blakstad" , "Unicode Mailing List"
> Subject: RE: Possible to add new precomposed characters for local language in Togo?
>
> Philippe Verdy wrote:
>
> > Consider this source code (based on Microsoft "kbd.h", even if it is
> > ported to ReactOS)
> >
> > https://doxygen.reactos.org/d7/df4/kbd_8h_source.html
> >
> > Look for the structures named with "LIGATURE"
> >
> > And now look at the special entry value "WCH_LGTR"=0xF002 (i.e. a
> > PUA), which indicate these keys are mapped using those "LIGATUREn"
> > structures (which have arbitrary lengths in WCHAR/UTF-16 code units),
> > instead of storing a 16-bit code unit directly.
> > > > predefines LIGATURE1 to LIGATURE5 but longer lengths are > > possible (see cbLgEntry and nLgMaxd members in the KBDTABLE structure) > > > > The table of ligatures in linked from the pLigature member of the > > KBDTABLES structure, which points to the first set of LIGATURE1 > > mappings. > > OK, I understand now. We are rehashing the discussion on this list from > August 2015, in which Marcel claimed that the presence of these lines in > kbd.h: > > #define TYPEDEF_LIGATURE(i) \ > typedef struct _LIGATURE ## i { \ > BYTE VirtualKey; \ > WORD ModificationNumber; \ > WCHAR wch[i]; \ > } LIGATURE ## i, *PLIGATURE ## i; > > TYPEDEF_LIGATURE(1) > TYPEDEF_LIGATURE(2) > TYPEDEF_LIGATURE(3) > TYPEDEF_LIGATURE(4) > TYPEDEF_LIGATURE(5) > > was proof that some version of Windows actually supported ligatures > longer than 4 code units (WCHARs). But no such proof ever materialized. > There is still no documentation and no examples of any native Windows > keyboard that generates more than 4 code units from one keystroke. > > kbd.h could declare: > > TYPEDEF_LIGATURE(8192) > > and a user could compile it, and that would have nothing to do with > whether the Windows runtime could actually handle a LIGATURE structure > of that size. > > Going beyond 4 seems like such a useful and intriguing enhancement, for > some folks anyway, that if it were possible, it should be easy to find > at least one example where some DDK developer has utilized it. > > And once again, that is not what Mats was talking about. He was talking > about dead-key combinations not being able to generate more than ONE > code unit. And if you go back and look at kbd.h, you will see this: > > typedef struct _DEADKEY { > DWORD dwBoth; > WCHAR wchComposed; > USHORT uFlags; > } DEADKEY, *PDEADKEY; > > typedef WCHAR *DEADKEY_LPWSTR; > > Notice the absence of any array of 4, 6, or 8192 WCHARs. Only one WCHAR > can be composed from a dead-key sequence. 
This is why Mats was unable to
> create a keyboard for double-accented letters that don't map to a single
> BMP code point using dead keys. (Correct, Mats?)
>
> A clarification: When I said "send or post the *actual code*", I assumed
> you were creating KLC files and running them through kbdutool (bypassing
> MSKLC), as you implied yesterday, not examining C++ code from the DDK. I
> apologize for this unstated assumption and the confusion it caused, but
> I still don't see any facts to support either the claim that a single
> keystroke can generate more than 4 code units, or the claim that a dead
> key combination can generate more than 1.
>
> I'm currently trying to see if there is a Microsoft employee or business
> unit that can resolve these questions for us once and for all.
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>
>

From charupdate at orange.fr Fri Nov 4 22:41:21 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Sat, 5 Nov 2016 04:41:21 +0100 (CET)
Subject: Possible to add new precomposed characters for local language in Togo?
In-Reply-To: <20161104135252.665a7a7059d7ee80bb4d670165c8327d.812c5c1fe0.wbe@email03.godaddy.com>
References: <20161104135252.665a7a7059d7ee80bb4d670165c8327d.812c5c1fe0.wbe@email03.godaddy.com>
Message-ID: <1415950225.25679.1478317281693.JavaMail.www@wwinf1d31>

I'm sorry for the typo: "VK_OEM_105" should read "VK_OEM_102". (The registry key is tested and OK.)

A few minutes ago, I wrote:

> 9) Make sure to use an ISO keyboard with a key for VK_OEM_105;
> or remap the left Windows key to it: if no key is already remapped, merge this:

> Windows Registry Editor Version 5.00

> [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Keyboard Layout]
> "Scancode Map"=hex:00,00,00,00,00,00,00,00,04,00,00,00,56,00,5b,e0,5b,e0,56,00,\
> 00,00,00,00

> (When this is used with an ISO keyboard, the two keys are swapped.)
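As a reading aid for the "Scancode Map" value quoted above, here is a small C sketch that decodes the same bytes. It assumes the commonly described layout of this registry value (two 4-byte header fields, a 4-byte count field, then 4-byte entries whose low word is the scancode delivered and whose high word is the physical key, ended by an all-zero entry). The names ScanMapping and decode_scancode_map are hypothetical, and the sketch stops at the all-zero terminator rather than relying on the count field.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint16_t sends;    /* scancode delivered to the system (low word)  */
    uint16_t pressed;  /* scancode of the physical key (high word)     */
} ScanMapping;

/* The value bytes exactly as quoted in the message above. */
static const uint8_t kScancodeMap[] = {
    0x00, 0x00, 0x00, 0x00,   /* header field 1 */
    0x00, 0x00, 0x00, 0x00,   /* header field 2 */
    0x04, 0x00, 0x00, 0x00,   /* count field, as given */
    0x56, 0x00, 0x5b, 0xe0,   /* entry 1 */
    0x5b, 0xe0, 0x56, 0x00,   /* entry 2 */
    0x00, 0x00, 0x00, 0x00    /* all-zero terminator */
};

/* Reads a little-endian 16-bit value. */
static uint16_t read_u16le(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Decodes entries after the 12-byte header; returns how many were stored. */
size_t decode_scancode_map(const uint8_t *data, size_t len,
                           ScanMapping *out, size_t max_out)
{
    size_t n = 0;
    for (size_t off = 12; off + 4 <= len && n < max_out; off += 4) {
        uint16_t lo = read_u16le(data + off);
        uint16_t hi = read_u16le(data + off + 2);
        if (lo == 0 && hi == 0)
            break;  /* terminator reached */
        out[n].sends = lo;
        out[n].pressed = hi;
        n++;
    }
    return n;
}
```

Under that reading, the two entries pair scancode 0x56 (the ISO key) with 0xE05B (Left Windows) in both directions, which agrees with the remark that the two keys are swapped on an ISO keyboard.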
From charupdate at orange.fr Sat Nov 5 11:51:21 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Sat, 5 Nov 2016 17:51:21 +0100 (CET)
Subject: Possible to add new precomposed characters for local language in Togo?
In-Reply-To:
References: <20161104135252.665a7a7059d7ee80bb4d670165c8327d.812c5c1fe0.wbe@email03.godaddy.com>
Message-ID: <1751824501.10460.1478364682456.JavaMail.www@wwinf1p03>

Sorry not to have found time sooner to look closely at the material that is claimed to support code unit sequences through dead keys. It's all about live keys, nothing about dead keys. Yet another case of talking past each other.

IMHO that happened because one simple question was not answered prior to sharing links to sources: How will the API know what line of aLigature (the ligature table) to look up, if the 0xf002 alias WCH_LGTR is not found in aVkToWch (the allocation table)?

Indeed, column 1 of the ligature table contains the virtual key, and column 2 contains the modification number, which refers to the column of the allocation table where each 0xf002 or WCH_LGTR is mapped to a key and shift state:

static ALLOC_SECTION_LDATA VK_TO_WCHARS38 aVkToWch38[] = {
// Modification_# >>>|0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33|34|35|36|37|
{'Q'/*T1E C01*/,0x01,'q','Q','#',0x2126,0x00f7,LGTR,0x0331,NONE,NONE,NONE,NONE,NONE,0x0634,'\\',0x0447,0x0427,0x0447,0x0427,'&','%',0x03c2,0x2211,'&','%',0x05e7,'*','&','%',LGTR,LGTR,LGTR,LGTR,LGTR,LGTR,LGTR,LGTR,NONE,NONE}, //
{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
};

static ALLOC_SECTION_LDATA LIGATURE16 aLigature[] = {
// |Virtual_Key|SC|ISO_#|Modif#|Char0|Char1|Char2|Char3|Char4|Char5|Char6|Char7|Char8|Char9|Char10|Char11|Char12|Char13|Char14|Char15|
{'Q'/*T1E C01*/,5,' ',0x2191,'q','_','n',0x2019,'e','x','i','s','t','e','_','p','a','s'}, // ^q doesn't exist
{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
};

This leads again to the (off-topic) concern about
some examples found on the internet. I?ll try to do some search for web pages in English, while we are looking forward to the advice that Doug Ewell kindly requested. Marcel On Fri, 4 Nov 2016 23:22:42 +0100, Philippe Verdy wrote: > Look at this example using LIGATURE3 (kbdinasa.dll : "ASSAMESE - INSCRIPT"): > > https://doxygen.reactos.org/da/dc5/kbdinasa_8c_source.html > > 2016-11-04 23:16 GMT+01:00 Philippe Verdy : > >> 2016-11-04 21:52 GMT+01:00 Doug Ewell : >>> >>> OK, I understand now. We are rehashing the discussion on this list from >>> August 2015, in which Marcel claimed that the presence of these lines in >>> kbd.h: >>> >>> #define TYPEDEF_LIGATURE(i) \ >>> typedef struct _LIGATURE ## i { \ >>> ? ? ? ? BYTE VirtualKey; \ >>> ? ? ? ? WORD ModificationNumber; \ >>> ? ? ? ? WCHAR wch[i]; \ >>> } LIGATURE ## i, *PLIGATURE ## i; >>> >>> ? ? ? ? TYPEDEF_LIGATURE(1) >>> ? ? ? ? TYPEDEF_LIGATURE(2) >>> ? ? ? ? TYPEDEF_LIGATURE(3) >>> ? ? ? ? TYPEDEF_LIGATURE(4) >>> ? ? ? ? TYPEDEF_LIGATURE(5) >>> >>> was proof that some version of Windows actually supported ligatures >>> longer than 4 code units (WCHARs). >> >> Why then the SDK predefines a structure with 5 code units ??? >> >>> But no such proof ever materialized. >> >> You'll find examples in the ReactOS ?sources (the link I gave) that provides >> drivers for many more languages than the two example drivers provided with the SDK. >> >>> And once again, that is not what Mats was talking about. He was talking >>> about dead-key combinations not being able to generate more than ONE >>> code unit. And if you go back and look at kbd.h, you will see this: >>> >>> typedef struct _DEADKEY { >>> ? ? ? ? DWORD dwBoth; >>> ? ? ? ? WCHAR wchComposed; >>> ? ? ? ? 
USHORT uFlags; >>> } DEADKEY, *PDEADKEY; >>> >>> typedef WCHAR *DEADKEY_LPWSTR; >>> >> Here again, the support of 4 code points in structures allows binding >> "ligatures" in keymaps, even if their entries contain a single WCHAR, using the >> special value for "ligatures" (which are looked up in a separate table. >> >>> Notice the absence of any array of 4, 6, or 8192 WCHARs. >> >> You don't need to ! you assign a value WCH_LGTR=0xF002 (the PUA code unit), >> which triggers a lookup in the "LIGATUREn" tables. >> >>> Only one WCHAR can be composed from a dead-key sequence. >> >> Wrong, you assign a?WCH_LGTR and then ligature tables are used, they are not >> limited to just one code unit. From verdy_p at wanadoo.fr Sat Nov 5 15:52:17 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 5 Nov 2016 21:52:17 +0100 Subject: Possible to add new precomposed characters for local language in Togo? In-Reply-To: <1751824501.10460.1478364682456.JavaMail.www@wwinf1p03> References: <20161104135252.665a7a7059d7ee80bb4d670165c8327d.812c5c1fe0.wbe@email03.godaddy.com> <1751824501.10460.1478364682456.JavaMail.www@wwinf1p03> Message-ID: 2016-11-05 17:51 GMT+01:00 Marcel Schneider : > Sorry not to have found time sooner to look close at the stuff > that is claimed to support code unit sequences through dead keys. > It?s all about live keys, none about dead keys. > Yet another case of talking past each other. > > IMHO that happened because one simple question was not answered prior to > sharing links to sources: How will the API know what line of aLigature > (the ligature table) to look up, if the 0xf002 alias WCH_LGTR is not found > in aVkToWch (the allocation table)? 
>
> Indeed, column 1 of the ligature table contains the virtual key, and
> column 2 contains the modification number, that refers to the column of
> the allocation table where each 0xf002 or WCH_LGTR is mapped to a key and
> shift state:
>
> static ALLOC_SECTION_LDATA VK_TO_WCHARS38 aVkToWch38[] = {
> // Modification_# >>>|0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33|34|35|36|37|
> {'Q'/*T1E C01*/,0x01,'q','Q','#',0x2126,0x00f7,LGTR,0x0331,NONE,NONE,NONE,NONE,NONE,0x0634,'\\',0x0447,0x0427,0x0447,0x0427,'&','%',0x03c2,0x2211,'&','%',0x05e7,'*','&','%',LGTR,LGTR,LGTR,LGTR,LGTR,LGTR,LGTR,LGTR,NONE,NONE},
> //
> {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
> };
>
> static ALLOC_SECTION_LDATA LIGATURE16 aLigature[] = {
> // |Virtual_Key|SC|ISO_#|Modif#|Char0|Char1|Char2|Char3|Char4|Char5|Char6|Char7|Char8|Char9|Char10|Char11|Char12|Char13|Char14|Char15|
> {'Q'/*T1E C01*/,5,' ',0x2191,'q','_','n',0x2019,'e','x','i','s','t','e','_','p','a','s'}, // ^q doesn't exist
> {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
> };
>

Your structures do not seem to be correctly formatted (or it is just random
data):

>>> typedef struct _LIGATURE ## i { \
>>>         BYTE VirtualKey; \
>>>         WORD ModificationNumber; \
>>>         WCHAR wch[i]; \
>>> } LIGATURE ## i, *PLIGATURE ## i;

Here you set:
VirtualKey='Q' /*T1E C01*/, ModificationNumber=5, wch[0]=' ',
wch[1]=0x2191 /*^*/, wch[2]='q', ...
wch[15]='s' for defining a very long "ligature" (the wrong term used in
kbd.h, where it should just be a "string", even if those strings have a
fixed length and are null-padded).

What it says is that VK_Q in modification number 5 (as defined in the
MODIFIERS modifier_bits table, which remaps to modification numbers the
set of virtual modifiers mapped in the VK_TO_BIT modifier_keys table)
should generate your string. That string can contain up to 16 WCHARs, but
no null chars: it is not possible to include a NULL char in a LIGATURE
table. A keyboard never has to do that anyway, as NULL chars are not
mapped in any ligature table but in isolation in a VK mapping table, with
a single WCHAR code unit directly.

Note also that the definition in the SDK:

typedef struct _KBDTABLES {
    PMODIFIERS pCharModifiers;
    PVK_TO_WCHAR_TABLE pVkToWcharTable;
    PDEADKEY pDeadKey;
    VSC_LPWSTR *pKeyNames;
    VSC_LPWSTR *pKeyNamesExt;
    LPWSTR *pKeyNamesDead;
    USHORT *pusVSCtoVK;
    BYTE bMaxVSCtoVK;
    PVSC_VK pVSCtoVK_E0;
    PVSC_VK pVSCtoVK_E1;
    DWORD fLocaleFlags;
    BYTE nLgMaxd;
    BYTE cbLgEntry;
    PLIGATURE1 pLigature;
} KBDTABLES, *PKBDTABLES;

may be misleading, for the last members:

- nLgMaxd indicates the maximum length of null-padded strings in a
  pLigature table entry, whose entry size is stored in cbLgEntry: this
  size acts as versioning info for the ligatures table format, and most
  probably it is there so that keyboard drivers compiled on another
  architecture will still be usable even if the size of a WCHAR is changed.

- but of course the type of an entry is not a LIGATURE1, but at least a
  LIGATURE2 (LIGATURE1 has no use in any table, given that 1-WCHAR strings
  will be stored directly in one of the VK_TO_WCHAR_TABLE tables); the
  LIGATURE1 is just there to allow pointer typecasts in C/C++
  independently of the LIGATURE(n) table format you need.
- Windows provably works with LIGATURE2, LIGATURE3, LIGATURE4 and
  LIGATURE5 (I've never tested if it works for longer strings or if it
  really works with a LIGATURE1 table format).

The LIGATURE(n) format also uses internal padding between members, notably
between "BYTE VirtualKey;" and "WORD ModificationNumber;": there's a hidden
alignment BYTE between them, which could be considered as additional flags
for the effective LIGATURE(n) format (C/C++ compilers are supposed to fill
these padding bytes with zeroes).

Given that WORD and WCHAR have the same 16-bit size, the whole structure is
an array of 16-bit blocks: in a LIGATURE1 there are two WORDs, so it is
also aligned on a DWORD; in a LIGATURE2, this would take 3 useful words,
but due to alignment constraints, the entry will be 4 words and
sizeof(wch[0]) will be 16, just like for a LIGATURE3; so LIGATURE2 has no
use: there will be an extra padding null WORD in the wchar array, and
that's why "cbLgEntry" is there, but this makes "nLgMaxd" completely
unneeded, except to make sure that the extra padding WCHAR in wch[] will be
discarded, even if it is not filled with zeroes, i.e. a NULL WCHAR which is
ignored anyway and acts as an early terminator.

Now comes the question about how ligatures are matched: they are looked up
in the LIGATURE(n) tables by looking only at the first two members
VirtualKey and ModificationNumber (ignoring the extra padding BYTE?) but
most probably by grouping them as a single DWORD (the LOWORD contains the
VKEY, the HIWORD contains the modifiers). The lookup is apparently linear
(there's apparently no requirement for this table to be sorted to perform a
binary search, and anyway these LIGATURE tables are generally short).
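The layout and lookup described above can be checked with a small
standalone sketch. It is not the Windows implementation: the types below
mirror the TYPEDEF_LIGATURE macro using <stdint.h> stand-ins for
BYTE/WORD/WCHAR, and find_ligature() and ligature_length() are hypothetical
helpers that assume a linear scan ending at an all-zero terminator entry.
On ABIs where a 16-bit integer is 2-byte aligned, one padding byte follows
VirtualKey and an entry occupies 4 + 2*n bytes, which may differ from the
word counts given above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the Windows types (assumption: WCHAR is 16 bits). */
typedef uint8_t  BYTE;
typedef uint16_t WORD;
typedef uint16_t WCHAR;

/* Same shape as TYPEDEF_LIGATURE(i) in kbd.h. */
#define TYPEDEF_LIGATURE(i)            \
    typedef struct _LIGATURE##i {      \
        BYTE  VirtualKey;              \
        WORD  ModificationNumber;      \
        WCHAR wch[i];                  \
    } LIGATURE##i, *PLIGATURE##i;

TYPEDEF_LIGATURE(1)
TYPEDEF_LIGATURE(2)
TYPEDEF_LIGATURE(16)

/* Sample table: VK 'Q' in modification column 5 yields a 16-unit string. */
static const LIGATURE16 aLigature[] = {
    { 'Q', 5, { ' ', 0x2191, 'q', '_', 'n', 0x2019, 'e', 'x',
                'i', 's', 't', 'e', '_', 'p', 'a', 's' } },
    { 0, 0, { 0 } }   /* terminator entry */
};

/* Hypothetical helper: linear scan for (vk, mod); returns NULL if absent. */
static const LIGATURE16 *find_ligature(const LIGATURE16 *table,
                                       BYTE vk, WORD mod)
{
    for (; table->VirtualKey != 0; ++table) {
        if (table->VirtualKey == vk && table->ModificationNumber == mod)
            return table;
    }
    return NULL;
}

/* Counts code units up to the first null; null padding ends the string. */
static size_t ligature_length(const LIGATURE16 *entry)
{
    size_t n = 0;
    assert(entry != NULL);
    while (n < 16 && entry->wch[n] != 0)
        ++n;
    return n;
}
```

For instance, find_ligature(aLigature, 'Q', 5) returns the sample entry,
while a (key, column) pair that has no ligature entry yields NULL; what
Windows itself does in that case is not established in this thread.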
If a [KEY,modifiers] pair is not found in the ligature table (even if the
VK_TO_WCHAR_TABLE says it should be there by assigning a WCH_LGTR value to
the entry for that VKEY in the modifier column number), the behavior should
probably be the same as if the entry in the VK_TO_WCHAR_TABLE contained
WCH_NONE (i.e. key not mapped), but in my opinion the table data has a bug:
it should contain WCH_NONE instead of WCH_LGTR. I think that the keyboard
compiler tool should detect this error (it should also detect the use of an
unneeded LIGATURE1 instead of mapping directly in a VK_TO_WCHAR_TABLE or in
a DEADKEY table).

---- Speculation follows about possible extensions for dead keys mapped to
"ligatures", and arbitrary-length ligatures in general mapped from
DEADKEY(n) and VK_TO_WCHAR_TABLE(n) tables ----

Note also the presence of a "flags" BYTE in entries of a DEADKEY table:
could this BYTE be used as well in the LIGATURE table entries (between
"BYTE VirtualKey;" and "WORD ModificationNumber;") when the "comp" member of
a DEADKEY entry contains a WCH_LGTR, for example to store an identifier of
the dead-key state for lookup in LIGATURE(n) tables? (This lookup would
still continue to work by grouping the members in a single DWORD instead of
comparing them individually.)

Also the "nLgMaxd" member of KBDTABLES has no real use if it just contains
2, 3, 4 or 5.
Setting its value to 0 would be better used to indicate that a LIGATURE(0)
entry no longer contains a null-padded string "WCHAR wch[]", but instead
contains a pointer to a real string with "PWSTR pwch;". ("cbLgEntry" is
still used: on a 32-bit architecture it returns 8 (2 BYTEs + 1 WORD for the
composite key, 1 DWORD for the target pointer); on a 64-bit architecture it
will return 16 (2 BYTEs + 1 WORD for the composite key, 1 DWORD of
alignment, 1 QWORD for the 64-bit pointer).) The alternative would be to
store even shorter pointers using a single DWORD of offset in a
null-terminated strings table, stored just at the end of the LIGATURE(0)
lookup table, these offsets being relative to the start of the LIGATURE(0)
table (whose pointer just has to be typecast as a WORD[] array).

From charupdate at orange.fr Sat Nov 5 21:31:58 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Sun, 6 Nov 2016 03:31:58 +0100 (CET)
Subject: Possible to add new precomposed characters for local language in Togo?
In-Reply-To: <20161104153048.665a7a7059d7ee80bb4d670165c8327d.4f17bdb7bd.wbe@email03.godaddy.com>
References: <20161104153048.665a7a7059d7ee80bb4d670165c8327d.4f17bdb7bd.wbe@email03.godaddy.com>
Message-ID: <213971739.7366.1478399518373.JavaMail.www@wwinf2212>

On Fri, 04 Nov 2016 13:52:52 -0700, Doug Ewell wrote:

> Going beyond 4 seems like such a useful and intriguing enhancement, for
> some folks anyway, that if it were possible, it should be easy to find
> at least one example where some DDK developer has utilized it.

Yes indeed, it’s finally rather easy to find:

http://accentuez.mon.nom.free.fr/Clavier-CreationClavier.php

(again) writes notably (as of the present topic; translation follows below):

| aLigature
| […]
| Le fichier kbd.h ne contient que 4 types LIGATURE2, LIGATURE3, LIGATURE4,
| LIGATURE5. Mais en réalité on n’est pas limité à
| cinq unités de code : si on a
| une touche Alt-Gr + espace qui renvoie dix unités de code, par exemple *LIGATURE*,
| on peut déclarer la table précédente comme suit :
|
| TYPEDEF_LIGATURE(10) // LIGATURE10, *PLIGATURE10;
| static ALLOC_SECTION_LDATA LIGATURE10 aLigature[] = {
| […]
| };
|
| On peut donc créer des touches renvoyant des mots, voire des phrases.
| On est toutefois limité à seize unités de code TYPEDEF_LIGATURE(16).

“The kbd.h file contains only 4 types LIGATURE2, LIGATURE3, LIGATURE4,
LIGATURE5. But in reality one is not limited to five code units: if
AltGr + space generates ten code units, e.g. as in “*LIGATURE*”, the table
above can be declared as follows:

TYPEDEF_LIGATURE(10) // LIGATURE10, *PLIGATURE10;
static ALLOC_SECTION_LDATA LIGATURE10 aLigature[] = {
[…]
};

Thus we can create keys generating words or even sentences.
However we are limited to sixteen code units: TYPEDEF_LIGATURE(16).”

As of retrieving this page, it is actually the 18th result of Bing Search
on 'keyboard layout creation', results in all languages enabled.

Marcel

From charupdate at orange.fr Sat Nov 5 21:40:51 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Sun, 6 Nov 2016 03:40:51 +0100 (CET)
Subject: Possible to add new precomposed characters for local language in Togo?
In-Reply-To:
References: <20161104135252.665a7a7059d7ee80bb4d670165c8327d.812c5c1fe0.wbe@email03.godaddy.com> <1751824501.10460.1478364682456.JavaMail.www@wwinf1p03>
Message-ID: <1931445883.7378.1478400051028.JavaMail.www@wwinf2212>

On Sat, 5 Nov 2016 21:52:17 +0100, Philippe Verdy wrote:

> Your structures do not seem to be correctly formatted (or it is just random
> data):

Maybe there are formal defects, and perhaps it is written in a non-standard
way. What I can tell at least is that on my machine it works (Windows 7
Starter). And I’m not in the habit of publishing random data as if it were
real code.
Marcel

From charupdate at orange.fr Sat Nov 5 22:11:02 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Sun, 6 Nov 2016 04:11:02 +0100 (CET)
Subject: Possible to add new precomposed characters for local language in Togo?
In-Reply-To: <20161104153048.665a7a7059d7ee80bb4d670165c8327d.4f17bdb7bd.wbe@email03.godaddy.com>
References: <20161104153048.665a7a7059d7ee80bb4d670165c8327d.4f17bdb7bd.wbe@email03.godaddy.com>
Message-ID: <1745350033.7396.1478401863009.JavaMail.www@wwinf2212>

On Fri, 04 Nov 2016 15:30:48 -0700, Doug Ewell wrote:

> I am seeking technical information from a Microsoft team member.
> Hopefully we will soon have definitive answers to replace all the
> controversy.

I’m aware that discussions sometimes have a way of going off the road, and
I experience this also on the mailing list of a keyboarding community I’m
actually very involved in. But I understand that when the layout driver
architecture of some OS impacts numerous local user communities, analyzing
code snippets on the Unicode List may sometimes end up meeting a real
demand, because at some point the discrepancy between the ongoing
development of the Unicode Standard and its implementation in the real
world is going to heavily compromise the usability and the usefulness of
the scheme.

Having said that, I’m further aware that code development is typically best
done on collaborative repositories such as GitHub, GitLab, SourceForge.
I’ve tried some of them and do have accounts. Perhaps I’ve missed
something: I don’t find the neat display and nice syntax highlighting like
on ReactOS. And above all, I’m unable to figure out efficient layout driver
development there. A big part is done in huge workbooks. This is best done
in Excel. When my workbook is up to date, I’ll be in a position to share it
in public.
Now since we are on it, be it permitted to discuss other snippets, in the
hope that Microsoft (or a programmer on this List) will find a way to make
the Windows APIs understand multiple code units by dead keys:

/*TEMPLATE   */ DEADTRANS( BASECHAR ,DEADKEY ,COMBICHAR ,DEADKEYFLAG), // UNICODE NAME

- This is how it can work without dead keys:

/*COMPOSE    */ DEADTRANS( L'\"'    ,0x00a9  ,0x0151    ,CHAIN ), // LATIN SMALL LETTER O WITH DOUBLE ACUTE
/*DOUBLE_AIGU*/ DEADTRANS( L'o'     ,0x0151  ,0x0151    ,DKF_0 ), // LATIN SMALL LETTER O WITH DOUBLE ACUTE
/*COMPOSE    */ DEADTRANS( L':'     ,0x00a9  ,0x00eb    ,CHAIN ), // LATIN SMALL LETTER E WITH DIAERESIS
/*TREMA      */ DEADTRANS( L'a'     ,0x00eb  ,0x00e4    ,DKF_0 ), // LATIN SMALL LETTER A WITH DIAERESIS

- Now the acute and tilde dead keys:

- In the allocation table:

{VK_OEM_1 /*T1B D12*/ ,0x08 ,DEAD ,DEAD /*snip*/
{0xff,0 ,/*acute:*/0x00e1 ,/*tilde:*/0x00f5 /*snip*/

- In the deadtrans list:

/*TILDE      */ DEADTRANS( 0x00e1   ,0x00f5  ,0x1e4d    ,CHAIN ), // LATIN SMALL LETTER O WITH TILDE AND ACUTE
/*TILDE&AIGU */ DEADTRANS( L'O'     ,0x1e4d  ,0x1e4c    ,DKF_0 ), // LATIN CAPITAL LETTER O WITH TILDE AND ACUTE

- And with LATIN CAPITAL LETTER OPEN E? Why not this way (as has been suggested):

/*TILDE&AIGU */ DEADTRANS( 0x0190   ,0x1e4d  ,{0x0190,0x0303,0x0301} ,DKF_0 ), // *LATIN CAPITAL LETTER OPEN E WITH TILDE AND ACUTE

Hopefully,

Marcel

From verdy_p at wanadoo.fr Sat Nov 5 23:32:23 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sun, 6 Nov 2016 05:32:23 +0100
Subject: Possible to add new precomposed characters for local language in Togo?
In-Reply-To: <1745350033.7396.1478401863009.JavaMail.www@wwinf2212>
References: <20161104153048.665a7a7059d7ee80bb4d670165c8327d.4f17bdb7bd.wbe@email03.godaddy.com> <1745350033.7396.1478401863009.JavaMail.www@wwinf2212>
Message-ID:

2016-11-06 4:11 GMT+01:00 Marcel Schneider :

> On Fri, 04 Nov 2016 15:30:48 -0700, Doug Ewell wrote:
>
> - And with LATIN CAPITAL LETTER OPEN E?
> Why not this way (as has been suggested):
> /*TILDE&AIGU */ DEADTRANS( 0x0190 ,0x1e4d ,{0x0190,0x0303,0x0301} ,DKF_0 ), // *LATIN CAPITAL LETTER OPEN E WITH TILDE AND ACUTE
>

This snippet cannot work as is, because the DEADTRANS() macro generates an
8-byte structure that has only a single WCHAR for storing the result of the
mapping of a (VKEY + modifier number):

typedef struct _DEADKEY {
    DWORD dwBoth;
    WCHAR wchComposed;
    USHORT uFlags;
} DEADKEY, *PDEADKEY;

So it will need to map a WCH_LGTR instead, and then use a "ligature" table
to store the string containing the 3 code units you want.

Then there's an unused BYTE in the DEADKEY structure for the flags, that
can be used (specifically for entries mapped to WCH_LGTR) to pass flags to
the LIGATURE(n) table (where there's also a free BYTE in the indexing key,
allowing to pass an identifier needed for the lookup in the LIGATURE(n)
table); alternatively, instead of mapping WCH_LGTR (a PUA), you could as
well map another PUA there in 0xE001..0xE0FF for passing a byte for the
dead-key state into the lookup of ligatures:

#define TYPEDEF_LIGATURE(i) \
typedef struct _LIGATURE ## i { \
    BYTE VirtualKey; \
    WORD ModificationNumber; \
    WCHAR wch[i]; \
} LIGATURE ## i, *PLIGATURE ## i;

which can safely be changed to:

typedef struct _LIGATURE ## i { \
    BYTE VirtualKey, DeadKeyState; \
    WORD ModificationNumber; \
    WCHAR wch[i]; \
} LIGATURE ## i, *PLIGATURE ## i;

(In the current definition the extra byte is implicit for the alignment but
not declared explicitly; it is implicitly filled with zeroes by C compilers
when declaring the structure, but in my opinion this extra byte should have
been declared explicitly.)

But now it's up to the OS to support it. Maybe it works already if the
lookup in the LIGATURE(n) table already scans for values of a DWORD,
including this free padding byte; however, there's a need to change some
kernel-level code to check the PUA values mapped in DEADKEY structures and
extract a DeadKeyState from it.
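To make the constraint concrete, here is a minimal sketch, not Windows
code. The DEADKEY layout and the shape of DEADTRANS follow kbd.h (with
<stdint.h> stand-ins for the Windows types), but the DEADLIGATURE table and
the compose() function are hypothetical: they only illustrate the idea
proposed above of storing WCH_LGTR in wchComposed and fetching the real
string from a second table. Nothing in this thread shows that Windows
performs such a lookup for dead keys.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t WCHAR;   /* assumption: 16-bit code units, as on Windows */
typedef uint16_t USHORT;
typedef uint32_t DWORD;

#define WCH_LGTR 0xF002   /* value quoted in this thread */
#define MAKELONG(lo, hi) ((DWORD)(((DWORD)(hi) << 16) | (WCHAR)(lo)))

/* Same shape as DEADKEY in kbd.h: room for exactly one composed WCHAR. */
typedef struct _DEADKEY {
    DWORD  dwBoth;        /* low word: base char, high word: dead char */
    WCHAR  wchComposed;
    USHORT uFlags;
} DEADKEY;

#define DEADTRANS(ch, accent, comp, flags) { MAKELONG(ch, accent), comp, flags }

/* HYPOTHETICAL extension table: strings for entries mapped to WCH_LGTR. */
typedef struct {
    DWORD dwBoth;
    WCHAR wch[4];         /* null-padded */
} DEADLIGATURE;

static const DEADKEY aDeadKey[] = {
    DEADTRANS('O',    0x1e4d, 0x1e4c,   0),  /* precomposed result fits */
    DEADTRANS(0x0190, 0x1e4d, WCH_LGTR, 0),  /* needs three code units */
    { 0, 0, 0 }
};

static const DEADLIGATURE aDeadLigature[] = {
    { MAKELONG(0x0190, 0x1e4d), { 0x0190, 0x0303, 0x0301, 0 } },
    { 0, { 0 } }
};

/* Writes the composed code units to out (capacity >= 4).
 * Returns their count, or 0 if the pair is unmapped. */
static size_t compose(WCHAR base, WCHAR dead, WCHAR *out)
{
    DWORD key = MAKELONG(base, dead);
    assert(out != NULL);
    for (const DEADKEY *d = aDeadKey; d->dwBoth != 0; ++d) {
        if (d->dwBoth != key)
            continue;
        if (d->wchComposed != WCH_LGTR) {   /* ordinary single-WCHAR case */
            out[0] = d->wchComposed;
            return 1;
        }
        for (const DEADLIGATURE *l = aDeadLigature; l->dwBoth != 0; ++l) {
            if (l->dwBoth != key)
                continue;
            size_t n = 0;
            while (n < 4 && l->wch[n] != 0) {
                out[n] = l->wch[n];
                ++n;
            }
            return n;
        }
        return 0;   /* WCH_LGTR without a matching string entry */
    }
    return 0;
}
```

With these sample tables, compose('O', 0x1e4d, out) stores the single
precomposed code unit, whereas compose(0x0190, 0x1e4d, out) stores the
three-unit sequence that the braces in the quoted DEADTRANS line were
trying to express.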
The alternative is to map the combination of two dead keys to a bit in the
modifier number (this can be instructed by the uFlags, which will set the
modifier bit number specified in the mapped PUA). In all cases there's
still space for extension there.

The last alternative is to extend the KBDTABLES structure to append new
members for a table of extended DEADKEYs, and a separate table of LIGATUREs
for DEADKEYs (the KBDTABLES structure does not specify its own size, but it
has a fLocaleFlags field just before the table of ligatures, which can
indicate the presence of these extensions).

From verdy_p at wanadoo.fr Sat Nov 5 23:37:12 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sun, 6 Nov 2016 05:37:12 +0100
Subject: Possible to add new precomposed characters for local language in Togo?
In-Reply-To:
References: <20161104153048.665a7a7059d7ee80bb4d670165c8327d.4f17bdb7bd.wbe@email03.godaddy.com> <1745350033.7396.1478401863009.JavaMail.www@wwinf2212>
Message-ID:

Note: such an extension is absolutely necessary for scripts not encoded in
the BMP (e.g. Gothic or Deseret), or larger scripts that will absolutely
need mechanisms like dead keys if they want to have a usable keyboard
layout!

2016-11-06 5:32 GMT+01:00 Philippe Verdy :

> […]

From verdy_p at wanadoo.fr Sat Nov 5 23:40:59 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sun, 6 Nov 2016 05:40:59 +0100
Subject: Possible to add new precomposed characters for local language in Togo?
In-Reply-To:
References: <20161104153048.665a7a7059d7ee80bb4d670165c8327d.4f17bdb7bd.wbe@email03.godaddy.com> <1745350033.7396.1478401863009.JavaMail.www@wwinf2212>
Message-ID:

Another use case: being able to type Bopomofo along with Cyrillic or
Kanas...; and new extensions will be needed for the 2012 German layout and
other layouts made according to the ISO standard (you cannot do all that
you want with just a few modifier bits, with Windows only implementing a
Kana modifier key and limiting the number of modifiers supported even below
the capacity of the WORD ModificationNumber)!

2016-11-06 5:37 GMT+01:00 Philippe Verdy :

> […]

From charupdate at orange.fr Sun Nov 6 01:22:25 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Sun, 6 Nov 2016 08:22:25 +0100 (CET)
Subject: Possible to add new precomposed characters for local language in Togo?
In-Reply-To:
References: <20161104153048.665a7a7059d7ee80bb4d670165c8327d.4f17bdb7bd.wbe@email03.godaddy.com> <1745350033.7396.1478401863009.JavaMail.www@wwinf2212>
Message-ID: <143705401.185.1478416945629.JavaMail.www@wwinf2212>

On Sun, 6 Nov 2016 05:40:59 +0100, Philippe Verdy wrote:

> Another use case: being able to type Bopomofo along with Cyrillic or
> Kanas...; and new extensions will be needed for the 2012 German layout and
> other layouts made according to the ISO standard (you cannot do all what
> you want with just a few modifier bits and Windows only implementing a Kana
> modifier key and limiting the number of modifiers supported even below the
> capacity of the WORD ModificationNumber !

This does not match my experience. I’m actually using modifiers 0x10, 0x20,
0x40 and 0x80 too, and kbd.h even has names for most of them:

[kbd.h(51)]
/*
 * Keyboard Shift State defines. These correspond to the bit mask defined
 * by the VkKeyScan() API.
 */
#define KBDBASE 0
#define KBDSHIFT 1
#define KBDCTRL 2
#define KBDALT 4
// three symbols KANA, ROYA, LOYA are for FE
#define KBDKANA 8
#define KBDROYA 0x10
#define KBDLOYA 0x20
#define KBDGRPSELTAP 0x80

0x40 proves to be usable too. What I cannot understand, and others are
puzzled too, is the name KBDGRPSELTAP. It sounds as if it were an acronym
of “GRouP SELecTor APing” or the like, hence my suspicion that the
developers were asked to ape the *then new* ISO/IEC 9995-3 group selector
by implementing it as a dead key, as a *remnant* group selector. That’s
about the name only.
Much more annoying is that I’ve been unable to get any result from the
application of the related attribute:

[kbd.h(364)]
#define CAPLOK 0x01
#define SGCAPS 0x02
#define CAPLOKALTGR 0x04
// KANALOK is for FE
#define KANALOK 0x08
#define GRPSELTAP 0x80

And there is even NO COMMENT, as only the first two are mentioned in the
preceding comment:

[kbd.h(46)]
 * Special values for Attributes:
 * CAPLOK - The CAPS-LOCK key affects this key like SHIFT
 * SGCAPS - CapsLock uppercases the unshifted char (Swiss-German)

So I added 0x80 to the attribute of a key, expecting that this would
make it sensitive to the CapsLock toggle key VK_CAPITAL, because this
would match the ISO/IEC 9995 intent of having a secondary group that is
subject to CapsLock. But it did not work.

Thank you for the instructions below. I hope that the programmers on this
List know exactly how it must be translated into C so that it will compile
and the API can read the compiled binaries, and that Microsoft will make
and ship the kernel-level update you mention below with one of the very
next Windows Updates, so that all users whose Windows version stays
maintained will be able to use keyboard layouts that can input WCHAR
strings through dead keys.

Best regards,

Marcel

On Sun, 6 Nov 2016 05:37:12 +0100, Philippe Verdy wrote:

> Note: such extension is absolutely necessary for scripts not encoded in
> the BMP (e.g. Gothic or Deseret, or larger scripts that will absolutely
> need mechanisms like dead keys if they want to have a usable keyboard
> layout !)
>
> […]

From charupdate at orange.fr Sun Nov 6 12:33:39 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Sun, 6 Nov 2016 19:33:39 +0100 (CET)
Subject: Possible to add new precomposed characters for local language in Togo?
In-Reply-To: <143705401.185.1478416945629.JavaMail.www@wwinf2212>
References: <20161104153048.665a7a7059d7ee80bb4d670165c8327d.4f17bdb7bd.wbe@email03.godaddy.com> <1745350033.7396.1478401863009.JavaMail.www@wwinf2212> <143705401.185.1478416945629.JavaMail.www@wwinf2212>
Message-ID: <1565548445.5480.1478457219235.JavaMail.www@wwinf2212>

To complete this thread prior to Microsoft’s response, I’d quote in extenso
the relevant part of the Standard. Though it basically matches the actual
state of discussion, quoting it here seems useful since it highlights the
fact that if the end-users are used to dead keys (as in the francophone
regions in Africa), urging them to swap base characters and diacritics is
*not* straightforward.
Keyboard layouts without dead keys and with combining diacritics on live
keys are thus to be promoted in the anglophone regions of Africa, not the
francophone ones where layouts with string-generating dead keys seem to be
mandatory:

TUS 9.0, §5.12 (Implementation Guidelines: Strategies for Handling Nonspacing Marks), p. 222:
|
| Keyboard Input
|
| A common implementation for the input of combining character sequences is the use of
| dead keys. These keys match the mechanics used by typewriters to generate such sequences
| through overtyping the base character after the nonspacing mark. In computer implementations,
| keyboards enter a special state when a dead key is pressed for the accent and emit a
| precomposed character only when one of a limited number of “legal” base characters is
| entered. It is straightforward to adapt such a system to emit combining character sequences
| or precomposed characters as needed.
|
| Typists, especially in the Latin script, are trained on systems that work using dead keys.
| However, many scripts in the Unicode Standard (including the Latin script) may be implemented
| according to the handwriting sequence, in which users type the base character first,
| followed by the accents or other nonspacing marks (see Figure 5-4).
|

In another part, TUS mentions the downside (outdated legacy keyboard protocols):

TUS 9.0, §2.7 (General Structure: Unicode strings), p. 43:
|
| […] While an ideal protocol would allow keyboard events to contain complete strings,
| many allow only a single UTF-16 code unit per event. […]

BTW there is an obvious error in my last e-mail (quoted below):

> […] that the developers were asked to ape the *then new* ISO/IEC 9995-3 group
> selector. by implementing it as a dead key, as a *remnant* group selector.

This is not about a dead key, but about a modifier key.
Then there is a flaw, in that I didn't mention that I pressed the related
modifier:

> So I added 0x80 to the attribute of a key, expecting that this would
> make it sensitive to the CapsLock toggle key VK_CAPITAL, because this
> would match the ISO/IEC 9995 intent of having a secondary group that is
> subject to CapsLock. But it did not work.

Should read: "expecting that this would make it sensitive to CapsLock
*on the 0x80 shift state*."

Lastly, subscribers who had trouble downloading the folder from
dispoclavier.com are welcome to e-mail me off-list so that I can send
sources and/or drivers without the script that is available at
charupdate.info#drivers (translation will complete), although I'm aware
that developers using KbdUTool have typically already scripted the
automation of the process.

Marcel

On 06/11/16 08:28, I wrote:
> On Sun, 6 Nov 2016 05:40:59 +0100, Philippe Verdy wrote:
>
> > Another use case: being able to type Bopomofo along with Cyrillic or
> > Kanas...; and new extensions will be needed for the 2012 German layout and
> > other layouts made according to the ISO standard (you cannot do all what
> > you want with just a few modifier bits and Windows only implementing a Kana
> > modifier key and limiting the number of modifiers supported even below the
> > capacity of the WORD ModificationNumber !
>
> This does not match my experience. I'm actually using modifiers 0x10, 0x20,
> 0x40 and 0x80 too, and kbd.h has even names for most of them: [kbd.h(51)]
>
> /*
>  * Keyboard Shift State defines. These correspond to the bit mask defined
>  * by the VkKeyScan() API.
>  */
> #define KBDBASE        0
> #define KBDSHIFT       1
> #define KBDCTRL        2
> #define KBDALT         4
> // three symbols KANA, ROYA, LOYA are for FE
> #define KBDKANA        8
> #define KBDROYA        0x10
> #define KBDLOYA        0x20
> #define KBDGRPSELTAP   0x80
>
> 0x40 proves to be useable too. What I cannot understand, and others
> are puzzled too, is the name KBDGRPSELTAP. It sounds like it were an
> acronym of "GRouP SELecTor APing"
or the like, hence my suspicion that
> the developers were asked to ape the *then new* ISO/IEC 9995-3 group
> selector. by implementing it as a dead key, as a *remnant* group selector.
>
> That's about the name only. Much more annoying is that I've been unable
> to get any result from the application of the related attribute: [kbd.h(364)]
>
> #define CAPLOK      0x01
> #define SGCAPS      0x02
> #define CAPLOKALTGR 0x04
> // KANALOK is for FE
> #define KANALOK     0x08
> #define GRPSELTAP   0x80
>
> And there is even NO COMMENT, as only the first two are mentioned in the
> preceding comment: [kbd.h(46)]
>
>  * Special values for Attributes:
>  *     CAPLOK - The CAPS-LOCK key affects this key like SHIFT
>  *     SGCAPS - CapsLock uppercases the unshifted char (Swiss-German)
>
> So I added 0x80 to the attribute of a key, expecting that this would
> make it sensitive to the CapsLock toggle key VK_CAPITAL, because this
> would match the ISO/IEC 9995 intent of having a secondary group that is
> subject to CapsLock. But it did not work.
>
> Thank you for the instructions below. I hope that the programmers on
> this List know how exactly it must be translated into C so that it will
> be compiled and the API can read the compiled binaries, and that
> Microsoft will make and ship the kernel-level update you mention below
> with one of the very next Windows Updates so that all users whose
> Windows version stays maintained will be able to use keyboard layouts
> that can input WCHAR strings through dead keys.
>
> Best regards,
>
> Marcel
>
> On Sun, 6 Nov 2016 05:37:12 +0100, Philippe Verdy wrote:
>
> > Note: such extension is absolutely necessary for scripts not encoded in
> > the BMP (e.g. Gothic or Deseret, or larger scripts that will absolutely
> > need mechanisms like dead keys if they want to have a usable keyboard
> > layout !)
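
A minimal sketch of why the BMP matters in the note above, assuming that WCHAR is a 16-bit UTF-16 code unit as on Windows: a code point outside the BMP needs two code units (a surrogate pair), so it cannot fit in a table slot that holds a single WCHAR, such as the wchComposed member of the DEADKEY structure quoted in this thread. The helper name encode_utf16 is hypothetical and is not part of kbd.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of kbd.h: encode one Unicode code point
 * as UTF-16. Writes 1 or 2 code units to out and returns that count, or
 * returns 0 if cp is not a Unicode scalar value. */
size_t encode_utf16(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                               /* out of range, or a surrogate */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                  /* BMP: one code unit suffices */
        return 1;
    }
    cp -= 0x10000;                              /* 20 significant bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));   /* high (lead) surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* low (trail) surrogate */
    return 2;
}
```

With this sketch, U+0190 LATIN CAPITAL LETTER OPEN E takes one code unit, while U+10330 GOTHIC LETTER AHSA takes the two units 0xD800 0xDF30.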
> > > > 2016-11-06 5:32 GMT+01:00 Philippe Verdy : > > > >> > >> > >> 2016-11-06 4:11 GMT+01:00 Marcel Schneider : > >> > >>> On Fri, 04 Nov 2016 15:30:48 -0700, Doug Ewell wrote: > >>> > >>> ? And with LATIN CAPITAL LETTER OPEN E? Why not this way (as has been > >>> suggested): > >>> /*TILDE&AIGU */ DEADTRANS( 0x0190 ,0x1e4d ,{0x0190,0x0303,0x0301} ,DKF_0 > >>> ), // *LATIN CAPITAL LETTER OPEN E WITH > >>> TILDE AND ACUTE > >>> > >> > >> This snippet cannot work as is, because the DEADTRANS() macro maps > >> gernerates a 8-BYTE structure only has a single WCHAR for storing the > >> result of the map of a (VKEY+modifier number): > >> > >> typedef struct _DEADKEY { > >> DWORD dwBoth; > >> WCHAR wchComposed; > >> USHORT uFlags; > >> } DEADKEY, *PDEADKEY; > >> > >> So it will need to map a WCH_LGTR instead, and then use a "ligature" > >> table to store the string containing the 3 code units you want. > >> > >> Then there's an unused BYTE in the DEADTRANS structure for the flags, > >> that can be used (specifically for entries mapped to WCH_LGTR) to pass > >> flags to the LIGATURE(n) table (where there's also a free BYTE in the > >> indexing key, allowing to pass an identifier needed for the lookup in the > >> LIGATURE(n) table; alternatively, instead of mapping WCH_LGTR (a PUA), you > >> could as well map another PUA there in 0xE001.0xE0FF for passing a byte for > >> the deadkey state into the lookup of ligatures: > >> > >> #define TYPEDEF_LIGATURE(i) \ > >> typedef struct _LIGATURE ## i { \ > >> BYTE VirtualKey; \ > >> WORD ModificationNumber; \ > >> WCHAR wch[i]; \ > >> } LIGATURE ## i, *PLIGATURE ## i; > >> > >> which can safely be changed to: > >> > >> typedef struct _LIGATURE ## i { \ > >> BYTE VirtualKey, DeadKeyState; \ > >> WORD ModificationNumber; \ > >> WCHAR wch[i]; \ > >> } LIGATURE ## i, *PLIGATURE ## i; > >> > >> (in the current definition of the extra byte is implicit for the > >> alignment, but not declared explicitly, it is implicitly filled with 
zeroes > >> by C compilers when declaring the structure, but in my opinion this extra > >> byte should have been declared explicitly.) > >> > >> But now it's up to the OS to support it, may be it works already if the > >> lookup in the LIGATURE(n) table already scans for values of a DWORD, > >> including this free padding byte, however there's a need to change some > >> code in the kernel-level to check the PUA values mapped in DEADKEY > >> structures and extract a DeadKeyState from it. > >> > >> The alternative is to map the combination of two deadkeys to a bit in the > >> modifier number (this can be instructed by the uFlags, which will set the > >> modifier bit number specified in the mapped PUA). In all cases there's > >> still space for extension there. > >> > >> The last alternative is to extend the KBDTABLES structure to append new > >> members for a table of extended DEADKEYS, and a separate table of LIGATURE > >> for DEADKEYs (the KBDTABLE does not specify its own size, but it has a > >> fLocaleFlags field just before the table of ligatures, which can indicate > >> the presence of these extensions. > >> > > From mark at kli.org Sun Nov 6 13:17:02 2016 From: mark at kli.org (Mark E. Shoulson) Date: Sun, 6 Nov 2016 14:17:02 -0500 Subject: The (Klingon) Empire Strikes Back In-Reply-To: <20161104140236.665a7a7059d7ee80bb4d670165c8327d.ef5253d96e.wbe@email03.godaddy.com> References: <20161104140236.665a7a7059d7ee80bb4d670165c8327d.ef5253d96e.wbe@email03.godaddy.com> Message-ID: <4a495df6-9245-c5b2-1a49-abcde8496032@kli.org> On 11/04/2016 05:02 PM, Doug Ewell wrote: > Mark E. Shoulson wrote: > >> At any rate, this isn't Unicode's problem. Unicode would not be >> creating anything in Klingon anyway! > Well, to be fair, I thought IPR was the primary reason Unicode had never > encoded the Apple logo either. I doubt that whether Unicode intended to > use such a character themselves was a factor. 
(Of course, users who > really wanted that character encoded are probably using ?? or ?? > now.) > > -- > Doug Ewell | Thornton, CO, US | ewellic.org The Apple logo is just that: a logo. Unicode is/used to be explicitly NOT in the business of encoding logos, and only peripherally in the business of encoding cute Wingdings and icons. pIqaD is an *alphabet* for writing a *language*; that's a whole different situation, and one that is squarely in what Unicode is all about doing. "Should" the Apple logo have been encoded? Possibly, though there are a lot of reasons not to which do not depend specifically on IP (we'd have to encode all the other emblems of all the other computer companies also... not to mention gasoline companies, cereal companies...) Should pIqaD be encoded? It is my claim that it should, and that reasons not to are (mainly) limited to IP considerations. In which case, IP considerations need to be addressed, yes, but they should not pre-determine the decision of whether or not it's worthy of inclusion. ~mark From prosfilaes at gmail.com Sun Nov 6 16:22:16 2016 From: prosfilaes at gmail.com (David Starner) Date: Sun, 06 Nov 2016 22:22:16 +0000 Subject: The (Klingon) Empire Strikes Back In-Reply-To: <42101413.334282.1478281304520@mail.yahoo.com> References: <42101413.334282.1478281304520.ref@mail.yahoo.com> <42101413.334282.1478281304520@mail.yahoo.com> Message-ID: On Fri, Nov 4, 2016 at 10:42 AM David Faulks wrote: > There is another issue of course, which I think could be a huge obstacle: > the Trademark/Copyright issue. Paramount claims copyright over the entire > Klingon language (presumably including the script). The issue has recently > gone to court. Encoding criteria for symbols (and this likely extends to > letters) is against encoding them without the permission of the > Copyright/Trademark holder. > The US copyright office will not register letters for copyright: cf. 
http://web.archive.org/web/20160304062736/http://www.ipmall.info/hosted_resources/CopyrightAppeals/2004/Mark%20Hendricksen.pdf

So the copyright issue is not relevant here.

From asmusf at ix.netcom.com Sun Nov 6 20:16:37 2016
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Sun, 6 Nov 2016 18:16:37 -0800
Subject: The (Klingon) Empire Strikes Back
In-Reply-To:
References: <42101413.334282.1478281304520.ref@mail.yahoo.com> <42101413.334282.1478281304520@mail.yahoo.com>
Message-ID:

An HTML attachment was scrubbed...
URL:

From Shawn.Steele at microsoft.com Mon Nov 7 16:36:32 2016
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Mon, 7 Nov 2016 22:36:32 +0000
Subject: The (Klingon) Empire Strikes Back
In-Reply-To: <01275881-d53b-269d-fde9-330e7d94be37@kli.org>
References: <01275881-d53b-269d-fde9-330e7d94be37@kli.org>
Message-ID:

I guess for this thread I should subscribe to the list with a personal
email address. Please don't confuse my personal and professional opinions
here ;) (Of course I'll probably confuse them myself).

Personally, as myself, no Microsoft hat, I would be interested to see the
base characters encoded, excluding the "mummification glyph" and your 2
created characters. The mummification glyph seems decorative and I haven't
seen the others in use.

I would include the pIqaD comma and full stop, they seem to have fairly
consistent use. Their meaning is also more specific than the triangle glyph
suggestions you mentioned as possible alternatives. Since these are used in
plaintext conversations and not merely as decoration, I think that
attempting to overload the meaning of the non-pIqaD triangle glyphs would
be inappropriate.

The enthusiasts using pIqaD, and the businesses targeting that community,
have, in my opinion, reached a level of adoption that requires proper
Unicode encoding to make further progress.
The current ConScript PUA practice is a decent hack to get things to work,
but in practice there can be strange behaviors, particularly in more
advanced aspects of character behavior, such as the fact that the PUA range
doesn't properly describe the character properties of these letters and
digits.

For example, Qurgh and others figured out how to get pIqaD to behave in
Facebook posts. The current Klingon word of the day posts include the pIqaD
spelling, and some discussion happens in pIqaD as well. However, getting it
all to behave is unnecessarily awkward given some of the current
restrictions requiring using the PUA for pIqaD.

Mark, you missed that pIqaD has an ISO script code now (Piqd). That might
be worth mentioning.

The PUA encoding makes it difficult or hacky to integrate some features for
the Piqd script in computing libraries, such as digit conversion routines.

Professionally, I'm not sure if Microsoft has a current position on pIqaD.
As noted by Mark, the Bing Translator allows the use of pIqaD (tlh-Piqd),
both for input and output. I chose to use the ConScript PUA for that
feature. Had the pIqaD script been included in Unicode, we would have used
the assigned Unicode codepoints instead of the ConScript PUA.

-Shawn

???? ?????
http://blogs.msdn.com/shawnste
http://bb-8.blogspot.com

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Mark Shoulson
Sent: ???????, ???????? 03, ??? 2016 16:44
To: unicode at unicode.org
Subject: The (Klingon) Empire Strikes Back

At the time of writing this letter it has not yet hit the UTC Document
Register, but I have recently submitted a document revisiting the
ever-popular issue of the encoding of Klingon "pIqaD". The reason always
given why it could not be encoded was that it did not enjoy enough usage,
and so I've collected a bunch of examples to demonstrate that this is not
true (scans and also web pages, etc.) So the issue comes back up, and time
to talk about it again.
Michael Everson: I basically copied your 1997 proposal into the document,
with some minor changes. I hope you don't mind. And if you don't want to be
on the hook for providing the glyphs to UTC, I can do that.

I think that proposal should serve as a starting-point for discussion
anyway. There are some things that maybe should be different:

1. the "SYMBOL FOR EMPIRE" also known as the "MUMMIFICATION GLYPH". I don't
   know where the second name comes from, I don't know how important it is
   to encode it, and I don't know how much of a trademark headache it will
   cause with Paramount, as it is used pretty heavily in their imagery.
   Something we'll have to talk about.

2. I put in the COMMA and FULL STOP, which were not in the original
   proposal but were in the ConScript registry entry. The examples I have
   show them clearly being used. UTC may decide to unify them with existing
   triangular shapes, which may or may not be a good idea.

3. For my part, I've invented a pair of ampersands for Klingon (Klingon has
   two words for "and": one for joining verbs/sentences and one for joining
   nouns (the former goes between its "conjunctands", the latter after
   them)), from ligatures of the letters in question. They pretty much have
   NO usage, of course (and are not in the proposal), but maybe they should
   be presented to the community.

Document is available at http://web.meson.org/downloads/pIqaDReturns.pdf

Let the bickering begin!

~mark

From doug at ewellic.org Mon Nov 7 16:59:36 2016
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 07 Nov 2016 15:59:36 -0700
Subject: The (Klingon) Empire Strikes Back
Message-ID: <20161107155936.665a7a7059d7ee80bb4d670165c8327d.1b935c880e.wbe@email03.godaddy.com>

Shawn Steele wrote:

> The PUA encoding makes it difficult or hacky to integrate some
> features for the Piqd script in computing libraries, such as digit
> conversion routines.
Although somebody did create a Ewellic calculator for iOS that uses the ConScript encoding: https://itunes.apple.com/us/app/calculator-ewellic/id850838052 -- Doug Ewell | Thornton, CO, US | ewellic.org From mark at kli.org Mon Nov 7 18:46:09 2016 From: mark at kli.org (Mark E. Shoulson) Date: Mon, 7 Nov 2016 19:46:09 -0500 Subject: The (Klingon) Empire Strikes Back In-Reply-To: References: <42101413.334282.1478281304520.ref@mail.yahoo.com> <42101413.334282.1478281304520@mail.yahoo.com> Message-ID: <933c21bf-89ea-5078-eef7-7e0453cf02b6@kli.org> Thanks, Asmus. The document from the copyright office is pretty explicit and final, and it is pretty clear that you can't copyright an *alphabet*, that is *characters*. You can copyright *glyphs* (a font), but that is another matter entirely. I've heard that there are similar questions regarding tengwar and cirth, but it is notable that UTC *did* see fit to consider this question for them and determine that they were worthy of encoding (they are on the roadmap), even though they have not actually followed through on that yet, perhaps because of these very IP concerns. Notably, pIqaD is not only not on the roadmap, it is specifically listed on the "Not on the Roadmap" page as an example of something that was not deemed worthy of being on the roadmap. If it's an IP issue, then someone will have to explain to me why it applies so asymmetrically to Tolkien and Klingon (and Blissymbolics, for that matter). And yes, these are not the only writing systems with these issues and will not be the last. One way or another, the question will have to be faced and dealt with one way or another; ignoring it won't help. ~mark On 11/06/2016 09:16 PM, Asmus Freytag wrote: > On 11/6/2016 2:22 PM, David Starner wrote: >> >> >> On Fri, Nov 4, 2016 at 10:42 AM David Faulks > > wrote: >> >> There is another issue of course, which I think could be a huge >> obstacle: the Trademark/Copyright issue. 
Paramount claims >> copyright over the entire Klingon language (presumably including >> the script). The issue has recently gone to court. Encoding >> criteria for symbols (and this likely extends to letters) is >> against encoding them without the permission of the >> Copyright/Trademark holder. >> >> >> The US copyright office will not register letters for copyright: cf. >> http://web.archive.org/web/20160304062736/http://www.ipmall.info/hosted_resources/CopyrightAppeals/2004/Mark%20Hendricksen.pdf >> So the copyright issue is not relevant here. > > On the face of it, the cited statement seems to very broadly reject > the copyrightability of alphabets and writing systems, tracing that > decision back to statements of intent around the copyright legislation. > > Given that, I'd tend to concur with Doug that UTC should feel free to > discuss this on the merit, but that in the case of a positive outcome > the Consortium would of course have counsel review this issue. Given > that this won't be the only writing system for which the original > invention post-dates modern IP laws, it would probably be good to have > some clarity here. > > A./ > From richard.wordingham at ntlworld.com Tue Nov 8 02:30:25 2016 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 8 Nov 2016 08:30:25 +0000 Subject: Multiple Preposed Marks Message-ID: <20161108083025.47a4784c@JRWUBU2> TUS Section 2.11 says, "If the combining characters can interact typographically?for example, U+0304 combining macron and U+0308 combining diaeresis ? then the order of graphic display is determined by the order of coded characters (see Table 2-5). By default, the diacritics or other combining characters are positioned from the base character?s glyph outward". 
So, if I have two spacing combining marks E and O that are each positioned to the left of the base (say X) in a left-to-right script, so that the encodings and appear with the glyph orders and , and codings and , if not total gibberish, represent a horizontal sequence of the glyphs with gX on the right, should render as or ?

The phonetics and collation (in so far as it is meaningful) of the words
provide no cue as to the order of the encoded characters. I have
encountered both renderings.

The issue came up when I was checking, in both the Firefox and MS Edge
browsers, that my OpenType Tai Tham font Da Lekh could handle all the
headwords of two Northern Thai dictionaries. (Sparing dotted circle
deletion and orthographic syllable reunification are tricky.) One of the
dictionaries spells a few words with a combination of the Tai and Pali
notations for the vowel /o:/ in open syllables where one might expect to
see an independent vowel.

I'm down to two other rendering engine issues - a combination of tone mark
and then vowel in 4 words, where the dictionary probably has a misspelling,
and the need for an OpenType feature (probably a cvXX) for inconsistent
handling of U+1A58 MAI KANG LAI. The latter may be a challenge - I couldn't
persuade MS Edge to use the font's Lao shaping for the Tai Tham script or
for the Latin script in a transliteration mode. (That mode is triggered by
feature ss02 for the Latin script, and that works well enough in browsers.)

Richard.

From richard.wordingham at ntlworld.com Tue Nov 8 03:09:45 2016
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 8 Nov 2016 09:09:45 +0000
Subject: Suppressing Ligation of Spacing Marks
Message-ID: <20161108090945.2f92771d@JRWUBU2>

Should it be possible to suppress the ligation of a base character and a
visually following spacing mark in plain text?

The example I have in mind is the sequence . It may be desirable to suppress the ligation because both ligands have subscript consonants.
However, if I write , the Universal Shaping Engine decides that the ZWNJ triggers a new syllable, and inserts a dotted circle before SIGN AA. (The dotted circle after SIGN AA results from a failure to read the proposal for the Lanna script as it was then called.) Richard. From jcb+unicode at inf.ed.ac.uk Tue Nov 8 05:58:26 2016 From: jcb+unicode at inf.ed.ac.uk (Julian Bradfield) Date: Tue, 8 Nov 2016 11:58:26 +0000 (GMT) Subject: The (Klingon) Empire Strikes Back References: <42101413.334282.1478281304520.ref@mail.yahoo.com> <42101413.334282.1478281304520@mail.yahoo.com> <933c21bf-89ea-5078-eef7-7e0453cf02b6@kli.org> Message-ID: On 2016-11-08, Mark E. Shoulson wrote: > I've heard that there are similar questions regarding tengwar and cirth, > but it is notable that UTC *did* see fit to consider this question for > them and determine that they were worthy of encoding (they are on the > roadmap), even though they have not actually followed through on that > yet, perhaps because of these very IP concerns. Notably, pIqaD is not The Tolkien Estate considers that the tengwar constitute a work of art, and it's not willing to see them in Unicode, because this would hinder its ability to pursue people using tengwar for what it considers inappropriate purposes. (I finally asked them a couple of years ago for permission to encode, based on Michael Everson's draft proposal from yonks ago, and that's the summary of their reply.) Several years ago, I was told on this list that it would be up to the proposers to deal with this, and that the Unicode Consortium would have no interest in taking on the 800lb legal gorilla that is the Tolkien Estate. (Now a 24M? gorilla with what it got from New Line Cinema.) If some wealthy Unicode Consortium member feels like paying for an American counsel's opinion that the Estate is just trying it on, feel free! -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. 
From c933103 at gmail.com Tue Nov 8 08:02:05 2016
From: c933103 at gmail.com (gfb hjjhjh)
Date: Tue, 8 Nov 2016 22:02:05 +0800
Subject: The (Klingon) Empire Strikes Back
In-Reply-To: <42101413.334282.1478281304520@mail.yahoo.com>
References: <42101413.334282.1478281304520.ref@mail.yahoo.com> <42101413.334282.1478281304520@mail.yahoo.com>
Message-ID:

I believe there's already a court ruling that says languages and words are
not copyrightable, in the case about Loglan, although the trademarkability
of a language is another matter.

On 5 Nov 2016 at 01:42, "David Faulks" wrote:

>
> On Thu, 11/3/16, Mark Shoulson wrote:
> > Subject: The (Klingon) Empire Strikes Back
>
> > At the time of writing this letter it has not yet hit the UTC
> > Document Register, but I have recently submitted a document
> > revisiting the ever-popular issue of the encoding of Klingon
> > "pIqaD". The reason always given why it could not be
> > encoded was that it did not enjoy enough usage, and so I've
> > collected a bunch of examples to demonstrate that this is not
> > true (scans and also web pages, etc.) So the issue comes
> > back up, and time to talk about it again.
>
> There is another issue of course, which I think could be a huge obstacle:
> the Trademark/Copyright issue. Paramount claims copyright over the entire
> Klingon language (presumably including the script). The issue has recently
> gone to court. Encoding criteria for symbols (and this likely extends to
> letters) is against encoding them without the permission of the
> Copyright/Trademark holder.
>
> Is Paramount endorsing your proposal?
>
> > ~mark
>
> David Faulks
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From richard.wordingham at ntlworld.com Tue Nov 8 14:30:26 2016
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 8 Nov 2016 20:30:26 +0000
Subject: Multiple Preposed Marks
In-Reply-To: <20161108083025.47a4784c@JRWUBU2>
References: <20161108083025.47a4784c@JRWUBU2>
Message-ID: <20161108203026.2a56cb1d@JRWUBU2>

On Tue, 8 Nov 2016 08:30:25 +0000
Richard Wordingham wrote:

> and the need for an OpenType feature (probably a cvXX)
> for inconsistent handling of U+1A58 MAI KANG LAI. The latter may be a
> challenge - I couldn't persuade MS Edge to use the font's Lao shaping

General features (e.g. 'ss01') for Tai Tham work a treat in MS Edge, and
seem to be executed at the same time as the 'standard typographical
presentation', e.g. feature 'psts'. Thank you! That makes things much
easier. (There seems to be quite a bit of variation in layout in Chiang Mai
province, never mind the rest of the region.)

Richard.

From charupdate at orange.fr Tue Nov 8 15:02:16 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Tue, 8 Nov 2016 22:02:16 +0100 (CET)
Subject: Multiple Preposed Marks
In-Reply-To: <20161108203026.2a56cb1d@JRWUBU2>
References: <20161108083025.47a4784c@JRWUBU2> <20161108203026.2a56cb1d@JRWUBU2>
Message-ID: <1173503927.19308.1478638936401.JavaMail.www@wwinf1c23>

On Tue, 8 Nov 2016 21:36, Richard Wordingham wrote:
>
> On Tue, 8 Nov 2016 08:30:25 +0000
> Richard Wordingham wrote:
>
> > and the need for an OpenType feature (probably a cvXX)
> > for inconsistent handling of U+1A58 MAI KANG LAI. The latter may be a
> > challenge - I couldn't persuade MS Edge to use the font's Lao shaping
>
> General features (e.g. 'ss01') for Tai Tham work a treat in MS Edge, and
> seem to be executed at the same time time as the 'standard typographical
> presentation', e.g feature 'psts'. Thank you! That makes things much
> easier. [...]

"Where there's a will, there's a way!"
Marcel

From verdy_p at wanadoo.fr Tue Nov 8 17:00:01 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 9 Nov 2016 00:00:01 +0100
Subject: Multiple Preposed Marks
In-Reply-To: <20161108083025.47a4784c@JRWUBU2>
References: <20161108083025.47a4784c@JRWUBU2>
Message-ID:

2016-11-08 9:30 GMT+01:00 Richard Wordingham <richard.wordingham at ntlworld.com>:

> TUS Section 2.11 says, "If the combining characters can interact
> typographically - for example, U+0304 combining macron and U+0308
> combining diaeresis - then the order of graphic display is
> determined by the order of coded characters (see Table 2-5).
> By default, the diacritics or other combining characters are
> positioned from the base character's glyph outward".
>

The interpretation of "If the combining characters can interact
typographically" should be better read as "If the combining characters have
the same non-zero combining class or any one of them has a zero combining
class".

Effectively the combining classes were historically intended to track these
possible graphic interactions, in order to allow or disable reordering and
detect canonical equivalences.

But now normalization is everywhere and causes the pairs using the
condition above to be freely reordered (or decomposed and recomposed,
meaning that the encoding order is NOT significant at all).

But it turned out that some diacritics may be positioned differently
according to their base character. Take the cedilla, which may interact
below: no interaction is supposed with other combining characters that
normally interact above (so that reordering to canonical equivalents is
permitted and in fact made automatically during the encoding/decoding
processes of documents), but with some Latin letters these interactions do
occur. The only way then to block the reordering (if you don't want the
positions inferred from the encoding order of normalized strings) is to
block it using zero-combining joiners (CGJ).
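
As a hedged illustration of the reordering and CGJ blocking discussed above, here is a minimal sketch in C of the canonical ordering step of normalization. It assumes a tiny hard-coded table of canonical combining classes (U+0301, U+0304 and U+0308 have class 230, U+0327 has class 202, and U+034F COMBINING GRAPHEME JOINER has class 0); a real implementation would take the classes from the Unicode Character Database. The function names ccc and canonical_order are illustrative only. Adjacent marks are exchanged only when both have non-zero classes and the first class is greater than the second, so marks of equal class keep their encoded order and a class-0 character such as CGJ stops marks from moving past it.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative table only: canonical combining classes for a handful of
 * characters. Everything else, including base letters and U+034F CGJ,
 * is treated as class 0. */
int ccc(uint32_t cp)
{
    switch (cp) {
    case 0x0301: return 230;  /* COMBINING ACUTE ACCENT */
    case 0x0304: return 230;  /* COMBINING MACRON */
    case 0x0308: return 230;  /* COMBINING DIAERESIS */
    case 0x0327: return 202;  /* COMBINING CEDILLA */
    default:     return 0;
    }
}

/* Put s[0..n-1] into canonical order in place. This is a stable insertion
 * sort in which a mark only moves left past marks of strictly greater
 * non-zero class, and never past a class-0 character. */
void canonical_order(uint32_t *s, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        size_t j = i;
        while (j > 0 && ccc(s[j]) != 0 && ccc(s[j - 1]) > ccc(s[j])) {
            uint32_t tmp = s[j - 1];
            s[j - 1] = s[j];
            s[j] = tmp;
            j--;
        }
    }
}
```

Under these assumptions, the sequence c, U+0301, U+0327 is reordered to c, U+0327, U+0301, while c, U+0301, U+034F, U+0327 is left alone because the CGJ sits between the two marks, and a, U+0304, U+0308 is left alone because both marks have class 230.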
This sentence should have been updated long ago in TUS, because TUS does
not really know how characters will be positioned, and Unicode permits
reordering of pairs of diacritics if they are not blocking each other for
normalization. This is important for the cedilla, but even more important
for Hebrew diacritics, whose combining classes do not really track their
relative positioning correctly (as discussed on this list years ago, and
known as the "Hebrew points bug"). This will never change: the combining
classes are assigned permanently and continue to work for simple cases, but
will cause problems with some pairs needing insertions of CGJ.

This is also important for several Indic scripts that have complex
positioning rules if you use combining characters with non-zero combining
classes (initially intended for simple cases in Latin/Greek/Cyrillic).
Thankfully, the most critical diacritics in Indic scripts for such complex
cases have a combining class set to zero (meaning that they block each
other and their relative encoding order is not affected by normalization),
but there are many cases where CGJ is needed.

From richard.wordingham at ntlworld.com Tue Nov 8 17:42:53 2016
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 8 Nov 2016 23:42:53 +0000
Subject: Multiple Preposed Marks
In-Reply-To:
References: <20161108083025.47a4784c@JRWUBU2>
Message-ID: <20161108234253.52544213@JRWUBU2>

On Wed, 9 Nov 2016 00:00:01 +0100
Philippe Verdy wrote:

> 2016-11-08 9:30 GMT+01:00 Richard Wordingham <
> richard.wordingham at ntlworld.com>:
>
> > TUS Section 2.11 says, "If the combining characters can interact
> > typographically - for example, U+0304 combining macron and U+0308
> > combining diaeresis - then the order of graphic display is
> > determined by the order of coded characters (see Table 2-5).
> > By default, the diacritics or other combining characters are > > positioned from the base character?s glyph outward". > The interpretation of "If the combining characters can interact > typographically" should be better read as "If the combining > characters have the same non-zero combining class or any one of them > has a zero combining class". The combining marks in question both have canonical combining class 0. > But now normalization is everywhere and causes the pairs using the > condition above to be freely reordered (or decomposed and recomposed, > meaning that the encoding order is NOT significant at all). I believe a renderer is permitted to treat canonically equivalent sequence differently so long as it does not believe it should treat them differently. However, that is irrelevant to this case. Richard. From verdy_p at wanadoo.fr Tue Nov 8 20:26:51 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 9 Nov 2016 03:26:51 +0100 Subject: Multiple Preposed Marks In-Reply-To: <20161108234253.52544213@JRWUBU2> References: <20161108083025.47a4784c@JRWUBU2> <20161108234253.52544213@JRWUBU2> Message-ID: 2016-11-09 0:42 GMT+01:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > On Wed, 9 Nov 2016 00:00:01 +0100 > Philippe Verdy wrote: > > > 2016-11-08 9:30 GMT+01:00 Richard Wordingham < > > richard.wordingham at ntlworld.com>: > > > > > TUS Section 2.11 says, "If the combining characters can interact > > > typographically?for example, U+0304 combining macron and U+0308 > > > combining diaeresis ? then the order of graphic display is > > > determined by the order of coded characters (see Table 2-5). > > > By default, the diacritics or other combining characters are > > > positioned from the base character?s glyph outward". 
>
> > The interpretation of "If the combining characters can interact
> > typographically" should be better read as "If the combining
> > characters have the same non-zero combining class or any one of them
> > has a zero combining class".
>
> The combining marks in question both have canonical combining class 0.
>
> > But now normalization is everywhere and causes the pairs using the
> > condition above to be freely reordered (or decomposed and recomposed,
> > meaning that the encoding order is NOT significant at all).
>
> I believe a renderer is permitted to treat canonically equivalent
> sequence differently so long as it does not believe it should treat
> them differently. However, that is irrelevant to this case.
>

This is DIRECTLY relevant to the sentence in TUS you quoted, which is all about combining characters that are encoded after the base letter, often have non-zero combining classes, and are reorderable.

But evidently this sentence in TUS is not relevant to "prepended" combining marks, which all have combining class 0. Here "prepended" means encoded before the base character, as opposed to encoded after it while visually combining before it, as is the case for well-known Indic vowels that now have non-zero combining classes that allow them to be reordered before other combining marks when normalizing, while still remaining encoded after the base consonant.
What I want to say is that this sentence in TUS is quite ambiguous: it speaks about graphic interaction, but this is not really encoded in text sequences, and it forgets the effect of combining classes on combining sequences, which NEVER consider any actual graphic interaction. That is simply because the interaction is not specified, and the actual graphic interactions may depend on font styles (notably in honorific Arabic typography using very complex layouts, but even within the Latin script when using decorated font styles or custom ligatures where complex interactions also occur, including on larger spans than clusters, such as full words).

From verdy_p at wanadoo.fr Tue Nov 8 20:42:07 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 9 Nov 2016 03:42:07 +0100
Subject: Suppressing Ligation of Spacing Marks
In-Reply-To: <20161108090945.2f92771d@JRWUBU2>
References: <20161108090945.2f92771d@JRWUBU2>
Message-ID:

Inserting some zero-width word joiner or disjoiner should work with this... But if you see a dotted circle, you need to encode some zero-width space as the base holder for the combining vowel sign following it. However, I wonder whether fonts accept zero-width holders for combining vowels; they could still assume that there's no matching base consonant and thus insert another base dotted circle.

There's no consensus across scripts for using the same null-base holder acting as a pseudo-consonant for vowels encoded after it (e.g. Hangul has its own jamo holder for this because of its specific algorithmic composition, but some other scripts also use such null holders for their own orthography).

In alphabetic scripts, the ZWNJ should work.
But in Indic scripts we all depend on the capability of renderers to support specific scripts with only specific subsets of base letters, and every other character outside this subset will trigger the insertion of a dotted circle glyph. ZWJ/ZWNJ already has script-specific uses within clusters for some distinctions (notably to control how parts of clusters are subgrouped...).

You'll need to "bug" the maintainers of the renderer if they forgot necessary cases described earlier for the script when it was initially approved for encoding.

2016-11-08 10:09 GMT+01:00 Richard Wordingham <
richard.wordingham at ntlworld.com>:

> Should it be possible to suppress the ligation of a base character and
> a visually following spacing mark in plain text?
>
> The example I have in mind is the sequence U+1A63 TAI THAM VOWEL SIGN AA>. It may be desirable to suppress the
> ligation because both ligands have subscript consonants. However, if
> I write , the Universal Shaping Engine
> decides that the ZWNJ triggers a new syllable, and inserts a dotted
> circle before SIGN AA. (The dotted circle after SIGN AA results from a
> failure to read the proposal for the Lanna script as it was then
> called.)
>
> Richard.
>

From unicode at lindenbergsoftware.com Wed Nov 9 07:10:34 2016
From: unicode at lindenbergsoftware.com (Norbert Lindenberg)
Date: Wed, 9 Nov 2016 22:10:34 +0900
Subject: Suppressing Ligation of Spacing Marks
In-Reply-To: <20161108090945.2f92771d@JRWUBU2>
References: <20161108090945.2f92771d@JRWUBU2>
Message-ID:

The part of the specification of the Universal Shaping Engine [1] that deals with ZWNJ is a bit unclear, but I read it to mean that ZWNJ should not cause the insertion of a dotted circle if the character following it has general category Mn or Mc.

The USE specification says: "The zero-width non-joiner is used to prevent a fusion of two characters.
It continues a preceding cluster but causes a cluster break after itself when the following character is not a mark character (gc=Mn or gc=Mc)."

The specification does not say how this character should be handled in cluster validation. I assume first that the statement about the combining grapheme joiner also applies to ZWNJ: "CGJ has been omitted from the above schema in order to avoid unnecessary complexity". I further interpret the little the spec does say about ZWNJ to imply that it should be allowed before any character with general category Mn or Mc, without affecting the validity of the cluster. Inserting a dotted circle would be equivalent to causing a cluster break, which the spec rules out when the following character has general category Mn or Mc.

U+1A63 has gc=Mc, so it shouldn't be preceded by a dotted circle in the sequence . Note that I omitted the first ??? from the sequence you provided, because an intervening character might trigger the dotted circle.

So this may just be a bug in the implementation of the USE that you're using. I see this bug in Safari (CoreText), but not in Firefox (Harfbuzz); haven't tried Edge. Which one are you using?

[1] http://www.microsoft.com/typography/OpenTypeDev/USE/intro.htm

Best regards,
Norbert

> On Nov 8, 2016, at 18:09 , Richard Wordingham wrote:
>
> Should it be possible to suppress the ligation of a base character and
> a visually following spacing mark in plain text?
>
> The example I have in mind is the sequence U+1A63 TAI THAM VOWEL SIGN AA>. It may be desirable to suppress the
> ligation because both ligands have subscript consonants. However, if
> I write , the Universal Shaping Engine
> decides that the ZWNJ triggers a new syllable, and inserts a dotted
> circle before SIGN AA. (The dotted circle after SIGN AA results from a
> failure to read the proposal for the Lanna script as it was then
> called.)
>
> Richard.
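
A minimal sketch of the reading given above, assuming Python 3 and its standard unicodedata module. The helper name zwnj_breaks_cluster is hypothetical and is not part of any shaping engine; it only restates the quoted USE sentence as a test on the general category of the character that follows the ZWNJ.

```python
import unicodedata

ZWNJ = "\u200C"          # ZERO WIDTH NON-JOINER
SIGN_AA = "\u1A63"       # TAI THAM VOWEL SIGN AA
HIGH_KA = "\u1A20"       # TAI THAM LETTER HIGH KA, used here only as a non-mark

def zwnj_breaks_cluster(following: str) -> bool:
    """Hypothetical helper, not a real USE API.

    Under the quoted sentence, ZWNJ causes a cluster break after itself
    only when the following character is not a mark (gc=Mn or gc=Mc).
    """
    return unicodedata.category(following) not in ("Mn", "Mc")

# General category of the vowel sign discussed in the thread.
print(unicodedata.category(SIGN_AA))
# A mark after ZWNJ: on this reading the cluster continues.
print(zwnj_breaks_cluster(SIGN_AA))
# A letter after ZWNJ: on this reading a new cluster starts.
print(zwnj_breaks_cluster(HIGH_KA))
```

This says nothing about what a particular renderer actually does; it only makes explicit which property of the following character the interpretation depends on.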
> From richard.wordingham at ntlworld.com Wed Nov 9 13:53:35 2016 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 9 Nov 2016 19:53:35 +0000 Subject: Suppressing Ligation of Spacing Marks In-Reply-To: References: <20161108090945.2f92771d@JRWUBU2> Message-ID: <20161109195335.349183b3@JRWUBU2> On Wed, 9 Nov 2016 22:10:34 +0900 Norbert Lindenberg wrote: > The part of the specification of the Universal Shaping Engine [1] > that deals with ZWNJ is a bit unclear, but I read it to mean that > ZWNJ should not cause the insertion of a dotted circle if the > character following it has general category Mn or Mc. > > The USE specification says: "The zero-width non-joiner is used to > prevent a fusion of two characters. It continues a preceding cluster > but causes a cluster break after itself when the following character > is not a mark character (gc=Mn or gc=Mc).? > > The specification does not say how this character should be handled > in cluster validation. I assume first that the statement about the > combining grapheme joiner also applies to ZWNJ: ?CGJ has been omitted > from the above schema in order to avoid unnecessary complexity?. I > further interpret the little the spec does say about ZWNJ to imply > that it should be allowed before any character with general category > Mn or Mc, without affecting the validity of the cluster. Inserting a > dotted circle would be equivalent to causing a cluster break, which > the spec rules out when the following character has general category > Mn or Mc. That makes sense, but I was hoping for an opinion independent of the Microsoft policy. > U+1A63 has gc=Mc, so it shouldn?t be preceded by a dotted circle in > the sequence . Note that I omitted the first > ??? from the sequence you provided, because an intervening character > might trigger the dotted circle. The word, meaning 'to foretell' can be seen at http://www.wrdingham.co.uk/lanna/renderer_test.htm . The full encoding of the syllable is . 
MS Edge, running on an evaluation copy of Windows 10 kindly provided for checking web page displays in MS Edge, inserts dotted circles after* ZWNJ and before the second SAKOT. The second insertion is because USE does not recognise Indic CVC orthographic syllables, which make up about half the native vocabulary in the region. Pali is less badly affected, though one can't write _nibb?na_ 'nirvana' properly and the Tai Khuen may be unhappy with how they have to write _dhamma_ 'dharma' and its compounds in Pali. *I know it's after because of the 'shaping' in the Da Lekh font, which eliminates the vast bulk of the dotted circles misinserted by USE, whose specification is wrong. > So this may just be a bug in the implementation of the USE that > you?re using. I see this bug in Safari (CoreText), but not in Firefox > (Harfbuzz); haven?t tried Edge. Which one are you using? MS Edge (see above). The dotted circle behaviour of HarfBuzz and MS Edge is different - I have dotted circle lookups in my font dedicated to HarfBuzz patterns that don't occur in MS Edge. I haven't checked my font to destruction yet (6 marks will generally overwhelm it); I've just thrown two Northern Thai dictionaries at it. Richard. From richard.wordingham at ntlworld.com Wed Nov 9 14:27:42 2016 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 9 Nov 2016 20:27:42 +0000 Subject: Multiple Preposed Marks In-Reply-To: References: <20161108083025.47a4784c@JRWUBU2> <20161108234253.52544213@JRWUBU2> Message-ID: <20161109202742.06df65c6@JRWUBU2> On Wed, 9 Nov 2016 03:26:51 +0100 Philippe Verdy wrote: > 2016-11-09 0:42 GMT+01:00 Richard Wordingham < > richard.wordingham at ntlworld.com>: > > I believe a renderer is permitted to treat canonically equivalent > > sequence differently so long as it does not believe it should treat > > them differently. However, that is irrelevant to this case. 
> This is DIRECTLY relevant to the sentence in TUS you quoted, which is > all about combining characters encoded after the base letter and > often have non-zero combining classes and are reorderable As you pointed out, it most clearly addresses the case of two combining marks with the same canonical combining class, and obviously in such a case the sequence is not reorderable. > But evidently this sentence in TUS is not relevant to "prepended" > combining marks that are all with combining class 0, here "prepended" > meaning: encoded before the base character, but not after it even if > they are visually combining before it, as is the case for wellknown > Indic vowels that have now non-zero combining classes that allow them > to be reordered before other combining marks when normalizing, but > still remaining encoded after the base consonnant). I can't guess what you mean: (a) The combining marks in question *follow* the base consonant, but are rendered before it. 'Preposition' is a property of abstract characters, not of codepoints. (b) All characters with an Indic Positional Category of 'left' (or similar) have canonical combining class 0. There is a simple example of the base outwards rule in the Tai Tham script. The only way of encoding Northern Thai /p???/ 'to chan?e' with the glyphs of U+1A38 TAI THAM LETTER HIGH PA, U+1A55 TAI THAM CONSONANT SIGN MEDIAL RA and U+1A6F TAI THAM VOWEL SIGN AE acceptable to the Universal Shaping engine is , and the visual order is the reverse of the encoding order. Unfortunately, it could be argued that the encoding order is independent of the visual order. Richard. 
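
The points about canonical combining classes in this exchange can be checked with a short sketch, assuming Python 3 and its standard unicodedata module. The choice of base letters ("a", "c") and the variable names are illustrative assumptions; the marks are the ones named in the thread (macron, diaeresis, cedilla) plus the three Tai Tham characters from the message above.

```python
import unicodedata

def ccc(ch: str) -> int:
    """Canonical combining class of a single character."""
    return unicodedata.combining(ch)

MACRON, DIAERESIS = "\u0304", "\u0308"   # both in the same class (above)
ACUTE, CEDILLA = "\u0301", "\u0327"      # different classes

# Same non-zero class: normalization keeps the encoded order,
# so the two orders stay distinct.
s1 = unicodedata.normalize("NFD", "a" + MACRON + DIAERESIS)
s2 = unicodedata.normalize("NFD", "a" + DIAERESIS + MACRON)
same_class_distinct = s1 != s2

# Different non-zero classes: normalization sorts the marks by class,
# so both encoded orders become the same sequence.
t1 = unicodedata.normalize("NFD", "c" + ACUTE + CEDILLA)
t2 = unicodedata.normalize("NFD", "c" + CEDILLA + ACUTE)
different_class_equivalent = t1 == t2

# The Tai Tham characters named above: HIGH PA, MEDIAL RA, SIGN AE.
tai_tham_classes = [ccc(c) for c in ("\u1A38", "\u1A55", "\u1A6F")]

print(ccc(MACRON), ccc(DIAERESIS), same_class_distinct)
print(ccc(ACUTE), ccc(CEDILLA), different_class_equivalent)
print(tai_tham_classes)
```

Marks with class 0 are never moved by canonical reordering, which is why the relative order of the two preposed Tai Tham marks survives normalization.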
From verdy_p at wanadoo.fr Wed Nov 9 15:23:28 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 9 Nov 2016 22:23:28 +0100 Subject: Multiple Preposed Marks In-Reply-To: <20161109202742.06df65c6@JRWUBU2> References: <20161108083025.47a4784c@JRWUBU2> <20161108234253.52544213@JRWUBU2> <20161109202742.06df65c6@JRWUBU2> Message-ID: 2016-11-09 21:27 GMT+01:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > On Wed, 9 Nov 2016 03:26:51 +0100 > Philippe Verdy wrote: > > > 2016-11-09 0:42 GMT+01:00 Richard Wordingham < > > richard.wordingham at ntlworld.com>: > > > > I believe a renderer is permitted to treat canonically equivalent > > > sequence differently so long as it does not believe it should treat > > > them differently. However, that is irrelevant to this case. > > > This is DIRECTLY relevant to the sentence in TUS you quoted, which is > > all about combining characters encoded after the base letter and > > often have non-zero combining classes and are reorderable > > As you pointed out, it most clearly addresses the case of two combining > marks with the same canonical combining class, and obviously in such a > case the sequence is not reorderable. > > > But evidently this sentence in TUS is not relevant to "prepended" > > combining marks that are all with combining class 0, here "prepended" > > meaning: encoded before the base character, but not after it even if > > they are visually combining before it, as is the case for wellknown > > Indic vowels that have now non-zero combining classes that allow them > > to be reordered before other combining marks when normalizing, but > > still remaining encoded after the base consonnant). > > I can't guess what you mean: > (a) The combining marks in question *follow* the base consonant, but are > rendered before it. 'Preposition' is a property of abstract > characters, not of codepoints. > > (b) All characters with an Indic Positional Category of 'left' (or > similar) have canonical combining class 0. 
>

Reread: I clearly distinguished these two cases, explicitly saying that "PREPENDED" meant case (b). And yes, I also said explicitly that these had combining class 0 and that they were therefore not subject to mutual reordering.

But the TUS sentence that YOU quoted falls completely under case (a), where "combining marks" may still appear before but are always encoded after, and where they are freely (indistinctly) reorderable if they have distinct non-zero combining classes: these combining characters then have no well-defined mutual positions. In these cases they are "supposed" not to "interact typographically" (because they were encoded with distinct positional combining classes), but this turns out to be wrong in various cases, notably for Hebrew diacritics (between vowel points and other points modifying the consonant) and for several kinds of Indic diacritics (mixes of vowels, half-vowels, and "liquid" half-consonants, and within consonant clusters). There are also some complex cases when using non-Indic diacritics over Indic letters/clusters.

For all these cases (a), CGJ must be used to block the possible reorderings, so that the layout of clusters can be composed with the expected typographic interactions when such interactions can effectively occur. The **effective** relative position is DEFINITELY NOT explicitly encoded in any of these combining characters with non-zero combining classes. The property names of those classes, like "above" or "below", are counter-intuitive: they only work in the most frequent simple cases where there's a single diacritic after a base letter, and for most base letters but not all, and without any consideration of the possible creation of ligatures and complex clusters, notably in traditional Arabic, or in decorative typographies for almost all scripts including Latin!
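
A minimal sketch of how CGJ blocks canonical reordering, assuming Python 3 and its standard unicodedata module. The Hebrew sequence (lamed with patah followed by hiriq) is an illustrative choice that does not come from this thread; it is used only because the two points have different non-zero combining classes.

```python
import unicodedata

LAMED = "\u05DC"   # HEBREW LETTER LAMED
PATAH = "\u05B7"   # HEBREW POINT PATAH
HIRIQ = "\u05B4"   # HEBREW POINT HIRIQ
CGJ = "\u034F"     # COMBINING GRAPHEME JOINER, combining class 0

# Encoded order patah, hiriq: normalization sorts the points by class.
plain = LAMED + PATAH + HIRIQ
nfc_plain = unicodedata.normalize("NFC", plain)

# With CGJ between the points, neither can move past it.
blocked = LAMED + PATAH + CGJ + HIRIQ
nfc_blocked = unicodedata.normalize("NFC", blocked)

print(unicodedata.combining(PATAH), unicodedata.combining(HIRIQ))
print([f"U+{ord(c):04X}" for c in nfc_plain])
print([f"U+{ord(c):04X}" for c in nfc_blocked])
```

Whether a given font or renderer then positions the two points as intended is a separate question that normalization does not answer.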
If you're still not convinced, look at how complex typographies are used for "the name of God" in various religions and denominations (it's not just the case of the Hebrew "tetragram"). You can also look at "calligrammes" where the usual script layout is completely relaxed and where diacritics may be moved anywhere around words and not necessarily near the base letter; it is impossible to represent this typography with characters and their Unicode properties. Indic scripts however have formalized some of these freedoms of placements using complex positioning rules that are part of their most common form.

From petercon at microsoft.com Wed Nov 9 22:49:12 2016
From: petercon at microsoft.com (Peter Constable)
Date: Thu, 10 Nov 2016 04:49:12 +0000
Subject: The (Klingon) Empire Strikes Back
In-Reply-To: <27f48fcf-9ebc-8363-3b27-6540a242d375@kli.org>
References: <42101413.334282.1478281304520.ref@mail.yahoo.com> <42101413.334282.1478281304520@mail.yahoo.com> <27f48fcf-9ebc-8363-3b27-6540a242d375@kli.org>
Message-ID:

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Mark E. Shoulson
Sent: Friday, November 4, 2016 1:18 PM

> At any rate, this isn't Unicode's problem?

You saying that potential IP issues are not Unicode's problem does not in fact make it not a problem. A statement in writing from authorized Paramount representatives stating it would not be a problem for either Unicode, its members or implementers of Unicode would make it not a problem for Unicode.

Peter

From mats.gbproject at gmail.com Thu Nov 10 05:47:41 2016
From: mats.gbproject at gmail.com (Mats Blakstad)
Date: Thu, 10 Nov 2016 12:47:41 +0100
Subject: Dataset for all ISO639 code sorted by country/territory?
In-Reply-To: <20160920093425.665a7a7059d7ee80bb4d670165c8327d.219e1cf756.wbe@email03.godaddy.com> References: <20160920093425.665a7a7059d7ee80bb4d670165c8327d.219e1cf756.wbe@email03.godaddy.com> Message-ID: On 20 September 2016 at 18:34, Doug Ewell wrote: > > Is there any dataset that contains all languages in the world sorted > > by country/territory? > > As others have pointed out, be careful about how slippery this slope can > get. Everyone has his or her own opinion about how many speakers of > Language X in country Y need to be identified, estimated, or conjectured > in order to say that "language X is spoken in country Y." > For myself I was not actually considering the amount of speakers in each country, but to map languages with countries/territories where the language originated or have been spoken traditionally. For instance in Norway we do have many immigrants from Pakistan, but I doubt any of them would expect to see Urdu sorted under Norway, even though there are many people in Norway that speak Urdu. They would expect to see it under Pakistan that is a their heritage country, I guess this is a lot an identity issue also I do understand that it is not easy to get a perfect language-country mapping, and I guess the mapping also depend on the use. For myself I want people to be able to sort languages by country/territories to make it easier to make lists of translations, I think it can be good to be able to sort by territories instead of providing a looong list of languages. So I guess what matters is which language people mostly expect to find under the country/territory. > > > I manage to find a dataset on the website of Ethnologue, though it > > doesn't look like open source, need to check with them exactly how I'm > > allowed to use it: > > http://www.ethnologue.com/codes/download-code-tables > > The readme file included in the downloadable zip file makes SIL's terms > very clear. 
Basically you need to credit SIL as the source of the data,
> not change it, and not make the data directly available for others to
> download. It's best not to get caught up in "open source" as if any
> other terms would make the data totally unusable.
>

I agree that a dataset is not unusable just because it is not open source, but for myself I in fact need a downloadable file! I tried contacting SIL, but they will only sell the dataset for a fee and will not give an open source license.

Would it be possible to extend this dataset to all languages and start building an open source data set for language-territory mapping?
http://www.unicode.org/cldr/charts/latest/supplemental/language_territory_information.html

From doug at ewellic.org Thu Nov 10 11:56:58 2016
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 10 Nov 2016 10:56:58 -0700
Subject: Dataset for all ISO639 code sorted by country/territory?
Message-ID: <20161110105658.665a7a7059d7ee80bb4d670165c8327d.a8ff034ef1.wbe@email03.godaddy.com>

Mats Blakstad wrote:

> For myself I was not actually considering the amount of speakers in
> each country, but to map languages with countries/territories where
> the language originated or have been spoken traditionally.

And that is where I think you'll have disagreement on the details.

> So I guess what matters is which language people mostly expect to find
> under the country/territory.

Yep, that's the challenge.

> Would it be possible to extend this dataset to all languages and start
> build an open source data set for language-territory mapping?
> http://www.unicode.org/cldr/charts/latest/supplemental/language_territory_information.html

That's a good question for the CLDR folks, who have their own mailing list.

Keep in mind that the CLDR table documents 675 of the world's best-known languages, counting variants such as three different orthographies of Uzbek.
While anything is possible, extending this to "all languages," e.g. the other 6,300 lesser-known living languages, might require a bit of time and money. There is also a resource in the "UDHR in Unicode" project that might be worth investigating, though it too is an imperfect match with what you seem to be looking for. -- Doug Ewell | Thornton, CO, US | ewellic.org From Shawn.Steele at microsoft.com Thu Nov 10 12:33:55 2016 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Thu, 10 Nov 2016 18:33:55 +0000 Subject: The (Klingon) Empire Strikes Back In-Reply-To: References: <42101413.334282.1478281304520.ref@mail.yahoo.com> <42101413.334282.1478281304520@mail.yahoo.com> <27f48fcf-9ebc-8363-3b27-6540a242d375@kli.org> Message-ID: More generally, does that mean that alphabets with perceived owners will only be considered for encoding with permission from those owner(s)? What if the ownership is ambiguous or unclear? Getting permission may be a lot of work, or cost money, in some cases. Will applications be considered pending permission, perhaps being provisionally approved until such permission is received? Is there specific language that Unicode would require from owners to be comfortable in these cases? It makes little sense for a submitter to go through a complex exercise to request permission if Unicode is not comfortable with the wording of the permission that is garnered. Are there other such agreements that could perhaps be used as templates? Historically, the message pIqaD supporters have heard from Unicode has been that pIqaD is a toy script that does not have enough use. The new proposal attempts to respond to those concerns, particularly since there is more interest in the script now. Now, additional (valid) concerns are being raised. 
In Mark?s case it seems like it would be nice if Unicode could consider the rest of the proposal and either tentatively approve it pending Paramount?s approval, or to provide feedback as to other defects in the proposal that would need addressed for consideration. Meanwhile Mark can figure out how to get Paramount?s agreement. -Shawn From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Peter Constable Sent: Wednesday, November 9, 2016 8:49 PM To: Mark E. Shoulson ; David Faulks Cc: Unicode Mailing List Subject: RE: The (Klingon) Empire Strikes Back From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Mark E. Shoulson Sent: Friday, November 4, 2016 1:18 PM > At any rate, this isn't Unicode's problem? You saying that potential IP issues are not Unicode?s problem does not in fact make it not a problem. A statement in writing from authorized Paramount representatives stating it would not be a problem for either Unicode, its members or implementers of Unicode would make it not a problem for Unicode. Peter -------------- next part -------------- An HTML attachment was scrubbed... URL: From andrewcwest at gmail.com Thu Nov 10 13:25:50 2016 From: andrewcwest at gmail.com (Andrew West) Date: Thu, 10 Nov 2016 19:25:50 +0000 Subject: Dataset for all ISO639 code sorted by country/territory? In-Reply-To: <20161110105658.665a7a7059d7ee80bb4d670165c8327d.a8ff034ef1.wbe@email03.godaddy.com> References: <20161110105658.665a7a7059d7ee80bb4d670165c8327d.a8ff034ef1.wbe@email03.godaddy.com> Message-ID: On 10 November 2016 at 17:56, Doug Ewell wrote: > > Keep in mind that the CLDR table documents 675 of the world's best-known > languages, counting variants such as three different orthographies of > Uzbek. Oddly, it seems that there are over 1.2 billion speakers of Cantonese in China, but no speakers of Mandarin (the biggest language by number of speakers in the world). 
Andrew From Shawn.Steele at microsoft.com Thu Nov 10 13:34:53 2016 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Thu, 10 Nov 2016 19:34:53 +0000 Subject: Dataset for all ISO639 code sorted by country/territory? In-Reply-To: <20161110105658.665a7a7059d7ee80bb4d670165c8327d.a8ff034ef1.wbe@email03.godaddy.com> References: <20161110105658.665a7a7059d7ee80bb4d670165c8327d.a8ff034ef1.wbe@email03.godaddy.com> Message-ID: I didn't really say anything because this is kinda a hopeless task, but it seems like some realities are being overlooked. I'm as curious about cataloguing everything as the next OCD guy, but a general solution doesn't seem practical. * There are a *lot* of languages * Many countries have speakers of several languages. * In the US it's "obvious" that a list of languages for the US should include "English" * Spanish in the US is less obvious, however it is often considered important. * However, that's a slippery slope as there are many other languages with large groups of speakers in the US. If such a list includes Spanish, should it not include some of the others? San Francisco requires documents in 4 languages but provides telephone help for 200 languages. Where's the line? * Some languages happen in many places. There are a disproportionate # of Englishes in CLDR, however Chinese is also spoken in lots of the countries that have English available in CLDR. Yet CLDR doesn't provide data for those. * Some language/region combinations could encounter geopolitical issues. Like "it's not legal for that language to be spoken in XX" (but it happens). Or "that language isn't YY country's language, it's ours!!!" * The requirement "where the language has been spoken traditionally" is really, really subjective. "Traditionally" the US is an English speaking country. However, "Traditionally", there are hundreds of languages that have been spoken in the US. What could be more "traditional" than the native American languages? 
Yet those often have low numbers of speakers in the modern world, many are even dying languages. There are also a number of "traditional" languages spoken by the original settlers. Which differ than the set of languages spoken by modern immigrants. So your data is going to be very skewed depending on the person collecting the data's definition of "traditional". Ethnologue has done a decent job of identifying languages and the number of speakers in various areas, but it would be very difficult to draw a line that selected "English and Spanish in the US" and was consistent with similar real-life impacts across the other languages. Do you pick the top n languages for each country? Languages with > x million speakers (that would be very different in small and big countries). Languages with > y% of the speakers in the different countries? And then you end up with each application having to figure out it's own bar. Applications will have different market considerations and other reasons to target different regions/languages. That would skew any list for their purposes. -Shawn From mark at macchiato.com Thu Nov 10 13:34:51 2016 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Thu, 10 Nov 2016 11:34:51 -0800 Subject: The (Klingon) Empire Strikes Back In-Reply-To: References: <42101413.334282.1478281304520.ref@mail.yahoo.com> <42101413.334282.1478281304520@mail.yahoo.com> <27f48fcf-9ebc-8363-3b27-6540a242d375@kli.org> Message-ID: The committee doesn't "tentatively approve, pending X". But the good news is that I think it was the sense of the committee that the evidence of use for Klingon is now sufficient, and the rest of the proposal was in good shape (other than the lack of a date), so really only the IP stands in the way. I would suggest that the Klingon community work towards getting Paramount to engage with us, so that any IP issues could be settled. 
Mark Mark On Thu, Nov 10, 2016 at 10:33 AM, Shawn Steele wrote: > More generally, does that mean that alphabets with perceived owners will > only be considered for encoding with permission from those owner(s)? What > if the ownership is ambiguous or unclear? > > > > Getting permission may be a lot of work, or cost money, in some cases. > Will applications be considered pending permission, perhaps being > provisionally approved until such permission is received? > > > > Is there specific language that Unicode would require from owners to be > comfortable in these cases? It makes little sense for a submitter to go > through a complex exercise to request permission if Unicode is not > comfortable with the wording of the permission that is garnered. Are there > other such agreements that could perhaps be used as templates? > > > > Historically, the message pIqaD supporters have heard from Unicode has > been that pIqaD is a toy script that does not have enough use. The new > proposal attempts to respond to those concerns, particularly since there is > more interest in the script now. Now, additional (valid) concerns are > being raised. > > > > In Mark?s case it seems like it would be nice if Unicode could consider > the rest of the proposal and either tentatively approve it pending > Paramount?s approval, or to provide feedback as to other defects in the > proposal that would need addressed for consideration. Meanwhile Mark can > figure out how to get Paramount?s agreement. > > > > -Shawn > > > > *From:* Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of *Peter > Constable > *Sent:* Wednesday, November 9, 2016 8:49 PM > *To:* Mark E. Shoulson ; David Faulks < > davidj_faulks at yahoo.ca> > *Cc:* Unicode Mailing List > *Subject:* RE: The (Klingon) Empire Strikes Back > > > > *From:* Unicode [mailto:unicode-bounces at unicode.org > ] *On Behalf Of *Mark E. Shoulson > *Sent:* Friday, November 4, 2016 1:18 PM > > > At any rate, this isn't Unicode's problem? 
>
> You saying that potential IP issues are not Unicode's problem does not in
> fact make it not a problem. A statement in writing from authorized
> Paramount representatives stating it would not be a problem for either
> Unicode, its members or implementers of Unicode would make it not a problem
> for Unicode.
>
> Peter
>

From verdy_p at wanadoo.fr Fri Nov 11 03:31:17 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 11 Nov 2016 10:31:17 +0100
Subject: The (Klingon) Empire Strikes Back
In-Reply-To:
References: <42101413.334282.1478281304520.ref@mail.yahoo.com> <42101413.334282.1478281304520@mail.yahoo.com> <27f48fcf-9ebc-8363-3b27-6540a242d375@kli.org>
Message-ID:

As Unicode will actually not encode the language itself, but just the characters, there's no problem at all in terms of IP, except for the representative glyphs if they use the protected graphic designs. Everything else is free, including the names that Unicode will choose for designating the characters, or the single English term for designating the script itself.

Then what will be challenging is not to support the script in software, but to render it with fonts. If people use the script to create their own texts, their text will be free, but it will not be possible to get it rendered with the protected glyph designs. But supporters will be inventive and will create their own designs.

So the final difficulty in encoding the script will be to produce a glyph chart in the standard and publish it under the Unicode or ISO copyright. I assume that this chart will require approval by the IP holder or some fair (but permanent) licensing to Unicode and ISO.
For other users of the standard, they are in a position equivalent to other scripts, where charts are **also** protected by the copyright of the standard and the rights attached to the fonts used and embedded in the PDF documents: they cannot use the glyphs directly to derive their fonts. They have to create and support fonts with their own designs.

Then whether the script is used in texts conveying protected works in the matching language, or for representing texts in unrelated languages, will have no importance: the IP rights supposedly attached to the "language" are in the works published, and these must be significantly large enough and inventive to be subject to a copyright, or a patent right, or to a "sui generis" database right, or must have a valid registration in an applicable registry to be subject to a trademark right.

But even if these rights exist, they won't cover the individual characters, and the Unicode character database or standard (that will reference some elements related to the original work covered by IP) are separate creations/inventions not covered by any earlier rights: this is only a very small set of external references, and if Paramount claims that these references are infringing, they can just as well be removed: we don't really need direct references to Paramount (not even by a URL or some other hypertext link). If Paramount refuses to be cited, then it could just stop its own activities, as no one will be able to talk about and advertise their works, which will be unsellable... I doubt it will ever occur; however, we should honor the correct credits (fair and anyway required for any citations).

From charupdate at orange.fr Fri Nov 11 16:35:09 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Fri, 11 Nov 2016 23:35:09 +0100 (CET)
Subject: Possible to add new precomposed characters for local language in Togo?
In-Reply-To: <1565548445.5480.1478457219235.JavaMail.www@wwinf2212>
References: <20161104153048.665a7a7059d7ee80bb4d670165c8327d.4f17bdb7bd.wbe@email03.godaddy.com> <1745350033.7396.1478401863009.JavaMail.www@wwinf2212> <143705401.185.1478416945629.JavaMail.www@wwinf2212> <1565548445.5480.1478457219235.JavaMail.www@wwinf2212>
Message-ID: <1031067964.11192.1478903709471.JavaMail.www@wwinf1e26>

On Fri, 04 Nov 2016 15:30:48 -0700, Doug Ewell wrote:

> I am seeking technical information from a Microsoft team member.
> Hopefully we will soon have definitive answers to replace all the
> controversy.

For lack of anything better, and faced with Microsoft's one week's silence, I now suggest to make a wider use of the Vietnamese text representation scheme that Microsoft implemented for Vietnamese, that is documented in TUS [1], and that might be of wider interest for all tone mark using languages, including but not limited to Ga and other languages of Togo and other countries of Africa, and Lithuanian:

• Vowels with diacritics that are not tone marks, e. g. 6 out of the 12 Vietnamese vowels as shown in Figure 7-3. of TUS 9.0 [2] are represented in NFC and entered either with live keys or with a dead key - live key combination;

• Tone marks are added as combining diacritics with live keys after the vowels.

Based on what I got and found, I believe that languages in Anglophone African countries use digraphs rather than diacritics, and that adding tone marks after the base letter could make for a consistent and already partially implemented [3] worldwide standard.

Still we don't know why Microsoft isn't willing to upgrade its input framework for support of strings through dead keys, since Philippe Verdy's findings show that there must be a way of doing it even without upgrading to XML layout definitions?

Marcel

[1] The Unicode Standard 9.0, ch. 7 Europe-I, §7.1 Latin, sh.
Vietnamese: http://www.unicode.org/versions/Unicode9.0.0/ch07.pdf#G19663

[2] http://www.unicode.org/versions/Unicode9.0.0/ch07.pdf#G17544

[3] Cf. the already cited Unified Bambara-French keyboard layout (in French): http://www.mali-pense.net/IMG/pdf/le-clavier_francais-bambara.pdf
Linked on the Resources for Bambara Practice page of Mali-Pense (in French): http://www.mali-pense.net/Ressources-pour-la-pratique-du.html

From mark at kli.org Sun Nov 13 15:56:30 2016
From: mark at kli.org (Mark E. Shoulson)
Date: Sun, 13 Nov 2016 16:56:30 -0500
Subject: The (Klingon) Empire Strikes Back
In-Reply-To:
References: <42101413.334282.1478281304520.ref@mail.yahoo.com> <42101413.334282.1478281304520@mail.yahoo.com> <933c21bf-89ea-5078-eef7-7e0453cf02b6@kli.org>
Message-ID: <64982934-ca5b-5d29-7210-7e8fa3a27e50@kli.org>

On 11/08/2016 06:58 AM, Julian Bradfield wrote:

> On 2016-11-08, Mark E. Shoulson wrote:
>> I've heard that there are similar questions regarding tengwar and cirth,
>> but it is notable that UTC *did* see fit to consider this question for
>> them and determine that they were worthy of encoding (they are on the
>> roadmap), even though they have not actually followed through on that
>> yet, perhaps because of these very IP concerns. Notably, pIqaD is not
> The Tolkien Estate considers that the tengwar constitute a work of
> art, and it's not willing to see them in Unicode, because this would
> hinder its ability to pursue people using tengwar for what it
> considers inappropriate purposes. (I finally asked them a couple of
> years ago for permission to encode, based on Michael Everson's draft
> proposal from yonks ago, and that's the summary of their reply.)

I've said it before: if we could get pIqaD at least on the same footing as tengwar, that would be a step in the right direction.
Saying they're in a similar fix is (currently) blatantly contradicted by the facts, and we might as well clear up whatever *else* it is that's holding pIqaD back, and then see about IP problems. It sounds like some progress is being made on this front.

~mark

From mark at kli.org Sun Nov 13 15:59:25 2016
From: mark at kli.org (Mark E. Shoulson)
Date: Sun, 13 Nov 2016 16:59:25 -0500
Subject: The (Klingon) Empire Strikes Back
In-Reply-To:
References: <42101413.334282.1478281304520.ref@mail.yahoo.com> <42101413.334282.1478281304520@mail.yahoo.com> <27f48fcf-9ebc-8363-3b27-6540a242d375@kli.org>
Message-ID:

On 11/09/2016 11:49 PM, Peter Constable wrote:

> *From:*Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of
> *Mark E. Shoulson
> *Sent:* Friday, November 4, 2016 1:18 PM
>
> > At any rate, this isn't Unicode's problem?
>
> You saying that potential IP issues are not Unicode's problem does not
> in fact make it not a problem. A statement in writing from authorized
> Paramount representatives stating it would not be a problem for either
> Unicode, its members or implementers of Unicode would make it not a
> problem for Unicode.
>
> Peter

That's a fair point; any problems arising from this *would* affect Unicode. I guess what I was trying to say is that such an issue, while a problem once encoding proceeds, should not affect the determination of whether or not the encoding is *warranted*.

~mark

From mark at kli.org Sun Nov 13 16:10:22 2016
From: mark at kli.org (Mark E. Shoulson)
Date: Sun, 13 Nov 2016 17:10:22 -0500
Subject: The (Klingon) Empire Strikes Back
In-Reply-To:
References: <42101413.334282.1478281304520.ref@mail.yahoo.com> <42101413.334282.1478281304520@mail.yahoo.com> <27f48fcf-9ebc-8363-3b27-6540a242d375@kli.org>
Message-ID: <59a89be5-1359-f5b7-905f-108c22c6e189@kli.org>

On 11/10/2016 02:34 PM, Mark Davis ☕️
wrote: > The committee doesn't "tentatively approve, pending X". > > But the good news is that I think it was the sense of the committee > that the evidence of use for Klingon is now sufficient, and the rest > of the proposal was in good shape (other than the lack of a date), so > really only the IP stands in the way. Fair enough. There have, I think, been other cases of this sort of informal "tentative approval", usually involving someone from UTC telling the proposer, "your proposal is okay, but you probably need to change this..." And that's about the best I could hope for at this point anyway. So it sounds like (correct me if I'm wrong) there is at least unofficial recognition that pIqaD *should* be encoded, and that it's mainly an IP problem now (like with tengwar), and possibly some minor issues that maybe hadn't been addressed properly in the proposal. Can we get pIqaD removed from http://www.unicode.org/roadmaps/not-the-roadmap/ then? And (dare I ask) perhaps enshrined someplace in http://www.unicode.org/roadmaps/smp/ pending further progress with Paramount? > I would suggest that the Klingon community work towards getting > Paramount to engage with us, so that any IP issues could be settled. I'll see what we can come up with; have to start somewhere. There is a VERY good argument to be made that Paramount doesn't actually have the right to stop the encoding, as you can't copyright an alphabet (as we have seen), and they don't have a current copyright to "Klingon" in this domain, etc., and it may eventually come down to these arguments. However, I recognize that having a good argument on your side, and indeed even having the law on your side, does not guarantee smooth sailing when the other guys have a huge well-funded legal department on their side, and thus I understand UTC's reluctance to move forward without better legal direction. But at least we can say we've made progress, can't we? 
~mark

From duerst at it.aoyama.ac.jp Tue Nov 15 02:23:58 2016
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Tue, 15 Nov 2016 17:23:58 +0900
Subject: Possible to add new precomposed characters for local language in Togo?
In-Reply-To: <1031067964.11192.1478903709471.JavaMail.www@wwinf1e26>
References: <20161104153048.665a7a7059d7ee80bb4d670165c8327d.4f17bdb7bd.wbe@email03.godaddy.com> <1745350033.7396.1478401863009.JavaMail.www@wwinf2212> <143705401.185.1478416945629.JavaMail.www@wwinf2212> <1565548445.5480.1478457219235.JavaMail.www@wwinf2212> <1031067964.11192.1478903709471.JavaMail.www@wwinf1e26>
Message-ID:

Hello Marcel,

On 2016/11/12 07:35, Marcel Schneider wrote:

> For lack of anything better, and faced with Microsoft's one week's silence, I
> now suggest to make a wider use of the Vietnamese text representation scheme
> that Microsoft implemented for Vietnamese, that is documented in TUS [1], and
> that might be of wider interest for all tone mark using languages, including
> but not limited to Ga and other languages of Togo and other countries of Africa,
> and Lithuanian:
>
> • Vowels with diacritics that are not tone marks, e. g. 6 out of the 12 Vietnamese
> vowels as shown in Figure 7-3. of TUS 9.0 [2] are represented in NFC and entered
> either with live keys or with a dead key - live key combination;
> [1] The Unicode Standard 9.0, ch. 7 Europe-I, §7.1 Latin, sh.
Vietnamese:
> http://www.unicode.org/versions/Unicode9.0.0/ch07.pdf#G19663
>
> [2] http://www.unicode.org/versions/Unicode9.0.0/ch07.pdf#G17544

I'm sorry, but I didn't get the fragment identifiers (#G19663, #G17544) to work. Can you tell me which pages/paragraphs you refer to here?

Thanks and regards,
Martin.

From charupdate at orange.fr Tue Nov 15 04:38:23 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Tue, 15 Nov 2016 11:38:23 +0100 (CET)
Subject: Possible to add new precomposed characters for local language in Togo?
Message-ID: <1902272544.4809.1479206303384.JavaMail.www@wwinf1g20>

Hi Martin,

On Tue, 15 Nov 2016 17:23:58 +0900, Martin J. Dürst wrote:

[…]
> I'm sorry, but I didn't get the fragment identifiers (#G19663, #G17544)
> to work. Can you tell me which pages/paragraphs you refer to here?

Sorry for the omission of the page number! In the document pagination of TUS 9.0 it's on page 296, that is page 332 of the full PDF, or page 7 of the Chapter 7 PDF. On this page, I'm referring to the first three paragraphs, and to the second figure.

As for working with PDF fragment identifiers, I must confess that I'm unable to do this in Adobe Reader, neither grabbing nor input, despite the side pane working fine for browsing. To open a fragment following its identifier in a local copy, I must open the PDF in a web browser, add the ID in the URL bar, and refresh the document; that works fine in Chrome. To grab an ID of a TUS fragment, I open the PDF in Firefox, display the side pane in TOC mode, and copy the URI of the bookmark.
Best regards,
Marcel

From doug at ewellic.org Tue Nov 15 10:47:00 2016
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 15 Nov 2016 09:47:00 -0700
Subject: Possible to add new precomposed characters for local language in Togo?
Message-ID: <20161115094700.665a7a7059d7ee80bb4d670165c8327d.e8e451dcd6.wbe@email03.godaddy.com>

Marcel Schneider wrote:

> For lack of anything better, and faced with Microsoft's one week's
> silence, I now suggest to make a wider use of the Vietnamese text
> representation scheme that Microsoft implemented for Vietnamese, that
> is documented in TUS [1],

The entire "documentation" of this approach in Section 7.1 of TUS is:

"Some widely used implementations prefer storing the vowel letter and the tone mark separately."

That said,

> and that might be of wider interest for all tone mark using languages,
> including but not limited to Ga and other languages of Togo and other
> countries of Africa, and Lithuanian:
>
> • Vowels with diacritics that are not tone marks, e. g. 6 out of the
> 12 Vietnamese vowels as shown in Figure 7-3. of TUS 9.0 [2] are
> represented in NFC and entered either with live keys or with a dead
> key - live key combination;
>
> • Tone marks are added as combining diacritics with live keys after
> the vowels.

As long as implementations can deal with text that is not strictly NFC, this seems like a sensible way to support multiple diacritical marks while remaining compatible with existing dead-key implementations (Mats stated that compatibility with the existing French layout was a requirement) and existing architectural constraints.
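
To make the "not strictly NFC" point concrete, here is a minimal Python sketch of the representation described above: the vowel with its non-tone diacritic is kept precomposed, and the tone mark follows as a separate combining character. It assumes Latin-script text and takes the tone marks to be the five Vietnamese ones (U+0300, U+0301, U+0303, U+0309, U+0323); the names TONE_MARKS and to_vowel_plus_tone are made up for this illustration, and only the standard unicodedata module is used. It is not a description of how any shipping keyboard or Microsoft component works.

```python
import unicodedata

# Assumption for this sketch: the Vietnamese tone marks
# (grave, acute, tilde, hook above, dot below).
TONE_MARKS = {"\u0300", "\u0301", "\u0303", "\u0309", "\u0323"}


def to_vowel_plus_tone(text: str) -> str:
    """Rewrite text so each tone mark trails a precomposed base letter.

    Hypothetical helper: decomposes to NFD, pulls the tone marks out of
    each base+marks cluster, recomposes what is left with NFC, and then
    appends the tone marks as combining characters.
    """
    out = []
    cluster = []  # base letter plus its non-tone combining marks
    tones = []    # tone marks found in the current cluster

    def flush() -> None:
        if cluster or tones:
            base = unicodedata.normalize("NFC", "".join(cluster))
            out.append(base + "".join(tones))
            cluster.clear()
            tones.clear()

    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) == 0:
            flush()              # a new base character starts a new cluster
            cluster.append(ch)
        elif ch in TONE_MARKS:
            tones.append(ch)     # held back, re-attached after the base
        else:
            cluster.append(ch)   # e.g. circumflex, breve, horn
    flush()
    return "".join(out)


word = "Vi\u1ec7t"                      # "Việt" in NFC
stored = to_vowel_plus_tone(word)       # "Vi" + U+00EA + U+0323 + "t"
print([f"U+{ord(c):04X}" for c in stored])
print(unicodedata.normalize("NFC", stored) == word)
```

The output of to_vowel_plus_tone is canonically equivalent to the input but is not in NFC, so any process that renormalizes to NFC will fold the tone mark back into a single precomposed letter. That is the caveat above: the scheme only survives in pipelines that tolerate (and do not silently undo) non-NFC text.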
-- Doug Ewell | Thornton, CO, US | ewellic.org

From petercon at microsoft.com Tue Nov 15 11:22:41 2016
From: petercon at microsoft.com (Peter Constable)
Date: Tue, 15 Nov 2016 17:22:41 +0000
Subject: The (Klingon) Empire Strikes Back
In-Reply-To: <59a89be5-1359-f5b7-905f-108c22c6e189@kli.org>
References: <42101413.334282.1478281304520.ref@mail.yahoo.com> <42101413.334282.1478281304520@mail.yahoo.com> <27f48fcf-9ebc-8363-3b27-6540a242d375@kli.org> <59a89be5-1359-f5b7-905f-108c22c6e189@kli.org>
Message-ID:

Klingon _should not_ be encoded so long as there are open IP issues. For that reason, I think it would be premature to place it in the roadmap.

Peter

From doug at ewellic.org Tue Nov 15 11:39:37 2016
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 15 Nov 2016 10:39:37 -0700
Subject: The (Klingon) Empire Strikes Back
Message-ID: <20161115103937.665a7a7059d7ee80bb4d670165c8327d.455ad40709.wbe@email03.godaddy.com>

Peter Constable wrote:

> Klingon _should not_ be encoded so long as there are open IP issues.
> For that reason, I think it would be premature to place it in the
> roadmap.

But Mark's point about removing it from the "Not the Roadmap" page, which categorizes it among "Scripts (or pseudoscripts) which have been investigated and rejected as unsuitable for encoding," may be a valid one. There is a difference between "unsuitable for encoding" and "might turn out to be unencodable due to IP issues."
-- Doug Ewell | Thornton, CO, US | ewellic.org From Shawn.Steele at microsoft.com Tue Nov 15 11:44:13 2016 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Tue, 15 Nov 2016 17:44:13 +0000 Subject: The (Klingon) Empire Strikes Back In-Reply-To: <20161115103937.665a7a7059d7ee80bb4d670165c8327d.455ad40709.wbe@email03.godaddy.com> References: <20161115103937.665a7a7059d7ee80bb4d670165c8327d.455ad40709.wbe@email03.godaddy.com> Message-ID: I'm a little confused. I thought that the primary reason Cirth and Tengwar were on the roadmap - and not actually encoded - were because of the IP concerns? (I confess to not following them very closely, so I may be wrong.) -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Doug Ewell Sent: Tuesday, November 15, 2016 9:40 AM To: Unicode Mailing List Cc: Mark Shoulson ; Peter Constable Subject: RE: The (Klingon) Empire Strikes Back Peter Constable wrote: > Klingon _should not_ be encoded so long as there are open IP issues. > For that reason, I think it would be premature to place it in the > roadmap. But Mark's point about removing it from the "Not the Roadmap" page, which categorizes it among "Scripts (or pseudoscripts) which have been investigated and rejected as unsuitable for encoding," may be a valid one. There is a difference between "unsuitable for encoding" and "might turn out to be unencodable due to IP issues." 
-- Doug Ewell | Thornton, CO, US | ewellic.org From petercon at microsoft.com Tue Nov 15 11:49:11 2016 From: petercon at microsoft.com (Peter Constable) Date: Tue, 15 Nov 2016 17:49:11 +0000 Subject: The (Klingon) Empire Strikes Back In-Reply-To: <20161115103937.665a7a7059d7ee80bb4d670165c8327d.455ad40709.wbe@email03.godaddy.com> References: <20161115103937.665a7a7059d7ee80bb4d670165c8327d.455ad40709.wbe@email03.godaddy.com> Message-ID: I was responding to this: > And (dare I ask) perhaps enshrined someplace in http://www.unicode.org/roadmaps/smp/ pending further progress with Paramount? Peter -----Original Message----- From: Doug Ewell [mailto:doug at ewellic.org] Sent: Tuesday, November 15, 2016 9:40 AM To: Unicode Mailing List Cc: Mark Shoulson ; Peter Constable Subject: RE: The (Klingon) Empire Strikes Back Peter Constable wrote: > Klingon _should not_ be encoded so long as there are open IP issues. > For that reason, I think it would be premature to place it in the > roadmap. But Mark's point about removing it from the "Not the Roadmap" page, which categorizes it among "Scripts (or pseudoscripts) which have been investigated and rejected as unsuitable for encoding," may be a valid one. There is a difference between "unsuitable for encoding" and "might turn out to be unencodable due to IP issues." -- Doug Ewell | Thornton, CO, US | ewellic.org From asmusf at ix.netcom.com Tue Nov 15 12:21:09 2016 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Tue, 15 Nov 2016 10:21:09 -0800 Subject: The (Klingon) Empire Strikes Back In-Reply-To: References: <42101413.334282.1478281304520.ref@mail.yahoo.com> <42101413.334282.1478281304520@mail.yahoo.com> <27f48fcf-9ebc-8363-3b27-6540a242d375@kli.org> <59a89be5-1359-f5b7-905f-108c22c6e189@kli.org> Message-ID: <5b602cd1-53c8-181a-5ca4-0470ce36b92e@ix.netcom.com> An HTML attachment was scrubbed... 
URL: From mark at macchiato.com Tue Nov 15 18:31:42 2016 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Tue, 15 Nov 2016 17:31:42 -0700 Subject: The (Klingon) Empire Strikes Back In-Reply-To: <5b602cd1-53c8-181a-5ca4-0470ce36b92e@ix.netcom.com> References: <42101413.334282.1478281304520.ref@mail.yahoo.com> <42101413.334282.1478281304520@mail.yahoo.com> <27f48fcf-9ebc-8363-3b27-6540a242d375@kli.org> <59a89be5-1359-f5b7-905f-108c22c6e189@kli.org> <5b602cd1-53c8-181a-5ca4-0470ce36b92e@ix.netcom.com> Message-ID: > However, it appears relatively settled that one cannot claim copyright in an alphabet... We know that these parties tend to be litigious, so we have to be careful. "relatively settled" is not good enough. We do not want to be the ones responsible (and liable) for making a determination as to whether that is settled. Nor do we want to pay the legal fees necessary to make a water-tight determination. That is why if there is any question as to the IP issues, we leave it up to the proposers to get absolutely rock-solid clearance (eg from the Tolkien estate for Tengwar, or from Paramount for Klingon). The only other alternative I can think of is if the proposers provide indemnification for any legal costs that could obtain from a legal suit of us or our vendors. Mark On Tue, Nov 15, 2016 at 11:21 AM, Asmus Freytag wrote: > On 11/15/2016 9:22 AM, Peter Constable wrote: > > Klingon _*should not*_ be encoded so long as there are open IP issues. > For that reason, I think it would be premature to place it in the roadmap. > > > > Peter, > > I certainly sympathize with the fact that the Consortium wants to avoid > being drawn into litigation, and that even litigation based on unsustained > IP claims could be costly. 
> > However, it appears relatively settled that one cannot claim copyright in > an alphabet; one of the roles of the Unicode Consortium in this regard > would be to reach a formal decision whether this is, in fact, an > alphabet/script (and one that, based on the usual criteria of usage) is > acceptable for encoding. > > Ducking this particular determination serves no-one. > > This does not mean that publication would have to be immediate; there's > certainly room for something like an approval to include a script in "some" > future version of the standard, which would allow all parties to figure out > how to deal with any IP issues. (Note that this would not be a decision, > "pending" anything, merely separating approval of a script proposal from a > decision of the contents for a particular version - something that used to > be rather routine in earlier years). > > I would also like to point out that Unicode would be well served by taking > a stronger position on the issue of IP claims on writing systems, in > particular copyright claims. These seem to be unfounded at least under US > law; should Unicode nevertheless allow such unfounded claims become a way > to veto the encoding of any script/writing system (or script extension)? > > As we move on, the number of cases where writing systems, or innovations > in writing systems may be subject to unfounded claims of copyright may > become more mainstream (think national writing systems, rather than > fan-based ones). Already, the emoji are a good example how, now that the > bulk of living/historic writing systems has been encoded, the "novelties" > come to the forefront. > > Finally, I really can't understand the reluctance to place anything in the > roadmap. An entry in the roadmap is not a commitment to anything - many > scripts listed there face enormous obstacles before they could even reach > the stage of a well-founded proposal. 
And, until such a proposal exists, > there's no formal determination that a script has a truly separate identity > and meets the bar for encoding. > > A./ > > PS: the "real" reason that Klingon was never put in the roadmap (as I > recall discussions in the early years) was not so much the question whether > IP issues existed/could be resolved, but the fear that adding such an > "invented" and "frivolous" script would undermine the acceptance of > Unicode. Given the way Unicode is invested in "frivolous" communication > systems of very recent origin (emoji), that original argument surely > doesn't apply :) > > > > Peter > > > > *From:* Mark E. Shoulson [mailto:mark at kli.org ] > *Sent:* Sunday, November 13, 2016 2:10 PM > *To:* Mark Davis ?? ; Shawn > Steele > *Cc:* Peter Constable ; > David Faulks ; Unicode > Mailing List > *Subject:* Re: The (Klingon) Empire Strikes Back > > > > On 11/10/2016 02:34 PM, Mark Davis ?? wrote: > > The committee doesn't "tentatively approve, pending X". > > > > But the good news is that I think it was the sense of the committee that > the evidence of use for Klingon is now sufficient, and the rest of the > proposal was in good shape (other than the lack of a date), so really only > the IP stands in the way. > > > Fair enough. There have, I think, been other cases of this sort of > informal "tentative approval", usually involving someone from UTC telling > the proposer, "your proposal is okay, but you probably need to change > this..." And that's about the best I could hope for at this point anyway. > So it sounds like (correct me if I'm wrong) there is at least unofficial > recognition that pIqaD *should* be encoded, and that it's mainly an IP > problem now (like with tengwar), and possibly some minor issues that maybe > hadn't been addressed properly in the proposal. > > Can we get pIqaD removed from http://www.unicode.org/ > roadmaps/not-the-roadmap/ then? 
And (dare I ask) perhaps enshrined > someplace in http://www.unicode.org/roadmaps/smp/ pending further > progress with Paramount? > > > I would suggest that the Klingon community work towards getting Paramount > to engage with us, so that any IP issues could be settled. > > > I'll see what we can come up with; have to start somewhere. There is a > VERY good argument to be made that Paramount doesn't actually have the > right to stop the encoding, as you can't copyright an alphabet (as we have > seen), and they don't have a current copyright to "Klingon" in this domain, > etc., and it may eventually come down to these arguments. However, I > recognize that having a good argument on your side, and indeed even having > the law on your side, does not guarantee smooth sailing when the other guys > have a huge well-funded legal department on their side, and thus I > understand UTC's reluctance to move forward without better legal > direction. But at least we can say we've made progress, can't we? > > ~mark > > > > > Mark > > > Mark > > > > On Thu, Nov 10, 2016 at 10:33 AM, Shawn Steele > wrote: > > More generally, does that mean that alphabets with perceived owners will > only be considered for encoding with permission from those owner(s)? What > if the ownership is ambiguous or unclear? > > > > Getting permission may be a lot of work, or cost money, in some cases. > Will applications be considered pending permission, perhaps being > provisionally approved until such permission is received? > > > > Is there specific language that Unicode would require from owners to be > comfortable in these cases? It makes little sense for a submitter to go > through a complex exercise to request permission if Unicode is not > comfortable with the wording of the permission that is garnered. Are there > other such agreements that could perhaps be used as templates? 
> > > > Historically, the message pIqaD supporters have heard from Unicode has > been that pIqaD is a toy script that does not have enough use. The new > proposal attempts to respond to those concerns, particularly since there is > more interest in the script now. Now, additional (valid) concerns are > being raised. > > > > In Mark's case it seems like it would be nice if Unicode could consider > the rest of the proposal and either tentatively approve it pending > Paramount's approval, or to provide feedback as to other defects in the > proposal that would need addressed for consideration. Meanwhile Mark can > figure out how to get Paramount's agreement. > > > > -Shawn > > > > *From:* Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of *Peter > Constable > *Sent:* Wednesday, November 9, 2016 8:49 PM > *To:* Mark E. Shoulson ; David Faulks < > davidj_faulks at yahoo.ca> > *Cc:* Unicode Mailing List > *Subject:* RE: The (Klingon) Empire Strikes Back > > > > *From:* Unicode [mailto:unicode-bounces at unicode.org > ] *On Behalf Of *Mark E. Shoulson > *Sent:* Friday, November 4, 2016 1:18 PM > > > At any rate, this isn't Unicode's problem? > > > > You saying that potential IP issues are not Unicode's problem does not in > fact make it not a problem. A statement in writing from authorized > Paramount representatives stating it would not be a problem for either > Unicode, its members or implementers of Unicode would make it not a problem > for Unicode. > > > > > > > > Peter > > > > > > > From mark at kli.org Tue Nov 15 18:39:36 2016 From: mark at kli.org (Mark E.
Shoulson) Date: Tue, 15 Nov 2016 19:39:36 -0500 Subject: The (Klingon) Empire Strikes Back In-Reply-To: References: <42101413.334282.1478281304520.ref@mail.yahoo.com> <42101413.334282.1478281304520@mail.yahoo.com> <27f48fcf-9ebc-8363-3b27-6540a242d375@kli.org> <59a89be5-1359-f5b7-905f-108c22c6e189@kli.org> Message-ID: On 11/15/2016 12:22 PM, Peter Constable wrote: > > Klingon _/should not/_ be encoded so long as there are open IP issues. > For that reason, I think it would be premature to place it in the roadmap. > Then why is tengwar there, and Klingon proclaimed "unsuitable" for encoding? Everyone's telling me the situation is the same with tengwar, and yet it isn't. What is it about Tolkien scripts that makes them suitable and pIqaD not? Artistic interest doesn't count. I'm not trying to get tengwar/cirth *demoted*, but I would like someone to explain to me why some fandoms/scripts seem to be better than others. ~mark From everson at evertype.com Tue Nov 15 18:47:32 2016 From: everson at evertype.com (Michael Everson) Date: Wed, 16 Nov 2016 00:47:32 +0000 Subject: The (Klingon) Empire Strikes Back In-Reply-To: References: <42101413.334282.1478281304520.ref@mail.yahoo.com> <42101413.334282.1478281304520@mail.yahoo.com> <27f48fcf-9ebc-8363-3b27-6540a242d375@kli.org> <59a89be5-1359-f5b7-905f-108c22c6e189@kli.org> Message-ID: <54D68C57-FB87-46D8-A822-3A1848CDD611@evertype.com> A body of a particular kind of scholarship surrounds Tolkien's oeuvre. That's probably the reason. Michael Everson From mark at kli.org Tue Nov 15 19:04:09 2016 From: mark at kli.org (Mark E.
Shoulson) Date: Tue, 15 Nov 2016 20:04:09 -0500 Subject: The (Klingon) Empire Strikes Back In-Reply-To: <5b602cd1-53c8-181a-5ca4-0470ce36b92e@ix.netcom.com> References: <42101413.334282.1478281304520.ref@mail.yahoo.com> <42101413.334282.1478281304520@mail.yahoo.com> <27f48fcf-9ebc-8363-3b27-6540a242d375@kli.org> <59a89be5-1359-f5b7-905f-108c22c6e189@kli.org> <5b602cd1-53c8-181a-5ca4-0470ce36b92e@ix.netcom.com> Message-ID: <33c1ba77-2ba0-dd62-327f-edd32b3efa23@kli.org> On 11/15/2016 01:21 PM, Asmus Freytag wrote: > On 11/15/2016 9:22 AM, Peter Constable wrote: >> >> Klingon _/should not/_ be encoded so long as there are open IP >> issues. For that reason, I think it would be premature to place it in >> the roadmap. >> > Peter, > > I certainly sympathize with the fact that the Consortium wants to > avoid being drawn into litigation, and that even litigation based on > unsustained IP claims could be costly. > > However, it appears relatively settled that one cannot claim copyright > in an alphabet; one of the roles of the Unicode Consortium in this > regard would be to reach a formal decision whether this is, in fact, > an alphabet/script (and one that, based on the usual criteria of > usage) is acceptable for encoding. > > Ducking this particular determination serves no-one. Thanks, Asmus. I can understand the UTC's caution: you don't want to open yourself up to litigation -- even if you eventually win. But this also is likely not going to be the first time that there is this kind of legal hold on something encodable. I note that Blissymbolics, according to Wikipedia, *does* have a copyright (as opposed to "maybe they might think they do") and yet it, too, is roadmapped. If I didn't know better (and I don't), I might think there was some sort of bias against Klingon. > Finally, I really can't understand the reluctance to place anything in > the roadmap.
An entry in the roadmap is not a commitment to anything - > many scripts listed there face enormous obstacles before they could > even reach the stage of a well-founded proposal. And, until such a > proposal exists, there's no formal determination that a script has a > truly separate identity and meets the bar for encoding. NOT being called out for being unencodable would be a step up for Klingon, at least, let alone the roadmap. > PS: the "real" reason that Klingon was never put in the roadmap (as I > recall discussions in the early years) was not so much the question > whether IP issues existed/could be resolved, but the fear that adding > such an "invented" and "frivolous" script would undermine the > acceptance of Unicode. Given the way Unicode is invested in > "frivolous" communication systems of very recent origin (emoji), that > original argument surely doesn't apply :) Yes, of course, though it's nice to have someone say it out loud. You do of course realize that that sentiment is *precisely* as offensive as "Unicode shouldn't encode African scripts, because only darkies use them anyway, and we wouldn't want to be seen as supporting *those* people." Bigotry is bigotry, even when applied to fans. Essentially, the claim is "we shouldn't encode those, not because nobody uses them, but because nobody *important* uses them." I was talking to someone once about Unicode, and explained that they were responsible for encoding emoji, etc. And he scoffed at that, "why encode those? who uses those anyway?" I said, "Millions of people around the world use them every day in tweets and instant messages..." "Yeah, but I mean, aside from that!" The question is, who out there who is *important* is using them for *important* things. And if the UTC has to get in the business of judging what qualifies as "important" communication, you're going to need a lot more members, just to go through everything being printed. (Why encode chess pieces? 
Only chess nerds use them, and I don't care about chess. Go piece signs? Nobody *I* talk to uses those. And don't even get me started on pictures of baseballs. And only goyim would need a picture of a breaded shrimp...) It's refreshing to hear it finally admitted in full. I always felt that if people are going to act unfairly, they should at least say "yes, we're acting unfairly, because you don't deserve fairness." Then they can explain why fairness is undeserved. ~mark From kenwhistler at att.net Tue Nov 15 19:15:58 2016 From: kenwhistler at att.net (Ken Whistler) Date: Tue, 15 Nov 2016 17:15:58 -0800 Subject: The (Klingon) Empire Strikes Back In-Reply-To: <5b602cd1-53c8-181a-5ca4-0470ce36b92e@ix.netcom.com> References: <42101413.334282.1478281304520.ref@mail.yahoo.com> <42101413.334282.1478281304520@mail.yahoo.com> <27f48fcf-9ebc-8363-3b27-6540a242d375@kli.org> <59a89be5-1359-f5b7-905f-108c22c6e189@kli.org> <5b602cd1-53c8-181a-5ca4-0470ce36b92e@ix.netcom.com> Message-ID: <90d16ff4-eef9-28df-3d9a-51a8011339ce@att.net> On 11/15/2016 10:21 AM, Asmus Freytag wrote: > Finally, I really can't understand the reluctance to place anything in > the roadmap. An entry in the roadmap is not a commitment to anything - > many scripts listed there face enormous obstacles before they could > even reach the stage of a well-founded proposal. And, until such a > proposal exists, there's no formal determination that a script has a > truly separate identity and meets the bar for encoding. The barrier to putting it in the roadmap is that pIQaD is currently listed on *not*-the-roadmap: http://www.unicode.org/roadmaps/not-the-roadmap/ as Mark Shoulson has been repeatedly pointing out. It would be inconsistent to add it to the SMP roadmap unless we delete it from not-the-roadmap. And the reason that step has been stuck is because the UTC is still on record with a nonapproval notice for the Klingon script from 2001. (Based on Consensus 87-M3.)
http://www.unicode.org/alloc/nonapprovals.html So figure it out, folks. First bring to the UTC a proposal to reverse 87-M3. (Not to *encode* pIQaD yet -- just, on the basis of the new, more mature proposal, to *entertain* appropriate discussion about suitability for encoding, by rescinding the prior determination of nonapproval.) If *that* proposal passed, then the nonapproval notice would also be dropped. If the nonapproval notice is dropped, the not-the-roadmap entry would be dropped. And if that is dropped, then the Roadmap committee would dig around for a tentative allocation slot, pending the determination of outcome for any other issues. Which then could focus on the next obstacle, which is IP and the unresolved risk of litigation. In any case, folks should stop with the "Unfair! Unfair!" stuff, and just set to work, step-by-step, to deal with the items noted above. "A Klingon is trained to use everything around them to their advantage." O.k., I've just provided something useful -- go for it. And you won't even need a cloaking device. --Ken From mark at kli.org Tue Nov 15 19:19:23 2016 From: mark at kli.org (Mark E. Shoulson) Date: Tue, 15 Nov 2016 20:19:23 -0500 Subject: The (Klingon) Empire Strikes Back In-Reply-To: References: <42101413.334282.1478281304520.ref@mail.yahoo.com> <42101413.334282.1478281304520@mail.yahoo.com> <27f48fcf-9ebc-8363-3b27-6540a242d375@kli.org> <59a89be5-1359-f5b7-905f-108c22c6e189@kli.org> <5b602cd1-53c8-181a-5ca4-0470ce36b92e@ix.netcom.com> Message-ID: On 11/15/2016 07:31 PM, Mark Davis wrote: > > However, it appears relatively settled that one cannot claim > copyright in an alphabet... > > We know that these parties tend to be litigious, so we have to be > careful. "relatively settled" is not good enough. > > We do not want to be the ones responsible (and liable) for making a > determination as to whether that is settled. Nor do we want to pay the > legal fees necessary to make a water-tight determination.
> > That is why if there is any question as to the IP issues, we leave it > up to the proposers to get absolutely rock-solid clearance (eg from > the Tolkien estate for Tengwar, or from Paramount for Klingon). The > only other alternative I can think of is if the proposers provide > indemnification for any legal costs that could obtain from a legal > suit of us or our vendors. > > Mark > // How about legal counsel on the matter? We're a little hesitant of asking Paramount/CBS about this, because of course, asking means that we think maybe they can say no, and we don't want to imply that. So I'm thinking/hoping maybe we can do some research by a qualified legal expert (and not us armchair-lawyers, "yeah, it looks pretty settled to me...") to make a determination. I'm trying to find out some more information about the KLI's pIqaD font, which it has been using and distributing for decades, during some of which time it was licensed by Paramount, and which apparently was *not* covered in the licensing agreements -- precisely because typefaces are *not* copyrightable in the US! (I thought they were, though... like I said, I'm trying to find out more about this.) And all that time without objection from Paramount. Not a slam-dunk argument, but it's something. ~mark From mark at kli.org Tue Nov 15 19:22:36 2016 From: mark at kli.org (Mark E. Shoulson) Date: Tue, 15 Nov 2016 20:22:36 -0500 Subject: The (Klingon) Empire Strikes Back In-Reply-To: <54D68C57-FB87-46D8-A822-3A1848CDD611@evertype.com> References: <42101413.334282.1478281304520.ref@mail.yahoo.com> <42101413.334282.1478281304520@mail.yahoo.com> <27f48fcf-9ebc-8363-3b27-6540a242d375@kli.org> <59a89be5-1359-f5b7-905f-108c22c6e189@kli.org> <54D68C57-FB87-46D8-A822-3A1848CDD611@evertype.com> Message-ID: On 11/15/2016 07:47 PM, Michael Everson wrote: > A body of a particular kind of scholarship surrounds Tolkien's oeuvre. That's probably the reason. > > Michael Everson Ah.
So it *is* a matter of "some literature is better than others." I repeat here all the stuff I said in my response to Asmus' letter. Since when did Unicode get in the business of deciding whose literature was important and whose wasn't? And what do they base their decisions on? How much Klingon correspondence and conversation did the UTC sift through in order to reach its learned conclusion that Klingon-speakers don't do anything "scholarly"? Do you guys even hear how ridiculously bigoted this all sounds? ~mark From Shawn.Steele at microsoft.com Tue Nov 15 19:26:16 2016 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Wed, 16 Nov 2016 01:26:16 +0000 Subject: The (Klingon) Empire Strikes Back In-Reply-To: References: <42101413.334282.1478281304520.ref@mail.yahoo.com> <42101413.334282.1478281304520@mail.yahoo.com> <27f48fcf-9ebc-8363-3b27-6540a242d375@kli.org> <59a89be5-1359-f5b7-905f-108c22c6e189@kli.org> <5b602cd1-53c8-181a-5ca4-0470ce36b92e@ix.netcom.com> Message-ID: As I understand the issue, the problem is less of whether or not it is legal, than whether or not Paramount might sue. Whether Unicode wins or not, it would still cost money to defend. I was wondering like Mark Davis mentioned if there were some sort of companies that sold bonds for this kind of thing (though that might be out of KLI's budget.) Being afraid of a no answer probably isn't going to inspire confidence. But maybe you could do a combination of the above. Get someone to give you a legal opinion and then present that to Paramount with a "hey, they said this was probably legal anyway, but we wanted to ask nicely to be sure." -Shawn -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Mark E. Shoulson Sent: Tuesday, 15 November 2016 5:19 PM To: unicode at unicode.org Subject: Re: The (Klingon) Empire Strikes Back On 11/15/2016 07:31 PM, Mark Davis
wrote: > > However, it appears relatively settled that one cannot claim > copyright in an alphabet... > > We know that these parties tend to be litigious, so we have to be > careful. "relatively settled" is not good enough. > > We do not want to be the ones responsible (and liable) for making a > determination as to whether that is settled. Nor do we want to pay the > legal fees necessary to make a water-tight determination. > > That is why if there is any question as to the IP issues, we leave it > up to the proposers to get absolutely rock-solid clearance (eg from > the Tolkien estate for Tengwar, or from Paramount for Klingon). The > only other alternative I can think of is if the proposers provide > indemnification for any legal costs that could obtain from a legal > suit of us or our vendors. > > Mark > // How about legal counsel on the matter? We're a little hesitant of asking Paramount/CBS about this, because of course, asking means that we think maybe they can say no, and we don't want to imply that. So I'm thinking/hoping maybe we can do some research by a qualified legal expert (and not us armchair-lawyers, "yeah, it looks pretty settled to me...") to make a determination. I'm trying to find out some more information about the KLI's pIqaD font, which it has been using and distributing for decades, during some of which time it was licensed by Paramount, and which apparently was *not* covered in the licensing agreements -- precisely because typefaces are *not* copyrightable in the US! (I thought they were, though... like I said, I'm trying to find out more about this.) And all that time without objection from Paramount. Not a slam-dunk argument, but it's something.
~mark From everson at evertype.com Tue Nov 15 19:29:14 2016 From: everson at evertype.com (Michael Everson) Date: Wed, 16 Nov 2016 01:29:14 +0000 Subject: The (Klingon) Empire Strikes Back In-Reply-To: References: <42101413.334282.1478281304520.ref@mail.yahoo.com> <42101413.334282.1478281304520@mail.yahoo.com> <27f48fcf-9ebc-8363-3b27-6540a242d375@kli.org> <59a89be5-1359-f5b7-905f-108c22c6e189@kli.org> <54D68C57-FB87-46D8-A822-3A1848CDD611@evertype.com> Message-ID: Mark, No need to be defensive. Tengwar and Cirth are in there because *I* put them there *long ago*, and the argument made was the nature of Tolkien's work and study of it. That remains valid for keeping them there, for one day the Tolkien Estate may revise its view on the matter. Maybe a version of the Roadmap had Klingon in it. I don't recall. I'd've been the one to have put it there. There are records. It doesn't matter, though. When lack of use made UTC remove Klingon from consideration, it would have been removed. The Roadmaps are really of no consequence. They're useful, but they have no status and are subject to any kind of change before ballotting ends. Michael > On 16 Nov 2016, at 01:22, Mark E. Shoulson wrote: > > On 11/15/2016 07:47 PM, Michael Everson wrote: >> A body of a particular kind of scholarship surrounds Tolkien's oeuvre. That's probably the reason. >> >> Michael Everson > > Ah. So it *is* a matter of "some literature is better than others." I repeat here all the stuff I said in my response to Asmus' letter. Since when did Unicode get in the business of deciding whose literature was important and whose wasn't? And what do they base their decisions on? How much Klingon correspondence and conversation did the UTC sift through in order to reach its learned conclusion that Klingon-speakers don't do anything "scholarly"? > > Do you guys even hear how ridiculously bigoted this all sounds? > > ~mark > From mark at kli.org Tue Nov 15 19:31:21 2016 From: mark at kli.org (Mark E.
Shoulson) Date: Tue, 15 Nov 2016 20:31:21 -0500 Subject: The (Klingon) Empire Strikes Back In-Reply-To: <90d16ff4-eef9-28df-3d9a-51a8011339ce@att.net> References: <42101413.334282.1478281304520.ref@mail.yahoo.com> <42101413.334282.1478281304520@mail.yahoo.com> <27f48fcf-9ebc-8363-3b27-6540a242d375@kli.org> <59a89be5-1359-f5b7-905f-108c22c6e189@kli.org> <5b602cd1-53c8-181a-5ca4-0470ce36b92e@ix.netcom.com> <90d16ff4-eef9-28df-3d9a-51a8011339ce@att.net> Message-ID: On 11/15/2016 08:15 PM, Ken Whistler wrote: > > On 11/15/2016 10:21 AM, Asmus Freytag wrote: >> Finally, I really can't understand the reluctance to place anything >> in the roadmap. An entry in the roadmap is not a commitment to >> anything - many scripts listed there face enormous obstacles before >> they could even reach the stage of a well-founded proposal. And, >> until such a proposal exists, there's no formal determination that a >> script has a truly separate identity and meets the bar for encoding. > > The barrier to putting it in the roadmap is that pIQaD is > currently listed on *not*-the-roadmap: > > http://www.unicode.org/roadmaps/not-the-roadmap/ > > as Mark Shoulson has been repeatedly pointing out. > > It would be inconsistent to add it to the SMP roadmap unless we delete > it from not-the-roadmap. > > And the reason that step has been stuck is because the UTC is still on > record with a nonapproval notice for the Klingon script from 2001. > (Based on Consensus 87-M3.) > > http://www.unicode.org/alloc/nonapprovals.html > > So figure it out, folks. First bring to the UTC a proposal to reverse > 87-M3. (Not to *encode* pIQaD yet -- just, on the basis of the new, > more mature proposal, to *entertain* appropriate discussion about > suitability for encoding, by rescinding the prior determination of > nonapproval.) If *that* proposal passed, then the nonapproval notice > would also be dropped. If the nonapproval notice is dropped, the > not-the-roadmap entry would be dropped.
And if that is dropped, then > the Roadmap committee would dig around for a tentative allocation > slot, pending the determination of outcome for any other issues. Which > then could focus on the next obstacle, which is IP and the unresolved > risk of litigation. So.... now the problem *isn't* the IP. All along I've been saying that UTC needs to decide that pIqaD *should* be encoded first, without consideration of the IP issues, and *then* we can worry about dealing with the IP. And the answers I got were all about how we can't do *anything* until this IP stuff is dealt with. And now Ken Whistler comes and says what I said in the first place! At least someone was paying attention. So... Now it's not enough to propose that pIqaD get encoded, like any other script would need. First we need a proposal to *permit* a proposal for encoding? Um. OK. What should such a thing look like? Perhaps something like the document I submitted, showing lots of usage and asking if it could be considered now? I originally wasn't going to append the full proposal to the document, but it was suggested to me that it would be expected. Should I split the document up into two pieces and re-submit the two halves, one as a proposal, and one for permission to consider the proposal? Would that satisfy the requirements? > In any case, folks should stop with the "Unfair! Unfair!" stuff, and > just set to work, step-by-step, to deal with the items noted above. "A > Klingon is trained to use everything around them to their advantage." > O.k., I've just provided something useful -- go for it. And you won't > even need a cloaking device. I've been working with whatever I could find all along. The unfairness is a recognized fact, apparently, that can finally be faced and fixed, or so I hope. I'm trying to get this done; best I can do is answer the questions put to me and look how other scripts in similar situations (like Tolkien scripts) have done what they did.
~mark From mark at kli.org Tue Nov 15 19:41:02 2016 From: mark at kli.org (Mark E. Shoulson) Date: Tue, 15 Nov 2016 20:41:02 -0500 Subject: The (Klingon) Empire Strikes Back In-Reply-To: References: <42101413.334282.1478281304520.ref@mail.yahoo.com> <42101413.334282.1478281304520@mail.yahoo.com> <27f48fcf-9ebc-8363-3b27-6540a242d375@kli.org> <59a89be5-1359-f5b7-905f-108c22c6e189@kli.org> <5b602cd1-53c8-181a-5ca4-0470ce36b92e@ix.netcom.com> Message-ID: <713bb56b-1e37-1c16-d5e6-72828032bf28@kli.org> On 11/15/2016 08:26 PM, Shawn Steele wrote: > As I understand the issue, the problem is less of whether or not it is legal, then whether or not Paramount might sue. Whether Unicode wins or not, it would still cost money to defend. There ought to be laws against suits brought just to intimidate. I think there are. But yes, they aren't easy to prove or enforce. > I was wondering like Mark Davis mentioned if there were some sort of companies that sold bonds for this kind of thing (though that might be out of KLI's budget.) > > Being afraid of a no answer probably isn't going to inspire confidence. But maybe you could do a combination of the above. Get someone to give you a legal opinion and then present that to Paramount with a "hey, they said this was probably legal anyway, but we wanted to ask nicely to be sure." Not so much "afraid" of a no answer, but would rather not give the sense that we even thought that one was an option. And for a company that makes its living from IP, they usually don't even have to bother listening to the whole question: "Say, can we use your?" "No!" (This is probably also partly due to the way the laws are structured). Your idea is a good one, though. Get a legal opinion and maybe *inform* Paramount of it, and ask if they'd like to be involved in sanctioning it. If spun right, it could even be sold as offering them the opportunity to get in on this, magnanimously offering them the privilege of giving their blessing... 
~mark From mark at kli.org Tue Nov 15 19:47:42 2016 From: mark at kli.org (Mark E. Shoulson) Date: Tue, 15 Nov 2016 20:47:42 -0500 Subject: The (Klingon) Empire Strikes Back In-Reply-To: References: <42101413.334282.1478281304520.ref@mail.yahoo.com> <42101413.334282.1478281304520@mail.yahoo.com> <27f48fcf-9ebc-8363-3b27-6540a242d375@kli.org> <59a89be5-1359-f5b7-905f-108c22c6e189@kli.org> <54D68C57-FB87-46D8-A822-3A1848CDD611@evertype.com> Message-ID: <0ca0b95c-fa07-daae-7f18-23e7b66915eb@kli.org> On 11/15/2016 08:29 PM, Michael Everson wrote: > Mark, > > No need to be defensive. > > Tengwar and Cirth are in there because *I* put them there *long ago*, and the argument made was the nature of Tolkien's work and study of it. That remains valid for keeping them there, for one day the Tolkien Estate may revise its view on the matter. > > Maybe a version of the Roadmap had Klingon in it. I don't recall. I'd've been the one to have put it there. There are records. It doesn't matter, though. When lack of use made UTC remove Klingon from consideration, it would have been removed. The defensiveness was not that Tolkienian scholarship was deemed "worthy", but more that Klingon's apparently was not. There was a Roadmap with pIqaD on it, and indeed you were the one who put it there. Nick Nicholas, in https://web.archive.org/web/20120307231609fw_/http://www.tlg.uci.edu/~opoudjis/Klingon/piqad.html credits you with a "delightful move of defiance" for replacing pIqaD with Sarati when it was removed. > The Roadmaps are really of no consequence. They're useful, but they have no status and are subject to any kind of change before ballotting ends. Getting pIqaD off the "not-roadmapped" list is more important, both symbolically and, as Ken Whistler says, practically.
~mark From everson at evertype.com Tue Nov 15 20:18:49 2016 From: everson at evertype.com (Michael Everson) Date: Wed, 16 Nov 2016 02:18:49 +0000 Subject: The (Klingon) Empire Strikes Back In-Reply-To: <0ca0b95c-fa07-daae-7f18-23e7b66915eb@kli.org> References: <42101413.334282.1478281304520.ref@mail.yahoo.com> <42101413.334282.1478281304520@mail.yahoo.com> <27f48fcf-9ebc-8363-3b27-6540a242d375@kli.org> <59a89be5-1359-f5b7-905f-108c22c6e189@kli.org> <54D68C57-FB87-46D8-A822-3A1848CDD611@evertype.com> <0ca0b95c-fa07-daae-7f18-23e7b66915eb@kli.org> Message-ID: <82CC2935-D61A-4956-BC75-79DC578E0871@evertype.com> On 16 Nov 2016, at 01:47, Mark E. Shoulson wrote: > > The defensiveness was not that Tolkienian scholarship was deemed "worthy", but more that Klingon's apparently was not. Back in the day? No. It wasn't. > There was a Roadmap with pIqaD on it, and indeed you were the one who put it there. Nick Nicholas, in https://web.archive.org/web/20120307231609fw_/http://www.tlg.uci.edu/~opoudjis/Klingon/piqad.html credits you with a "delightful move of defiance" for replacing pIqaD with Sarati when it was removed. That would be me. >> The Roadmaps are really of no consequence. They're useful, but they have no status and are subject to any kind of change before ballotting ends. > > Getting pIqaD off the "not-roadmapped" list is more important, both symbolically and, as Ken Whistler says, practically. Ha' ruch. Michael From everson at evertype.com Tue Nov 15 21:57:23 2016 From: everson at evertype.com (Michael Everson) Date: Wed, 16 Nov 2016 03:57:23 +0000 Subject: The (Klingon) Empire Strikes Back In-Reply-To: <01275881-d53b-269d-fde9-330e7d94be37@kli.org> References: <01275881-d53b-269d-fde9-330e7d94be37@kli.org> Message-ID: <01C190B4-7440-4085-B723-AC7EED0444AF@evertype.com> On 3 Nov 2016, at 23:43, Mark Shoulson wrote: > Michael Everson: I basically copied your 1997 proposal into the document, with some minor changes. I hope you don't mind. I do not.
> And if you don't want to be on the hook for providing the glyphs to UTC, I can do that. I think that proposal should serve as a starting-point for discussion anyway. I'm in. > 1. the "SYMBOL FOR EMPIRE" also known as the "MUMMIFICATION GLYPH". I don't know where the second name comes from, I don't know how important it is to encode it, and I don't know how much of a trademark headache it will cause with Paramount, as it is used pretty heavily in their imagery. Something we'll have to talk about. I'd leave it out for now. > 2. I put in the COMMA and FULL STOP, which were not in the original proposal but were in the ConScript registry entry. Yes, those have been adopted since 1997. > The examples I have show them clearly being used. UTC may decide to unify them with existing triangular shapes, which may or may not be a good idea. As they are punctuation, I think it unlikely. > 3. For my part, I've invented a pair of ampersands for Klingon (Klingon has two words for "and": one for joining verbs/sentences and one for joining nouns (the former goes between its "conjunctands", the latter after them)), from ligatures of the letters in question. They pretty much have NO usage, of course (and are not in the proposal), but maybe they should be presented to the community. That's up to you. Adoption is a matter for the user community. > Let the bickering begin! may' malujpu'. veS maQap. Michael Everson From petercon at microsoft.com Thu Nov 17 17:10:41 2016 From: petercon at microsoft.com (Peter Constable) Date: Thu, 17 Nov 2016 23:10:41 +0000 Subject: "Oh that's what you meant!: reducing emoji misunderstanding" Message-ID: Somewhat interesting: a paper from a conference in Italy a couple of months ago: http://discovery.dundee.ac.uk/portal/en/research/oh-thats-what-you-meant(20b8923c-28da-49ed-bc78-fcc741db3187).html I anticipated old news about misunderstanding based on presentation differences on the level of water gun vs. etc.
But it focuses on subtleties in emotional reactions that different users associate with different smileys. E.g., how does U+1F624 😤 compare with U+1F62C 😬? A given user may perceive the two differently, and for either one a given user's perception may differ when evaluating the depiction used in one app/platform versus another. They suggest that, if users gave a characterization of reactions to different emoji on a given platform (e.g., degree of emotion, how positive or negative) then an automated system could translate one user's message to display an emoji to a second user that more closely reflects the emotion intended by the first user. Peter From doug at ewellic.org Thu Nov 17 17:31:34 2016 From: doug at ewellic.org (Doug Ewell) Date: Thu, 17 Nov 2016 16:31:34 -0700 Subject: "Oh that's what you meant!: reducing emoji misunderstanding" Message-ID: <20161117163134.665a7a7059d7ee80bb4d670165c8327d.e5ed1c79b6.wbe@email03.godaddy.com> Peter Constable wrote: > E.g., how does U+1F624 😤 compare with U+1F62C 😬? A given user may > perceive the two differently, and for either one a given user's > perception may differ when evaluating the depiction used in one app/ > platform versus another. They suggest that, if users gave a > characterization of reactions to different emoji on a given platform > (e.g., degree of emotion, how positive or negative) then an automated > system could translate one user's message to display an emoji to a > second user that more closely reflects the emotion intended by the > first user. Or, people could just say what they mean, using language.
--
Doug Ewell | Thornton, CO, US | ewellic.org

From verdy_p at wanadoo.fr Thu Nov 17 21:46:07 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 18 Nov 2016 04:46:07 +0100
Subject: The (Klingon) Empire Strikes Back
In-Reply-To:
References: <42101413.334282.1478281304520.ref@mail.yahoo.com> <42101413.334282.1478281304520@mail.yahoo.com> <27f48fcf-9ebc-8363-3b27-6540a242d375@kli.org> <59a89be5-1359-f5b7-905f-108c22c6e189@kli.org> <5b602cd1-53c8-181a-5ca4-0470ce36b92e@ix.netcom.com>
Message-ID:

Fonts, even when they are not copyrightable, are still patentable. The complexity of IP rights is growing, and so is their scope of application (sometimes with retroactive effect, including on the "public domain"). I would not bet anything on a past decision by a US court, and anyway we're not building just a US standard but an international one: even if the Unicode Consortium or ISO would not be liable for infringements or subject to IP claims in the US, this does not mean that there won't be claims elsewhere, if the standards bodies cannot themselves assert their own IP rights (which are what allow them to license the standards "for free" to anyone in the world).

In this complex world, all that can be done is to have a fair procedure for litigation, and to get some security by allowing enough time for such claims, after which a local (but applicable) law enforcement body will be able to decide that the claims come too late to be valid (in the IP world, such time limits for "too late" claims can be extremely long: up to 70 years or more for claims by individuals, or 10 years for tangible property and appropriation of the public domain or of someone else's private domain). On the Internet this fair system is known as the "UDRP" procedure (which also applies to claims for domain names).

But once this time is exhausted and IP rights are no longer exclusive, someone else could build a new claim (e.g.
by registering new patents against what should be the public domain, and it is then costly to counter these attacks, which are too common with patents and trademarks). And when there's uncertainty about the preservation of the public domain or the legitimate use of it, some countries prefer redefining the time limits (including retroactively, for example Russia): they can do that with national laws unless they are bound by international treaties. This occurred notably before WIPO became a mostly worldwide body enforcing the applicability or non-applicability of IP rights in more than just one country.

But WIPO is now concerned with new kinds of rights. Historically there was the patent system (derived from industrial rights and artistic rights), then the copyright system; now there's the new database IP system, and the moral rights of natural persons are starting to be extended to legal entities...

In fact, with these constant extensions, I am not sure that existing publications of the standard are not now partly covered by new claims that we have not officially opposed in due time.

This means that this goes beyond the single case of Klingon. We know that historic human language is now being appropriated (notably by trademarks). In fact, all existing standards are concerned.

2016-11-16 2:26 GMT+01:00 Shawn Steele :

> As I understand the issue, the problem is less of whether or not it is
> legal, than whether or not Paramount might sue. Whether Unicode wins or
> not, it would still cost money to defend.
>
> I was wondering like Mark Davis mentioned if there were some sort of
> companies that sold bonds for this kind of thing (though that might be out
> of KLI's budget.)
>
> Being afraid of a no answer probably isn't going to inspire confidence.
> But maybe you could do a combination of the above.
Get someone to give you > a legal opinion and then present that to Paramount with a "hey, they said > this was probably legal anyway, but we wanted to ask nicely to be sure." > > -Shawn > > -----Original Message----- > From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Mark E. > Shoulson > Sent: Tuesday, 15 November 2016 5:19 PM > To: unicode at unicode.org > Subject: Re: The (Klingon) Empire Strikes Back > > On 11/15/2016 07:31 PM, Mark Davis ?? wrote: > > > However, it appears relatively settled that one cannot claim > > copyright in an alphabet... > > > > We know that these parties tend to be litigious, so we have to be > > careful. "relatively settled" is not good enough. > > > > We do not want to be the ones responsible (and liable) for making a > > determination as to whether that is settled. Nor do we want to pay the > > legal fees necessary to make a water-tight determination. > > > > That is why if there is any question as to the IP issues, we leave it > > up to the proposers to get absolutely rock-solid clearance (eg from > > the Tolkien estate for Tengwar, or from Paramount for Klingon). The > > only other alternative I can think of is if the proposers provide > > indemnification for any legal costs that could obtain from a legal > > suit of us or our vendors. > > > > Mark > > // > > How about legal counsel on the matter? > > We're a little hesitant of asking Paramount/CBS about this, because of > course, asking means that we think maybe they can say no, and we don't want > to imply that. So I'm thinking/hoping maybe we can do some research by a > qualified legal expert (and not us armchair-lawyers, "yeah, it looks pretty > settled to me...") to make a determination. 
> > I'm trying to find out some more information about the KLI's pIqaD font, > which it has been using and distributing for decades, during some of which > time it was licensed by Paramount, and which apparently was *not* covered > in the licensing agreements?precisely because typefaces are > *not* copyrightable in the US! (I thought they were, though... like I > said, I'm trying to find out more about this.) And all that time without > objection from Paramount. Not a slam-dunk argument, but it's something. > > ~mark > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jameskasskrv at gmail.com Thu Nov 17 21:55:01 2016 From: jameskasskrv at gmail.com (James Kass) Date: Thu, 17 Nov 2016 19:55:01 -0800 Subject: "Oh that's what you meant!: reducing emoji misunderstanding" In-Reply-To: <20161117163134.665a7a7059d7ee80bb4d670165c8327d.e5ed1c79b6.wbe@email03.godaddy.com> References: <20161117163134.665a7a7059d7ee80bb4d670165c8327d.e5ed1c79b6.wbe@email03.godaddy.com> Message-ID: Doug Ewell responded to Peter Constable, >> then an automated system could translate one user?s message to >> display an emoji to a second user that more closely reflects >> the emotion intended by the first user. > > Or, people could just say what they mean, using language. How about some kind of automated system for translating icons into words? >> E.g., how does U+1F624 ???? compare with U+1F62C ????? They display identically in Notepad using Lucida Console, but I'm OK with that. So if anyone seeks an easy method for translating emoji characters into meaningless little rectangles, there you go! 
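A rough sketch of the "icons into words" idea, offered only as an illustration. It assumes that the formal Unicode character name is an acceptable stand-in for a word, which is exactly what this thread disputes. The function name describe and the crude test on general category "So" (Symbol, other) are assumptions made for the example; they are not a standard way of detecting emoji.

```python
import unicodedata


def describe(text: str) -> str:
    """Replace every 'Symbol, other' character with its Unicode name.

    Assumption: general category "So" is used as a crude stand-in for
    "is an emoji". It also catches non-emoji symbols such as U+00A9.
    """
    out = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            # Fall back to the code point if this Python build has no name.
            name = unicodedata.name(ch, f"U+{ord(ch):04X}")
            out.append(f"[{name}]")
        else:
            out.append(ch)
    return "".join(out)


# The two characters discussed in the thread: U+1F624 and U+1F62C.
print(describe("\U0001F624 vs \U0001F62C"))
```

The names that come out are identifiers chosen for the standard, not descriptions of what a sender meant, so this does not answer the objection raised above; it only shows that the mechanical part is easy.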
Best regards,

James Kass

From verdy_p at wanadoo.fr Thu Nov 17 22:27:25 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 18 Nov 2016 05:27:25 +0100
Subject: "Oh that's what you meant!: reducing emoji misunderstanding"
In-Reply-To:
References: <20161117163134.665a7a7059d7ee80bb4d670165c8327d.e5ed1c79b6.wbe@email03.godaddy.com>
Message-ID:

Such systems have long existed in various forums and chats: you write a word between colons and get the emoji, without having to select it from a list or to remember its code point and use complex input. There is also a way to reverse this conversion if needed.

The conversion of ":colon-bracketed-words:" to emojis has frequent false positives, notably with punctuation: I have regularly seen false conversions of "-)" or similar into undesired emojis.

There's no evident and universal way to convert emojis to natural language, and you will sometimes collide with non-emoji meanings as well. I've seen some forums take programming code (properly tagged as such using surrounding markup such as ... or ... or ...) and replace it with nonsense emojis.

The same could happen in the reverse direction (even if you surround the ":word:" with additional spaces). Even if you choose some keywords or markup such as "smiley" instead of " :-) " or " :smiley: ", you may break tabular data (using ":" as column separators).

2016-11-18 4:55 GMT+01:00 James Kass :

> Doug Ewell responded to Peter Constable,
>
> >> then an automated system could translate one user's message to
> >> display an emoji to a second user that more closely reflects
> >> the emotion intended by the first user.
> >
> > Or, people could just say what they mean, using language.
>
> How about some kind of automated system for translating icons into words?
>
> >> E.g., how does U+1F624 😤 compare with U+1F62C 😬?
>
> They display identically in Notepad using Lucida Console, but I'm OK
> with that. So if anyone seeks an easy method for translating emoji
> characters into meaningless little rectangles, there you go!
>
> Best regards,
>
> James Kass
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jameskasskrv at gmail.com Fri Nov 18 00:06:32 2016
From: jameskasskrv at gmail.com (James Kass)
Date: Thu, 17 Nov 2016 22:06:32 -0800
Subject: "Oh that's what you meant!: reducing emoji misunderstanding"
In-Reply-To:
References: <20161117163134.665a7a7059d7ee80bb4d670165c8327d.e5ed1c79b6.wbe@email03.godaddy.com>
Message-ID:

Philippe Verdy wrote,

> There's no evident and universal way to convert
> emojis to natural language ...

Indeed. Emoji characters apparently mean whatever their users want them to mean. Such meanings may be perceived differently by various users or communities, as the subject line indicates, and these meanings are subject to change without notice. Any effort to standardize such a conversion seems doomed, but someone with funding would probably try it anyway.

Best regards,

James Kass
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From christoph.paeper at crissov.de Fri Nov 18 00:27:55 2016
From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=)
Date: Fri, 18 Nov 2016 07:27:55 +0100
Subject: "Oh that's what you meant!: reducing emoji misunderstanding"
In-Reply-To: <20161117163134.665a7a7059d7ee80bb4d670165c8327d.e5ed1c79b6.wbe@email03.godaddy.com>
References: <20161117163134.665a7a7059d7ee80bb4d670165c8327d.e5ed1c79b6.wbe@email03.godaddy.com>
Message-ID: <82FB3E37-F903-4BCF-8320-2DBDE0BC41F8@crissov.de>

Doug Ewell :
>
> Or, people could just say what they mean, using language.

That's not how language (or communication in general) works. At all.

From jameskasskrv at gmail.com Fri Nov 18 00:55:20 2016
From: jameskasskrv at gmail.com (James Kass)
Date: Thu, 17 Nov 2016 22:55:20 -0800
Subject: "Oh that's what you meant!: reducing emoji misunderstanding"
In-Reply-To: <82FB3E37-F903-4BCF-8320-2DBDE0BC41F8@crissov.de>
References: <20161117163134.665a7a7059d7ee80bb4d670165c8327d.e5ed1c79b6.wbe@email03.godaddy.com> <82FB3E37-F903-4BCF-8320-2DBDE0BC41F8@crissov.de>
Message-ID:

Christoph Päper wrote,

>> Or, people could just say what they mean, using language.
>
> That's not how language (or communication in general) works. At all.

Language works best when people say what they mean and mean what they say, just as democracy works best with an informed electorate. The absence of either factor would tend to break down communication in general. Are we communicating with language here?

Best regards,

James Kass

From Shawn.Steele at microsoft.com Fri Nov 18 01:30:53 2016
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Fri, 18 Nov 2016 07:30:53 +0000
Subject: "Oh that's what you meant!: reducing emoji misunderstanding"
In-Reply-To: <20161117163134.665a7a7059d7ee80bb4d670165c8327d.e5ed1c79b6.wbe@email03.godaddy.com>
References: <20161117163134.665a7a7059d7ee80bb4d670165c8327d.e5ed1c79b6.wbe@email03.godaddy.com>
Message-ID:

> Or, people could just say what they mean, using language.
Hmm, some languages don't have words to express what one means (or feels) in every circumstance. I've used emoji when the concept would be tough, or impossible, to convey accurately in English.

-Shawn

From verdy_p at wanadoo.fr Fri Nov 18 01:40:09 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 18 Nov 2016 08:40:09 +0100
Subject: "Oh that's what you meant!: reducing emoji misunderstanding"
In-Reply-To:
References: <20161117163134.665a7a7059d7ee80bb4d670165c8327d.e5ed1c79b6.wbe@email03.godaddy.com>
Message-ID:

I would even add that emoji are in fact a new, separate language, written with its own script, its own grammar/syntax, its specific layout and combinations (ligatured clusters, partly documented in Unicode) and sometimes its own rules about rendering colors (e.g. human skin colors, or national flags if they are colorized).

I think it would merit a language code of its own. But you could use one of the special language codes for notations, if "zxx" (no linguistic content) is not appropriate. (The same remark applies to musical notations.)

2016-11-18 7:06 GMT+01:00 James Kass :

>
> Philippe Verdy wrote,
>
> > There's no evident and universal way to convert
> > emojis to natural language ...
>
> Indeed. Emoji characters apparently mean whatever their users want them
> to mean. Such meanings may be perceived differently by various users or
> communities, as the subject line indicates, and these meanings are subject
> to change without notice. Any effort to standardize such a conversion
> seems doomed, but someone with funding would probably try it anyway.
>
> Best regards,
> James Kass
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From A.Schappo at lboro.ac.uk Fri Nov 18 03:26:06 2016 From: A.Schappo at lboro.ac.uk (Andre Schappo) Date: Fri, 18 Nov 2016 09:26:06 +0000 Subject: "Oh that's what you meant!: reducing emoji misunderstanding" In-Reply-To: References: <20161117163134.665a7a7059d7ee80bb4d670165c8327d.e5ed1c79b6.wbe@email03.godaddy.com> Message-ID: As Richard Ishida insightfully points out ? should Emoji sequences/phrases/sentences adhere to the human language context eg a Japanese Emoji sequence could/should be in Japanese "Subject - Object - Verb" order https://twitter.com/r12a/status/798151134963757056 Andr? Schappo On 18 Nov 2016, at 07:40, Philippe Verdy > wrote: I would even add the Emojis are in fact a new separate language, written with its own script, its own grammar/syntax, and its specific layout and combinations (ligatured clusters, partly documented in Unicode) and sometimes specificities about colors of rendering (e.g. the human skin colors, or national flags if they are colorized). I think it would merit a language code for itself. But you could use some special language codes for notations, if "zxx" (no lingusitic content) is not appropriate. (same remark about musical notations) 2016-11-18 7:06 GMT+01:00 James Kass >: Philippe Verdy wrote, > There's no evident and universal way to convert > emojis to natural language ... Indeed. Emoji characters apparently mean whatever their users want them to mean. Such meanings may be perceived differently by various users or communities, as the subject line indicates, and these meanings are subject to change without notice. Any effort to standardize such a conversion seems doomed, but someone with funding would probably try it anyway. Best regards, James Kass -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From duerst at it.aoyama.ac.jp Fri Nov 18 04:24:40 2016 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Fri, 18 Nov 2016 19:24:40 +0900 Subject: "Oh that's what you meant!: reducing emoji misunderstanding" In-Reply-To: References: <20161117163134.665a7a7059d7ee80bb4d670165c8327d.e5ed1c79b6.wbe@email03.godaddy.com> Message-ID: In many cases, emoji communication is a lot more complicated than just copying word order from the host language. See e.g. https://www.wired.com/2016/08/how-teens-use-social-media/ for some examples. Regards, Martin. On 2016/11/18 18:26, Andre Schappo wrote: > > As Richard Ishida insightfully points out ? should Emoji sequences/phrases/sentences adhere to the human language context eg a Japanese Emoji sequence could/should be in Japanese "Subject - Object - Verb" order https://twitter.com/r12a/status/798151134963757056 > > Andr? Schappo > > On 18 Nov 2016, at 07:40, Philippe Verdy > wrote: > > I would even add the Emojis are in fact a new separate language, written with its own script, its own grammar/syntax, and its specific layout and combinations (ligatured clusters, partly documented in Unicode) and sometimes specificities about colors of rendering (e.g. the human skin colors, or national flags if they are colorized). > > I think it would merit a language code for itself. But you could use some special language codes for notations, if "zxx" (no lingusitic content) is not appropriate. (same remark about musical notations) > > 2016-11-18 7:06 GMT+01:00 James Kass >: > > Philippe Verdy wrote, > >> There's no evident and universal way to convert >> emojis to natural language ... > > Indeed. Emoji characters apparently mean whatever their users want them to mean. Such meanings may be perceived differently by various users or communities, as the subject line indicates, and these meanings are subject to change without notice. 
Any effort to standardize such a conversion seems doomed, but someone with funding would probably try it anyway.
>
> Best regards,
>
> James Kass
>
>
>

--
Prof. Dr.sc. Martin J. Dürst
Department of Intelligent Information Technology
College of Science and Engineering
Aoyama Gakuin University
Fuchinobe 5-1-10, Chuo-ku, Sagamihara
252-5258 Japan

From otto.stolz at uni-konstanz.de Fri Nov 18 05:49:49 2016
From: otto.stolz at uni-konstanz.de (Otto Stolz)
Date: Fri, 18 Nov 2016 12:49:49 +0100
Subject: "Oh that's what you meant!: reducing emoji misunderstanding"
In-Reply-To: <20161117163134.665a7a7059d7ee80bb4d670165c8327d.e5ed1c79b6.wbe@email03.godaddy.com>
References: <20161117163134.665a7a7059d7ee80bb4d670165c8327d.e5ed1c79b6.wbe@email03.godaddy.com>
Message-ID: <582EEADD.2070302@uni-konstanz.de>

On 18.11.2016 at 00:31, Doug Ewell wrote:
> Or, people could just say what they mean, using language.

This is not so easy, as Lewis Carroll had already seen; cf. this snippet from "Alice in Wonderland":

> "Then you should say what you mean," the March Hare went on.
> "I do," Alice hastily replied; "at least -- at least I mean what I say --
> that's the same thing, you know."
> "Not the same thing a bit!" said the Hatter.

Best wishes,
Otto

From wjgo_10009 at btinternet.com Fri Nov 18 09:41:36 2016
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Fri, 18 Nov 2016 15:41:36 +0000 (GMT)
Subject: "Oh that's what you meant!: reducing emoji misunderstanding"
In-Reply-To:
References: <20161117163134.665a7a7059d7ee80bb4d670165c8327d.e5ed1c79b6.wbe@email03.godaddy.com>
Message-ID: <2292097.44071.1479483696494.JavaMail.defaultUser@defaultHost>

André Schappo wrote:

> As Richard Ishida insightfully points out ?
should Emoji sequences/phrases/sentences adhere to the human language context eg a Japanese Emoji sequence could/should be in Japanese "Subject - Object - Verb" order https://twitter.com/r12a/status/798151134963757056 As it happens I have recently been designing some emoji grammatical operator characters. They are abstract emoji. The concept is that the emoji grammatical operator operates on the emoji character that follows it, so as to provide a grammatical context for the emoji character. Each of the characters is designed to be on a 7 by 7 grid, and is one contiguous piece with no inner hole. Lines are always one unit wide and only corners and T junctions are allowed. I have now added images of glyph designs for fifteen emoji grammatical operator characters to the web. They are included on the following web page. http://www.users.globalnet.co.uk/~ngo/abstract_emoji.htm That page is linked from the following web page. http://www.users.globalnet.co.uk/~ngo/library.htm I have attached copies of two of the images to this email as examples. They are as follows. emoji_grammatical_operator_verb_pluperfect_tense.png emoji_grammatical_operator_noun_direct_object.png William Overington Friday 18 November 2016 -------------- next part -------------- A non-text attachment was scrubbed... Name: emoji_grammatical_operator_noun_direct_object.png Type: image/png Size: 3013 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: emoji_grammatical_operator_verb_pluperfect_tense.png Type: image/png Size: 3022 bytes Desc: not available URL: From verdy_p at wanadoo.fr Sun Nov 20 02:46:14 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 20 Nov 2016 09:46:14 +0100 Subject: Bidi: inserting Japanese paragraphs in Arabic/Farsi document Message-ID: How should the following Japanese **paragraph** be displayed when inserted in a RTL context (Arabic/Farsi/...) ? 
?Japanese1?Japanese2 What I see in browsers is: Japanese1?Japanese2 ? Why don't the Japanese backets pair together to avoid having one mirrored and not the other one ? -------------- next part -------------- An HTML attachment was scrubbed... URL: From simon at simon-cozens.org Sun Nov 20 04:22:46 2016 From: simon at simon-cozens.org (Simon Cozens) Date: Sun, 20 Nov 2016 21:22:46 +1100 Subject: Bidi: inserting Japanese paragraphs in Arabic/Farsi document In-Reply-To: References: Message-ID: <7be20863-8c28-221f-d240-0cc5e9531352@simon-cozens.org> On 20/11/2016 19:46, Philippe Verdy wrote: > Why don't the Japanese backets pair together to avoid having one > mirrored and not the other one ? Isn't this the classic bidi brackets problem? The ? is assumed to belong to the base level because it's bidi neutral, but the ? is assumed to be part of the LTR text, so they end up in different isolating runs. I don't think there's anything special about Japanese here. The same happens for () brackets and English text. From verdy_p at wanadoo.fr Sun Nov 20 04:52:01 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 20 Nov 2016 11:52:01 +0100 Subject: Bidi: inserting Japanese paragraphs in Arabic/Farsi document In-Reply-To: <7be20863-8c28-221f-d240-0cc5e9531352@simon-cozens.org> References: <7be20863-8c28-221f-d240-0cc5e9531352@simon-cozens.org> Message-ID: Wasn't this corrected so that the direction of such 'bidi neutral" pairs should match, i.e. the leading character would adopt the direction of the trailing one in the same pair, rather than inheriting the direction from the outer context ? 2016-11-20 11:22 GMT+01:00 Simon Cozens : > On 20/11/2016 19:46, Philippe Verdy wrote: > > Why don't the Japanese backets pair together to avoid having one > > mirrored and not the other one ? > > Isn't this the classic bidi brackets problem? The ? is assumed to belong > to the base level because it's bidi neutral, but the ? 
is assumed to be > part of the LTR text, so they end up in different isolating runs. > > I don't think there's anything special about Japanese here. The same > happens for () brackets and English text. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ismeta.wikt at gmail.com Sun Nov 20 05:14:18 2016 From: ismeta.wikt at gmail.com (IS META) Date: Sun, 20 Nov 2016 11:14:18 +0000 Subject: Unicode Digest, Vol 35, Issue 16 In-Reply-To: References: Message-ID: Dear William Overington, Your abstract emoji are interesting. I am especially pleased that your *noun brown* emoji express a number of grammatical cases. However, your *Some designs for emoji of personal pronouns* is less flexible, wherein the pronouns can only express singular and plural grammatical numbers. Is there any chance that the system may be modified to enable the expression of dual grammatical number? Though the dual number is rarer than the singular?plural distinction, it occurs in many languages, including major ones like Classical Greek, Sanskrit, and Modern Standard Arabic, and it is far more widespread in pronominal systems. Perhaps the way American Sign Language expresses the dual number could provide some inspiration for this. Yours sincerely, I.S.M.E.T.A. On Sat, Nov 19, 2016 at 6:00 PM, wrote: > Send Unicode mailing list submissions to > unicode at unicode.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://unicode.org/mailman/listinfo/unicode > or, via email, send a message with subject or body 'help' to > unicode-request at unicode.org > > You can reach the person managing the list at > unicode-owner at unicode.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of Unicode digest..." > > Today's Topics: > > 1. 
Re: "Oh that's what you meant!: reducing emoji > misunderstanding" (William_J_G Overington) > > > ---------- Forwarded message ---------- > From: William_J_G Overington > To: A.Schappo at lboro.ac.uk, unicode at unicode.org > Cc: > Date: Fri, 18 Nov 2016 15:41:36 +0000 (GMT) > Subject: Re: "Oh that's what you meant!: reducing emoji misunderstanding" > Andr? Schappo wrote: > > > As Richard Ishida insightfully points out ? should Emoji > sequences/phrases/sentences adhere to the human language context eg a > Japanese Emoji sequence could/should be in Japanese "Subject - Object - > Verb" order https://twitter.com/r12a/status/798151134963757056 > > As it happens I have recently been designing some emoji grammatical > operator characters. They are abstract emoji. > > The concept is that the emoji grammatical operator operates on the emoji > character that follows it, so as to provide a grammatical context for the > emoji character. > > Each of the characters is designed to be on a 7 by 7 grid, and is one > contiguous piece with no inner hole. > > Lines are always one unit wide and only corners and T junctions are > allowed. > > I have now added images of glyph designs for fifteen emoji grammatical > operator characters to the web. > > They are included on the following web page. > > http://www.users.globalnet.co.uk/~ngo/abstract_emoji.htm > > That page is linked from the following web page. > > http://www.users.globalnet.co.uk/~ngo/library.htm > > I have attached copies of two of the images to this email as examples. > > They are as follows. > > emoji_grammatical_operator_verb_pluperfect_tense.png > > emoji_grammatical_operator_noun_direct_object.png > > William Overington > > Friday 18 November 2016 > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From eliz at gnu.org Sun Nov 20 09:27:41 2016 From: eliz at gnu.org (Eli Zaretskii) Date: Sun, 20 Nov 2016 17:27:41 +0200 Subject: Bidi: inserting Japanese paragraphs in Arabic/Farsi document In-Reply-To: (message from Philippe Verdy on Sun, 20 Nov 2016 09:46:14 +0100) References: Message-ID: <83twb29no2.fsf@gnu.org> > From: Philippe Verdy > Date: Sun, 20 Nov 2016 09:46:14 +0100 > > How should the following Japanese **paragraph** be displayed when inserted in a RTL context > (Arabic/Farsi/...) ? > > ?Japanese1?Japanese2 > > What I see in browsers is: > > Japanese1?Japanese2 ? > > Why don't the Japanese backets pair together to avoid having one mirrored and not the other one ? I guess your browser doesn't support the full Unicode 9.0 UBA. Emacs 25, for example, does TRT: I see Japanese2 ?Japanese1? (flushed all the way to the right margin of the window), as expected. P.S. I assume that by "RTL context" you mean right-to-left base paragraph direction. From eliz at gnu.org Sun Nov 20 09:29:35 2016 From: eliz at gnu.org (Eli Zaretskii) Date: Sun, 20 Nov 2016 17:29:35 +0200 Subject: Bidi: inserting Japanese paragraphs in Arabic/Farsi document In-Reply-To: <7be20863-8c28-221f-d240-0cc5e9531352@simon-cozens.org> (message from Simon Cozens on Sun, 20 Nov 2016 21:22:46 +1100) References: <7be20863-8c28-221f-d240-0cc5e9531352@simon-cozens.org> Message-ID: <83shqm9nkw.fsf@gnu.org> > From: Simon Cozens > Date: Sun, 20 Nov 2016 21:22:46 +1100 > > On 20/11/2016 19:46, Philippe Verdy wrote: > > Why don't the Japanese backets pair together to avoid having one > > mirrored and not the other one ? > > Isn't this the classic bidi brackets problem? The ? is assumed to belong > to the base level because it's bidi neutral, but the ? is assumed to be > part of the LTR text, so they end up in different isolating runs. The UBA was changed in Unicode 6.3 to process mirrored bracket pairs specially, to avoid this issue. But not all browsers caught up with that yet. 
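
To see why a paired-bracket rule matters here, it can help to look at the character properties involved. The sketch below uses Python's standard unicodedata module and assumes that the lost "?" characters in the example were the corner brackets U+300C and U+300D (the archive no longer shows them). It only prints properties and does not implement the UBA; the helper name bracket_info is made up for the illustration.

```python
import unicodedata


def bracket_info(ch: str) -> tuple[str, bool]:
    """Return (Bidi_Class, Bidi_Mirrored) for a single character."""
    return unicodedata.bidirectional(ch), bool(unicodedata.mirrored(ch))


# Assumed characters: ASCII parentheses, the corner brackets, the ASCII
# quotation mark, and one Arabic letter for contrast.
for ch in ["(", ")", "\u300C", "\u300D", '"', "\u0627"]:
    bidi_class, mirrored = bracket_info(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"class={bidi_class} mirrored={mirrored}")
```

Both corner brackets come out as neutral (ON) and mirrored, just like ASCII parentheses, so without a pairing rule each one takes its direction from whatever happens to surround it. The ASCII quotation mark is also neutral but is not mirrored. As far as I know, unicodedata does not expose the Bidi_Paired_Bracket property itself, so the actual pairing data would have to come from BidiBrackets.txt in the Unicode Character Database.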
From verdy_p at wanadoo.fr Sun Nov 20 10:20:49 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sun, 20 Nov 2016 17:20:49 +0100
Subject: Bidi: inserting Japanese paragraphs in Arabic/Farsi document
In-Reply-To: <83shqm9nkw.fsf@gnu.org>
References: <7be20863-8c28-221f-d240-0cc5e9531352@simon-cozens.org> <83shqm9nkw.fsf@gnu.org>
Message-ID:

So it is an issue with Chrome, which is still not using the new rules. I thought it was already using them.

The alignment of the paragraph to the right is optional; it is less essential. It would still be satisfactory to see:

Japanese2 ?Japanese1?

That alignment is preferred only when it is a separate paragraph, but if the Japanese citation is within an Arabic paragraph encoded as:

ARABIC-ONE "?Japanese1?Japanese2" ARABIC-TWO

I expect to see

OWT-CIBARA Japanese2 ?Japanese1?"" ENO-CIBARA

aligned to the right margin, or:

OWT-CIBARA Japanese2 ?Japanese1?"" ENO-CIBARA

if it occurs in an Arabic document.

There's still the problem of surrounding quotation marks that don't form matching pairs (unlike brackets). That's why authors will likely use mirrorable quotation marks, or will need to surround the Japanese citation and the quotation marks with some isolation (using ... or equivalent bidi isolate controls), or use an LTR override control for the leading quotation mark, to get:

OWT-CIBARA "Japanese2 ?Japanese1?" ENO-CIBARA

Maybe some bidi processors will opt for matching quotation mark pairs such as "..." or ?...? or ?...? or ?...? or ?...? or ?...?, but it is well known that this won't work if the quotation marks are not paired or if the same mirrorable character is used for the leading and trailing quotation marks, as in ? ...?.
The same problem arises if a quotation spans multiple paragraphs, where an additional quotation mark leads each additional paragraph of the same quotation (to say that the quotation continues), with only one quotation mark at the end of the last paragraph. These cannot be paired easily without ambiguities, or without a more complex resolution that will be language-dependent and would probably require additional markup of the language used in the citation text itself, or for the whole container including the quotation marks.

An example of this complex case is

? CITATION1
? CITATION2
? CITATION3 ?, Author

The style above is parsable by considering that any "trailing" quotation mark leading a line cannot really be a trailing mark (it is then a continuation mark) and that, to match the trailing quotation mark, you need to look further, possibly across multiple paragraphs.

As far as I know, there's no easy way to encode in plain-text Unicode only (without markup) that continuation marks should be ignored by Bidi processors when matching pairs, except by putting these continuation marks in isolates (e.g. above the continuation marks just before CITATION2 and CITATION3 will be encoded as , or in HTML as ?).

There's no easy solution for this case except by using some isolation with an explicit direction set to surround the whole (... or LRI...PDI).

It is notable that most quotation marks are also not mirrorable, but pseudo-mirroring by replacing these marks may be done in language-dependent processors.

2016-11-20 16:29 GMT+01:00 Eli Zaretskii :

> > From: Simon Cozens
> > Date: Sun, 20 Nov 2016 21:22:46 +1100
> >
> > On 20/11/2016 19:46, Philippe Verdy wrote:
> > > Why don't the Japanese brackets pair together to avoid having one
> > > mirrored and not the other one ?
> >
> > Isn't this the classic bidi brackets problem? The ? is assumed to belong
> > to the base level because it's bidi neutral, but the ? is assumed to be
> > part of the LTR text, so they end up in different isolating runs.
> > The UBA was changed in Unicode 6.3 to process mirrored bracket pairs > specially, to avoid this issue. But not all browsers caught up with > that yet. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eliz at gnu.org Sun Nov 20 10:37:23 2016 From: eliz at gnu.org (Eli Zaretskii) Date: Sun, 20 Nov 2016 18:37:23 +0200 Subject: Bidi: inserting Japanese paragraphs in Arabic/Farsi document In-Reply-To: (message from Philippe Verdy on Sun, 20 Nov 2016 17:20:49 +0100) References: <7be20863-8c28-221f-d240-0cc5e9531352@simon-cozens.org> <83shqm9nkw.fsf@gnu.org> Message-ID: <83h9729kfw.fsf@gnu.org> > From: Philippe Verdy > Date: Sun, 20 Nov 2016 17:20:49 +0100 > Cc: Simon Cozens , > unicode Unicode Discussion > > The alignment of the paragraph to the right is optional, it is less essential. It's essential for people who speak those languages. Not seeing the alignment would cause some brows to be raised (and can also cause incorrect reading in some marginal cases). > That alignment is prefered only when it is a separate paragraph, but if the Japanese citation is within an Arabic > paragraph encoded as : > > ARABIC-ONE "?Japanese1?Japanese2" ARABIC-TWO > > I expect to see > > OWT-CIBARA Japanese2 ?Japanese1?"" ENO-CIBARA > > aligned to the right margin No, you should see this: OWT-CIBARA "Japanese2 ?Japanese1?" ENO-CIBARA That's what Emacs shows me. > There's still the problem of surrounding quation marks that don't form matching pairs (unlike brackets), that's > why authors will likely use mirrorable quotation marks, or will need to surround the Japanese citation and the > quotations using some isolation using ... or equivalent bidi isolate controls, or an LTR override > control for the leading quotation mark to get: I don't see any problems with quotes in Emacs, see above. > May be some bidi processors may opt for matching quotation mark pairs such as "..." or ?...? or ?...? or ?...? or > ?...? 
or ?...?, but it is well known that this won't work if quotation marks are not paired or use the same > mirrorable character for the leasing and trailing quotation marks as ?...?,. They do match here without any problems. From verdy_p at wanadoo.fr Sun Nov 20 10:58:54 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 20 Nov 2016 17:58:54 +0100 Subject: Bidi: inserting Japanese paragraphs in Arabic/Farsi document In-Reply-To: <83h9729kfw.fsf@gnu.org> References: <7be20863-8c28-221f-d240-0cc5e9531352@simon-cozens.org> <83shqm9nkw.fsf@gnu.org> <83h9729kfw.fsf@gnu.org> Message-ID: 2016-11-20 17:37 GMT+01:00 Eli Zaretskii : > > From: Philippe Verdy > > Date: Sun, 20 Nov 2016 17:20:49 +0100 > > Cc: Simon Cozens , > > unicode Unicode Discussion > > > > The alignment of the paragraph to the right is optional, it is less > essential. > > It's essential for people who speak those languages. Not seeing the > alignment would cause some brows to be raised (and can also cause > incorrect reading in some marginal cases). > > > That alignment is prefered only when it is a separate paragraph, but if > the Japanese citation is within an Arabic > > paragraph encoded as : > > > > ARABIC-ONE "?Japanese1?Japanese2" ARABIC-TWO > > > > I expect to see > > > > OWT-CIBARA Japanese2 ?Japanese1?"" ENO-CIBARA > > > > aligned to the right margin > > No, you should see this: > > OWT-CIBARA "Japanese2 ?Japanese1?" ENO-CIBARA > > That's what Emacs shows me. > That's because EMACS uses some "smart quote" processing, but it is absolutely not part of the Unicode Bidi standard, This is an extension ("smart quote" matching is known to be defective in all processors in many cases because they assume rules used for specific languages, but they DO NOT work properly notably when using multilingual text where various languages use quotation marks very differently and in incompatible ways!!! 
The ASCII quotes are neither opening nor closing; they do not form **clear** pairs (e.g. when I speak about the two characters ' " ' and " ' ", smart processors are unable to correctly guess how single and double quotes are pairing, or whether they are really pairing or not !!!).

Emacs will be as stupid as other word processors here if it uses its "smart quotes" to tune the behavior of the Bidi algorithm (IMHO this is clearly a real BUG of Emacs if it does that; this will never be portable and this behavior is completely unpredictable).

Here I was speaking about the standard Bidi algorithm (also part of HTML and SVG, and implemented in browsers: none of them can use any "smart quote" processing; only some word processors may do that, with interaction with users during editing, but NEVER for rendering a read-only document, because those "smart quotes" are just guesses for the most frequent cases, and there are many exceptions, notably in multilingual documents like here).

From eliz at gnu.org Sun Nov 20 11:24:05 2016
From: eliz at gnu.org (Eli Zaretskii)
Date: Sun, 20 Nov 2016 19:24:05 +0200
Subject: Bidi: inserting Japanese paragraphs in Arabic/Farsi document
In-Reply-To: (message from Philippe Verdy on Sun, 20 Nov 2016 17:58:54 +0100)
References: <7be20863-8c28-221f-d240-0cc5e9531352@simon-cozens.org> <83shqm9nkw.fsf@gnu.org> <83h9729kfw.fsf@gnu.org>
Message-ID: <83d1hq9ia2.fsf@gnu.org>

> From: Philippe Verdy
> Date: Sun, 20 Nov 2016 17:58:54 +0100
> Cc: Simon Cozens ,
> unicode Unicode Discussion
>
> No, you should see this:
>
> OWT-CIBARA "Japanese2 ?Japanese1?" ENO-CIBARA
>
> That's what Emacs shows me.
>
> That's because EMACS uses some "smart quote" processing

It doesn't. It might have bugs in its UBA implementation, but otherwise it just implements the UBA. I wrote it, so I should know.
I believe in this case there's no bug, since each quote is between an LTR and an RTL character, so they both take the base paragraph level. FWIW, I see the same behavior in Notepad. From eliz at gnu.org Sun Nov 20 11:48:00 2016 From: eliz at gnu.org (Eli Zaretskii) Date: Sun, 20 Nov 2016 19:48:00 +0200 Subject: Bidi: inserting Japanese paragraphs in Arabic/Farsi document In-Reply-To: <83d1hq9ia2.fsf@gnu.org> (message from Eli Zaretskii on Sun, 20 Nov 2016 19:24:05 +0200) References: <7be20863-8c28-221f-d240-0cc5e9531352@simon-cozens.org> <83shqm9nkw.fsf@gnu.org> <83h9729kfw.fsf@gnu.org> <83d1hq9ia2.fsf@gnu.org> Message-ID: <837f7y9h67.fsf@gnu.org> > Date: Sun, 20 Nov 2016 19:24:05 +0200 > From: Eli Zaretskii > Cc: simon at simon-cozens.org, unicode at unicode.org > > > From: Philippe Verdy > > Date: Sun, 20 Nov 2016 17:58:54 +0100 > > Cc: Simon Cozens , > > unicode Unicode Discussion > > > > No, you should see this: > > > > OWT-CIBARA "Japanese2 ?Japanese1?" ENO-CIBARA > > > > That's what Emacs shows me. > > > > That's because EMACS uses some "smart quote" processing > > It doesn't. It might have bugs in its UBA implementation, but > otherwise it just implements the UBA. I wrote it, so I should know. > > I believe in this case there's no bug, since each quote is between an > LTR and an RTL character, so they both take the base paragraph level. > > FWIW, I see the same behavior in Notepad. I've now double-checked this in the Reference Implementation, and it also exhibits the same behavior I see in Emacs. So I believe there's no bug, and the display should be as shown above. From verdy_p at wanadoo.fr Sun Nov 20 11:50:18 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 20 Nov 2016 18:50:18 +0100 Subject: Unicode Digest, Vol 35, Issue 16 In-Reply-To: References: Message-ID: 2016-11-20 12:14 GMT+01:00 IS META : > Dear William Overington, > Your abstract emoji are interesting. 
I am especially pleased that your *noun
> brown* emoji express a number of grammatical cases. However, your *Some
> designs for emoji of personal pronouns* is less flexible, wherein the
> pronouns can only express singular and plural grammatical numbers. Is there
> any chance that the system may be modified to enable the expression of dual
> grammatical number? Though the dual number is rarer than the
> singular-plural distinction, it occurs in many languages, including major
> ones like Classical Greek, Sanskrit, and Modern Standard Arabic, and it is
> far more widespread in pronominal systems. Perhaps the way American Sign
> Language expresses the dual number could provide some inspiration for this.
>

For such graphical notations, there's absolutely no need to distinguish singular and plural (many Asian languages do not have distinctive grammatical numbers): if the numeric quantity is important, it should just be represented directly by its value (e.g. by showing hands with a number of fingers raised, but most probably by using digits directly).

On the contrary, I think it is much more important to be able to designate the 1st person speaking, and whether she speaks for herself or in the name of a group; the person(s) she is speaking to (either directly, or as the representative of a group, but this could be a separate "privately" or "alone" attribute); and a generic undesignated/impersonal 3rd person not designating anyone (he/she/it/they), possibly with an additional attribute (a number? an adjective for "near" versus "far", like in the distinction of "this" and "that" or "here" and "there" in English, or "left" vs. "right", or "front" vs. "back") to distinguish several entities.
But once again this discussion is about a long-standing personal invention by William, who has long attempted to push it as a "standard", when he is actually alone and not qualified on his own to be an academic source representing an active community, and where he never demonstrated the existence of any active community supporting his "inventions" (often self-contradictory and constantly changing): in other words, it is out of scope for the Unicode standard.

Emojis are definitely NOT used in the world the way that William thinks. William has in fact long been inventing another script (which has nothing in common with Emojis) but has not been able to convince a community to use and support it. Borrowing Emojis inside his personally invented script does not mean that Emojis are part of William's script.

But there's a very active community using Emojis (notably in Japan), with active support by local providers of communication channels, which initially developed separate incompatible solutions before thinking about standardizing their usage using a commonly agreed set (because their users wanted interoperability across providers and urged them to use compatible schemes, without losing their freedom to use Emojis like they want, i.e. without any strong "grammatical" rules).

However there are much more promising scripts to think about, notably SignWriting (but here also some Emojis could be borrowed; this does not mean that Emojis are a full part of SignWriting, just like they are not directly part of Han sinograms, or Kanas, or Latin) !
Emojis are and will remain a specific script that will never be able to express a full human language, only some small isolated items whose interpretation will remain very fuzzy, with an extremely minimalist grammar and a minimalist orthography (the "ligature" clusters documented in Unicode), so that they can be used in various languages having very different grammars or conceptual models: the interpretation of emojis is left to readers in some linguistic, territorial, cultural, or social community, who DON'T want any strong grammar: they really love the freedom of speech and composition offered by Emojis, and certainly don't want such a grammar !

So please keep William's proposed (unsupported) script completely out of the way of the encoding of Emojis, which are and will remain isolated symbols, with minimal interactions among themselves or with other scripts.

I also note that Emojis **should** all have neutral directionality, and should all be mirrorable where appropriate (so that they'll be usable in LTR or RTL contexts), unless they explicitly express the "left" vs. "right" semantics (but they could also express the "start" vs. "end" semantics that MUST be mirrorable, and possibly even "rotatable" in vertical script presentations).

From verdy_p at wanadoo.fr Sun Nov 20 11:51:01 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sun, 20 Nov 2016 18:51:01 +0100
Subject: Bidi: inserting Japanese paragraphs in Arabic/Farsi document
In-Reply-To: <83h9729kfw.fsf@gnu.org>
References: <7be20863-8c28-221f-d240-0cc5e9531352@simon-cozens.org> <83shqm9nkw.fsf@gnu.org> <83h9729kfw.fsf@gnu.org>
Message-ID:

>> I expect to see
>>
>> OWT-CIBARA Japanese2 ?Japanese1?"" ENO-CIBARA
>>
>> aligned to the right margin
> No, you should see this:
> OWT-CIBARA "Japanese2 ?Japanese1?" ENO-CIBARA

Correction: I expect to see:

OWT-CIBARA Japanese2" ?Japanese1?"
ENO-CIBARA 2016-11-20 17:37 GMT+01:00 Eli Zaretskii : > > From: Philippe Verdy > > Date: Sun, 20 Nov 2016 17:20:49 +0100 > > Cc: Simon Cozens , > > unicode Unicode Discussion > > > > The alignment of the paragraph to the right is optional, it is less > essential. > > It's essential for people who speak those languages. Not seeing the > alignment would cause some brows to be raised (and can also cause > incorrect reading in some marginal cases). > > > That alignment is prefered only when it is a separate paragraph, but if > the Japanese citation is within an Arabic > > paragraph encoded as : > > > > ARABIC-ONE "?Japanese1?Japanese2" ARABIC-TWO > > > > I expect to see > > > > OWT-CIBARA Japanese2 ?Japanese1?"" ENO-CIBARA > > > > aligned to the right margin > > No, you should see this: > > OWT-CIBARA "Japanese2 ?Japanese1?" ENO-CIBARA > > That's what Emacs shows me. > > > There's still the problem of surrounding quation marks that don't form > matching pairs (unlike brackets), that's > > why authors will likely use mirrorable quotation marks, or will need to > surround the Japanese citation and the > > quotations using some isolation using ... or equivalent bidi > isolate controls, or an LTR override > > control for the leading quotation mark to get: > > I don't see any problems with quotes in Emacs, see above. > > > May be some bidi processors may opt for matching quotation mark pairs > such as "..." or ?...? or ?...? or ?...? or > > ?...? or ?...?, but it is well known that this won't work if quotation > marks are not paired or use the same > > mirrorable character for the leasing and trailing quotation marks as > ?...?,. > > They do match here without any problems. > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From eliz at gnu.org Sun Nov 20 12:19:04 2016 From: eliz at gnu.org (Eli Zaretskii) Date: Sun, 20 Nov 2016 20:19:04 +0200 Subject: Bidi: inserting Japanese paragraphs in Arabic/Farsi document In-Reply-To: (message from Philippe Verdy on Sun, 20 Nov 2016 18:51:01 +0100) References: <7be20863-8c28-221f-d240-0cc5e9531352@simon-cozens.org> <83shqm9nkw.fsf@gnu.org> <83h9729kfw.fsf@gnu.org> Message-ID: <834m329fqf.fsf@gnu.org> > From: Philippe Verdy > Date: Sun, 20 Nov 2016 18:51:01 +0100 > Cc: Simon Cozens , > unicode Unicode Discussion > > Correction: I expect to see: > > OWT-CIBARA Japanese2" ?Japanese1?" ENO-CIBARA I don't understand why. What do you expect with the brackets removed? I expect this: OWT-CIBARA "Japanese1 Japanese2" ENO-CIBARA because N0 and N1 are no-ops, and N2 clearly says that a neutral character that is surrounded by text of different directionalities takes the embedding direction. From verdy_p at wanadoo.fr Sun Nov 20 13:58:58 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 20 Nov 2016 20:58:58 +0100 Subject: Bidi: inserting Japanese paragraphs in Arabic/Farsi document In-Reply-To: <834m329fqf.fsf@gnu.org> References: <7be20863-8c28-221f-d240-0cc5e9531352@simon-cozens.org> <83shqm9nkw.fsf@gnu.org> <83h9729kfw.fsf@gnu.org> <834m329fqf.fsf@gnu.org> Message-ID: 2016-11-20 19:19 GMT+01:00 Eli Zaretskii : > > From: Philippe Verdy > > Date: Sun, 20 Nov 2016 18:51:01 +0100 > > Cc: Simon Cozens , > > unicode Unicode Discussion > > > > Correction: I expect to see: > > > > OWT-CIBARA Japanese2" ?Japanese1?" ENO-CIBARA > > I don't understand why. > > What do you expect with the brackets removed? I expect this: > > OWT-CIBARA "Japanese1 Japanese2" ENO-CIBARA > > because N0 and N1 are no-ops, and N2 clearly says that a neutral > character that is surrounded by text of different directionalities > takes the embedding direction. 
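The reasoning Eli gives for the bracket-free case (N1 finds nothing to do, N2 assigns the embedding direction) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the UBA reference implementation: the function name `resolve_neutrals` is hypothetical, the input is a single run at one embedding level containing only L, R and neutral types (no numbers, no bracket pairs, no explicit controls), and sos/eos are both taken to be the embedding direction.

```python
# Simplified sketch of UBA rules N1 and N2 for one run of bidi types.
# Assumptions: only "L", "R" and neutral types occur; sos and eos are
# both equal to the embedding direction; N0 (bracket pairs) is skipped.
NEUTRALS = {"ON", "WS", "B", "S"}

def resolve_neutrals(types, embedding):
    """Return a copy of `types` with every neutral resolved to L or R."""
    resolved = list(types)
    n = len(types)
    i = 0
    while i < n:
        if types[i] not in NEUTRALS:
            i += 1
            continue
        # Find the end of this run of neutrals.
        j = i
        while j < n and types[j] in NEUTRALS:
            j += 1
        before = types[i - 1] if i > 0 else embedding
        after = types[j] if j < n else embedding
        # N1: same strong direction on both sides wins.
        # N2: otherwise the neutrals take the embedding direction.
        direction = before if before == after else embedding
        for k in range(i, j):
            resolved[k] = direction
        i = j
    return resolved

# ARABIC-ONE "Japanese1 Japanese2" ARABIC-TWO in logical order,
# one entry per word, space or quotation mark.
example = ["R", "WS", "ON", "L", "WS", "L", "ON", "WS", "R"]
print(resolve_neutrals(example, "R"))
```

With an RTL paragraph, both quotation marks sit between text of different directions and so resolve to R, while the space between the two Japanese words resolves to L. That matches the display Eli describes, with the quotation marks on the outside of the Japanese run.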
>

ASCII quotes are hard to match unambiguously in pairs, so they would normally inherit what is in their prior context if they cannot be paired. So the first quotation mark would take the RTL direction of ARABIC-ONE. The second quotation mark would also inherit the LTR direction of "Japanese2" and would appear to its right.

The final effect would be that the quotes would appear glued side by side. But note that the two Japanese brackets match each other, so no quotation mark can be between them: the whole bracketed section including brackets should create its own isolate; this occurs only with the old Bidi algorithm, which did not take bracket pairs into account.

So the [Japanese1] bracketed section should be OK with new renderers (this is not the case with Chrome, which still uses the old algorithm), just after ARABIC-ONE and the leading quotation mark of the Japanese section.

But probably the correct rendering should rather be:

OWT-CIBARA ?Japanese1? Japanese2"" ENO-CIBARA

unless ASCII quotation marks are paired, in which case you'll get:

OWT-CIBARA "?Japanese1? Japanese2" ENO-CIBARA

which is most probably what is expected.

All this is about deciding whether a quotation mark is "leading" or "trailing"; this is not clear at all for ASCII quotation marks, and it has a consequence on the final rendering made by the Bidi algorithm.

From verdy_p at wanadoo.fr Sun Nov 20 14:19:40 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sun, 20 Nov 2016 21:19:40 +0100
Subject: Bidi: inserting Japanese paragraphs in Arabic/Farsi document
In-Reply-To:
References: <7be20863-8c28-221f-d240-0cc5e9531352@simon-cozens.org> <83shqm9nkw.fsf@gnu.org> <83h9729kfw.fsf@gnu.org> <834m329fqf.fsf@gnu.org>
Message-ID:

Note that if you get:

OWT-CIBARA "Japanese2 ?Japanese1?" ENO-CIBARA

this means that the first quotation mark is "transparent" and preserves the RTL direction.
And I don't see then how you can pair the final quotation mark, unless you consider it as "leading" the ARABIC-TWO part (meaning that you don't pair these quotation marks at all: only brackets are paired and the fragment ?Japanese1? is correct, since you are using the new Bidi algorithm).

There are still ambiguities in handling pairs of quotation marks (this is not evident at all and it is language-dependent, since some languages do not distinguish the glyphs for the leading and trailing marks, or swap them, for example with ?Deutsch? as opposed to ?Italiano? or ? français?, and it is a difficult problem in multilingual documents, not only those mixing RTL and LTR scripts and needing the Bidi algorithm, but also those where different LTR languages occur).

For citation of Japanese in Arabic text, I would suggest using Asian quotation marks by encoding:

ARABIC-ONE ??Japanese1? Japanese2? ARABIC-TWO

so that Asian quotation marks will unambiguously pair together and you'll get:

OWT-CIBARA ?Japanese2 ?Japanese1?? ENO-CIBARA

Or, because ??, like ??, are unambiguously LTR (giving them a strong LTR direction), you'd then get the best result:

OWT-CIBARA ??Japanese1? Japanese2? ENO-CIBARA

But if there are line wraps in the middle of the Japanese section:

??Japanese1? ENO-CIBARA
OWT-CIBARA Japanese2?

notably if you can't mirror the CJK quotation marks. Otherwise, if you can mirror these marks:

?Japanese1?? ENO-CIBARA
OWT-CIBARA ?Japanese2

or without any line break in the middle of the Japanese quotation:

OWT-CIBARA ?Japanese2?Japanese1?? ENO-CIBARA

(here I use ? ? only as aliases for the mirrored ??, which are not encoded)

2016-11-20 20:58 GMT+01:00 Philippe Verdy :

>
>
> 2016-11-20 19:19 GMT+01:00 Eli Zaretskii :
>
>> > From: Philippe Verdy
>> > Date: Sun, 20 Nov 2016 18:51:01 +0100
>> > Cc: Simon Cozens ,
>> > unicode Unicode Discussion
>> >
>> > Correction: I expect to see:
>> >
>> > OWT-CIBARA Japanese2" ?Japanese1?" ENO-CIBARA
>>
>> I don't understand why.
>> >> What do you expect with the brackets removed? I expect this: >> >> OWT-CIBARA "Japanese1 Japanese2" ENO-CIBARA >> >> because N0 and N1 are no-ops, and N2 clearly says that a neutral >> character that is surrounded by text of different directionalities >> takes the embedding direction. >> > > With ASCII quotes that are hard to match unambiguously in pairs, they > would normally inherit what is in their prior context if they cannot be > paired. > So the first quotation mark would take the RTL direction of ARABIC-ONE. > the second quotation mark would also inherit the LTR direction of > "Japanese2" and would to its right. > > The final effect would be that quotes would appear glued side-by-side. But > note that the two japanese backets are matching together, so no quotation > mark can be between them: the whole bracketed section including brackets > should be creating its own isolate: this occurs only with the old Bidi > algorithm that did not take bracket pairs into account. > > So the [Japanese1] bracketed section should be OK with new renderers (this > is not the case with Chrome that still uses the old algorithm), just after > the ARABIC-ONE and the leading quotation mark of the Japanese section. > > But probably the correct rendering should rather be: > > OWT-CIBARA ?Japanese1? Japanese2"" ENO-CIBARA > > unless ASCII quotation marks are paired, in which case you'll get: > > OWT-CIBARA "?Japanese1? Japanese2" ENO-CIBARA > > which is most probably what is expected. > > All this is about deciding if a quotation mark is "leading" or "trailing", > and this is not clear at all for ASCII quotation marks and it has a > consequence on the final rendering made by the Bidi algorithm > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From eliz at gnu.org Sun Nov 20 21:39:56 2016 From: eliz at gnu.org (Eli Zaretskii) Date: Mon, 21 Nov 2016 05:39:56 +0200 Subject: Bidi: inserting Japanese paragraphs in Arabic/Farsi document In-Reply-To: (message from Philippe Verdy on Sun, 20 Nov 2016 21:19:40 +0100) References: <7be20863-8c28-221f-d240-0cc5e9531352@simon-cozens.org> <83shqm9nkw.fsf@gnu.org> <83h9729kfw.fsf@gnu.org> <834m329fqf.fsf@gnu.org> Message-ID: <83r3658prn.fsf@gnu.org> > From: Philippe Verdy > Date: Sun, 20 Nov 2016 21:19:40 +0100 > Cc: Simon Cozens , > unicode Unicode Discussion > > Note that if you get : > > OWT-CIBARA "Japanese2 ?Japanese1?" ENO-CIBARA > > this means that the first quotation mark is "transparent" and preserves the RTL direction. Yes. It takes the direction of the paragraph, which is RTL. > And I don't see then how you can pair the final quotation mark, unless you consider it as "leading" the > ARABIC-TWO part (meaning that you don't pair these quotation marks at all: only brackets are paired and the > fragment?Japanese1? is correct (you are using the new Bidi algorithm). The quotes don't need to pair, they just need both to have the paragraph direction. And that's what happens, because text on each side of each quote has different directionality. The UBA mandates that the quote (which is ON) takes the embedding direction in that case, and the embedding direction here is the base paragraph direction. From wjgo_10009 at btinternet.com Mon Nov 21 05:55:54 2016 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Mon, 21 Nov 2016 11:55:54 +0000 (GMT) Subject: Unicode Digest, Vol 35, Issue 16 In-Reply-To: References: Message-ID: <25253037.25408.1479729354126.JavaMail.defaultUser@defaultHost> Thank you for your email and for your comments. > Your abstract emoji are interesting. Thank you. > I am especially pleased that your noun brown emoji express a number of grammatical cases. Thank you. 
I designed the glyphs with two things in mind: the Latin case system, and the way that Esperanto uses a subject, an inflected version of the subject for the direct object, and a preposition followed by the same form as used for the subject for all other grammatical cases.

> However, your Some designs for emoji of personal pronouns is less flexible, wherein the pronouns can only express singular and plural grammatical numbers. Is there any chance that the system may be modified to enable the expression of dual grammatical number?

Yes. I have added some more designs for personal pronouns. I have added designs for "two" and also designs for "three or more". I have also added some designs so as to give the option of expressing "we" either basically or with specifying one or other of "inclusive we" or "exclusive we". I have also added a design for the form of "you" that is expressed by the word "tu" in French.

At the time of writing this note I have thirty-one designs, all in a document produced using the Serif PagePlus version X7 desktop publishing package. I am hoping to export each of the thirty-one designs as an individual graphic file and add the graphic files to the following web page.
http://www.users.globalnet.co.uk/~ngo/abstract_emoji.htm William Overington Monday 21 November 2016 From wjgo_10009 at btinternet.com Mon Nov 21 06:46:18 2016 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Mon, 21 Nov 2016 12:46:18 +0000 (GMT) Subject: Unicode Digest, Vol 35, Issue 16 In-Reply-To: References: Message-ID: <22985217.31526.1479732378425.JavaMail.defaultUser@defaultHost> > On the opposite I think it is much more important to be able to designate the 1st person speaking, and if she speaks for herself or in the noun of a group, the person(s) to she is speaking to (either directly, as as the representant of a group, but this could be a separate "privately" or "alone" attribute), and a generic undesignated/umpersonal 3rd person not designating anyone (he/she/it/they), possiblyt with an additional attribute (a number? an adjective for "near" versus "far", like in the distinction of "this" and "that" or "here" and "there' in English, or "left" vs."right", or "front" vs. "back") to distinguish several entities. Well, yes, there could be a design for an emoji that means ", speaking for myself," and a design for an emoji that means ", speaking on behalf of ..." and they could be useful in some circumstances. Also, there could be abstract emoji for distinguish several entities as you suggest. The way that emoji are becoming a script upon which language is built is fascinating. I wonder if there are any parallels with how picture writing turned into scripts in the past. > But once again this discussion is about a long personal invention by William, that attempts since long to push it as a "standard", when he is actually alone and not qualified alone to be an academic source representing an active community, and whre he never demonstrated the existance of any active community supporting his "inventions" (often self-contradictory and constantly changing) : No. 
I have been researching on an invention at times since 2009, but this discussion is not about that at all. This discussion is about conveying meaning using a direct display of emoji characters. In some circumstances that conveying of meaning could go through the language barrier. However, the items in this discussion are abstract emoji and are not part of the other project at all. > In other words it is out of scope for the Unicode standard. Well, emoji are part of the Unicode Standard and there can be abstract emoji. Please note item 13 of the following document. http://www.unicode.org/L2/L2016/16356-esc-cmt-feedback.pdf > Emojis are definitely NOT used in the world the way that William thinks. Oh, what do you opine that I think? > William is in fact inventing since long another script (which has nothing in copmmon with Emojis) but has not been able to conveince a community to use and support it. Borrowing Emojis inside his personnaly invented script does not mean that Emojis are part of William's script. Well, although I would not call it a script, I have been researching on an invention at times since 2009, but this discussion is not about that invention at all. In fact, emoji are not used at all in that collection of items due to the lack of precision of meaning of emoji characters. This discussion is about emoji. William Overington Monday 21 November 2016 -------------- next part -------------- An HTML attachment was scrubbed... 
From verdy_p at wanadoo.fr Mon Nov 21 12:27:10 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 21 Nov 2016 19:27:10 +0100
Subject: Bidi: inserting Japanese paragraphs in Arabic/Farsi document
In-Reply-To: <83r3658prn.fsf@gnu.org>
References: <7be20863-8c28-221f-d240-0cc5e9531352@simon-cozens.org> <83shqm9nkw.fsf@gnu.org> <83h9729kfw.fsf@gnu.org> <834m329fqf.fsf@gnu.org> <83r3658prn.fsf@gnu.org>
Message-ID:

2016-11-21 4:39 GMT+01:00 Eli Zaretskii :

> > From: Philippe Verdy
> > Date: Sun, 20 Nov 2016 21:19:40 +0100
> > Cc: Simon Cozens ,
> > unicode Unicode Discussion
> >
> > Note that if you get :
> >
> > OWT-CIBARA "Japanese2 ?Japanese1?" ENO-CIBARA
> >
> > this means that the first quotation mark is "transparent" and preserves
> the RTL direction.
>
> Yes. It takes the direction of the paragraph, which is RTL.
>
> > And I don't see then how you can pair the final quotation mark, unless
> you consider it as "leading" the
> > ARABIC-TWO part (meaning that you don't pair these quotation marks at
> all: only brackets are paired and the
> > fragment?Japanese1? is correct (you are using the new Bidi algorithm).
>
> The quotes don't need to pair, they just need both to have the
> paragraph direction. And that's what happens, because text on each
> side of each quote has different directionality. The UBA mandates
> that the quote (which is ON) takes the embedding direction in that
> case, and the embedding direction here is the base paragraph
> direction.
>

This is a reasonable rule for the most frequent cases, but I'm not sure it works in the case of multiple levels of inclusion (with different directions), where the paragraph direction is not relevant for quotation marks in the inner levels.
URL: From asmusf at ix.netcom.com Mon Nov 21 14:23:06 2016 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Mon, 21 Nov 2016 12:23:06 -0800 Subject: Bidi: inserting Japanese paragraphs in Arabic/Farsi document In-Reply-To: <834m329fqf.fsf@gnu.org> References: <7be20863-8c28-221f-d240-0cc5e9531352@simon-cozens.org> <83shqm9nkw.fsf@gnu.org> <83h9729kfw.fsf@gnu.org> <834m329fqf.fsf@gnu.org> Message-ID: Can we get that example with actual code points, for testing? A./ From verdy_p at wanadoo.fr Mon Nov 21 15:17:39 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 21 Nov 2016 22:17:39 +0100 Subject: Bidi: inserting Japanese paragraphs in Arabic/Farsi document In-Reply-To: References: <7be20863-8c28-221f-d240-0cc5e9531352@simon-cozens.org> <83shqm9nkw.fsf@gnu.org> <83h9729kfw.fsf@gnu.org> <834m329fqf.fsf@gnu.org> Message-ID: Examples were in the initial post I sent in this thread, or in other replies. In encoded order, it should be testing this: ARABIC-ONE "?japanese1?japanese2: ?english1, ? french1 ?, or? japanese3??? " ARABIC-TWO 2016-11-21 21:23 GMT+01:00 Asmus Freytag : > Can we get that example with actual code points, for testing? > > A./ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmusf at ix.netcom.com Mon Nov 21 15:40:21 2016 From: asmusf at ix.netcom.com (Asmus Freytag (c)) Date: Mon, 21 Nov 2016 13:40:21 -0800 Subject: Bidi: inserting Japanese paragraphs in Arabic/Farsi document In-Reply-To: References: <7be20863-8c28-221f-d240-0cc5e9531352@simon-cozens.org> <83shqm9nkw.fsf@gnu.org> <83h9729kfw.fsf@gnu.org> <834m329fqf.fsf@gnu.org> Message-ID: On 11/21/2016 1:17 PM, Philippe Verdy wrote: > Examples were in the initial post I sent in this thread, or in other > replies. > > In encoded order, it should be testing this: > > ARABIC-ONE "?japanese1?japanese2: ?english1, ? french1 ?, or? > japanese3???" ARABIC-TWO I don't see any actual Arabic or Japanese letters. 
A./ > > > 2016-11-21 21:23 GMT+01:00 Asmus Freytag >: > > Can we get that example with actual code points, for testing? > > A./ > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Mon Nov 21 16:02:32 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 21 Nov 2016 23:02:32 +0100 Subject: Bidi: inserting Japanese paragraphs in Arabic/Farsi document In-Reply-To: References: <7be20863-8c28-221f-d240-0cc5e9531352@simon-cozens.org> <83shqm9nkw.fsf@gnu.org> <83h9729kfw.fsf@gnu.org> <834m329fqf.fsf@gnu.org> Message-ID: You don't need them, I just used lowercase letters for strong LTR characters and uppercase for RTL, just like in the existing Bidi test page. Use some random Arabic or Japanese words if you prefer. 2016-11-21 22:40 GMT+01:00 Asmus Freytag (c) : > On 11/21/2016 1:17 PM, Philippe Verdy wrote: > > Examples were in the initial post I sent in this thread, or in other > replies. > > In encoded order, it should be testing this: > > ARABIC-ONE "?japanese1?japanese2: ?english1, ? french1 ?, or? japanese3??? > " ARABIC-TWO > > I don't see any actual Arabic or Japanese letters. > A./ > > > > 2016-11-21 21:23 GMT+01:00 Asmus Freytag : > >> Can we get that example with actual code points, for testing? >> >> A./ >> > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Mon Nov 21 16:17:15 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 21 Nov 2016 23:17:15 +0100 Subject: Bidi: inserting Japanese paragraphs in Arabic/Farsi document In-Reply-To: References: <7be20863-8c28-221f-d240-0cc5e9531352@simon-cozens.org> <83shqm9nkw.fsf@gnu.org> <83h9729kfw.fsf@gnu.org> <834m329fqf.fsf@gnu.org> Message-ID: 2016-11-21 23:02 GMT+01:00 Philippe Verdy : > You don't need them, I just used lowercase letters for strong LTR > characters and uppercase for RTL, just like in the existing Bidi test page. 
> Use some random Arabic or Japanese words if you prefer.
>
> 2016-11-21 22:40 GMT+01:00 Asmus Freytag (c) :
>
>> On 11/21/2016 1:17 PM, Philippe Verdy wrote:
>>
>> Examples were in the initial post I sent in this thread, or in other
>> replies.
>>
>> In encoded order, it should be testing this:
>>
>> ARABIC-ONE "?japanese1?japanese2: ?english1, ? french1 ?, or? japanese3??
>> ?" ARABIC-TWO
>>
>>

Replacing "japanese" with its translation into Japanese, and translating ARABIC-ONE and TWO into Arabic (note: japanese3 has also been translated into Arabic):

??????? ????? "????1????2: ?english1, ? french1 ?, or?????????? ??????" ???????-?????

The CJK square quotes are not mirrored; they are just swapped, but they still do not embed their content as pairs...

This is an example where simply assigning quotes the paragraph direction does not work, and where detecting pairs of quotes would be necessary to enclose them as isolates at the inner levels.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From chris.jacobs at xs4all.nl Mon Nov 21 16:38:15 2016
From: chris.jacobs at xs4all.nl (Chris Jacobs)
Date: Mon, 21 Nov 2016 23:38:15 +0100
Subject: Bidi: inserting Japanese paragraphs in Arabic/Farsi document
In-Reply-To: 
References: <7be20863-8c28-221f-d240-0cc5e9531352@simon-cozens.org> <83shqm9nkw.fsf@gnu.org> <83h9729kfw.fsf@gnu.org> <834m329fqf.fsf@gnu.org>
Message-ID: <40450033fc19abe42a4ed83ff9adc5f7@xs4all.nl>

The CJK quotes display here just fine in XS4ALL webmail, but not in Outlook.

Chris

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 2016-11-21.png Type: image/png Size: 200905 bytes Desc: not available URL: From asmusf at ix.netcom.com Mon Nov 21 16:58:40 2016 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Mon, 21 Nov 2016 14:58:40 -0800 Subject: Bidi: inserting Japanese paragraphs in Arabic/Farsi document In-Reply-To: References: <7be20863-8c28-221f-d240-0cc5e9531352@simon-cozens.org> <83shqm9nkw.fsf@gnu.org> <83h9729kfw.fsf@gnu.org> <834m329fqf.fsf@gnu.org> Message-ID: <31c61d2a-8911-9503-139b-9497137e2dff@ix.netcom.com> On 11/21/2016 2:17 PM, Philippe Verdy wrote: > > > 2016-11-21 23:02 GMT+01:00 Philippe Verdy >: > > You don't need them, I just used lowercase letters for strong LTR > characters and uppercase for RTL, just like in the existing Bidi > test page. Use some random Arabic or Japanese words if you prefer. > > 2016-11-21 22:40 GMT+01:00 Asmus Freytag (c) >: > > On 11/21/2016 1:17 PM, Philippe Verdy wrote: >> Examples were in the initial post I sent in this thread, or >> in other replies. >> >> In encoded order, it should be testing this: >> >> ARABIC-ONE "?japanese1?japanese2: ?english1, ? french1 ?, or? >> japanese3???" ARABIC-TWO > > Replacing "japanese" by its translation in Japanese, and translating > ARABIC-ONE and TWO into Arabic (Note: japanese3 is been also > translated in Arabic): > > ??????? ????? "????1????2: ?english1, ? french1 ?, > or?????????? ??????" ???????-????? > > > The CJK square quote are not mirrored, they are just swapped, but > still do not embed their content as pairs... > This is an example of where the simple assignement of direction for > quotes from the paragraph direction only does not work, and where > detecting pairs or quotes would be necessary to fix their enclosure as > isolates at inner levels.n I get where is the problem? A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: fhmbobjniphfamjk.png Type: image/png Size: 1707 bytes Desc: not available URL: From asmusf at ix.netcom.com Mon Nov 21 17:00:28 2016 From: asmusf at ix.netcom.com (Asmus Freytag (c)) Date: Mon, 21 Nov 2016 15:00:28 -0800 Subject: Bidi: inserting Japanese paragraphs in Arabic/Farsi document In-Reply-To: References: <7be20863-8c28-221f-d240-0cc5e9531352@simon-cozens.org> <83shqm9nkw.fsf@gnu.org> <83h9729kfw.fsf@gnu.org> <834m329fqf.fsf@gnu.org> Message-ID: <0ec66454-1dce-c926-9c35-fa6d3613e3ce@ix.netcom.com> On 11/21/2016 2:02 PM, Philippe Verdy wrote: > You don't need them, I just used lowercase letters for strong LTR > characters and uppercase for RTL, just like in the existing Bidi test > page. Use some random Arabic or Japanese words if you prefer. The difference is that I can cut/paste an actual string into my mailer/browser/whatever and observe what's happening. I see that you sent me something. I'll try and mail back a screenshot, but the list is so super-restrictive on images that you may only get it cc'd directly to you. A./ > > 2016-11-21 22:40 GMT+01:00 Asmus Freytag (c) >: > > On 11/21/2016 1:17 PM, Philippe Verdy wrote: >> Examples were in the initial post I sent in this thread, or in >> other replies. >> >> In encoded order, it should be testing this: >> >> ARABIC-ONE "?japanese1?japanese2: ?english1, ? french1 ?, or? >> japanese3???" ARABIC-TWO > I don't see any actual Arabic or Japanese letters. > A./ >> >> >> 2016-11-21 21:23 GMT+01:00 Asmus Freytag > >: >> >> Can we get that example with actual code points, for testing? >> >> A./ >> >> > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From verdy_p at wanadoo.fr Mon Nov 21 19:47:10 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 22 Nov 2016 02:47:10 +0100 Subject: Bidi: inserting Japanese paragraphs in Arabic/Farsi document In-Reply-To: <31c61d2a-8911-9503-139b-9497137e2dff@ix.netcom.com> References: <7be20863-8c28-221f-d240-0cc5e9531352@simon-cozens.org> <83shqm9nkw.fsf@gnu.org> <83h9729kfw.fsf@gnu.org> <834m329fqf.fsf@gnu.org> <31c61d2a-8911-9503-139b-9497137e2dff@ix.netcom.com> Message-ID: Look at where the Asian quotes are partially "moved" by the ASCII quotes in Chrome. May be the reason is that Chrome still does not use the new rules. You probably use another browser that implement other rules. 2016-11-21 23:58 GMT+01:00 Asmus Freytag : > On 11/21/2016 2:17 PM, Philippe Verdy wrote: > > > > 2016-11-21 23:02 GMT+01:00 Philippe Verdy : > >> You don't need them, I just used lowercase letters for strong LTR >> characters and uppercase for RTL, just like in the existing Bidi test page. >> Use some random Arabic or Japanese words if you prefer. >> >> 2016-11-21 22:40 GMT+01:00 Asmus Freytag (c) : >> >>> On 11/21/2016 1:17 PM, Philippe Verdy wrote: >>> >>> Examples were in the initial post I sent in this thread, or in other >>> replies. >>> >>> In encoded order, it should be testing this: >>> >>> ARABIC-ONE "?japanese1?japanese2: ?english1, ? french1 ?, or? japanese3? >>> ??" ARABIC-TWO >>> >>> Replacing "japanese" by its translation in Japanese, and translating > ARABIC-ONE and TWO into Arabic (Note: japanese3 is been also translated in > Arabic): > > ??????? ????? "????1????2: ?english1, ? french1 ?, or?????????? ??????" > ???????-????? > > > The CJK square quote are not mirrored, they are just swapped, but still do > not embed their content as pairs... 
> This is an example of where the simple assignement of direction for quotes > from the paragraph direction only does not work, and where detecting pairs > or quotes would be necessary to fix their enclosure as isolates at inner > levels.n > > > I get > > > where is the problem? > A./ > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: fhmbobjniphfamjk.png Type: image/png Size: 1707 bytes Desc: not available URL: From asmusf at ix.netcom.com Tue Nov 22 09:15:08 2016 From: asmusf at ix.netcom.com (Asmus Freytag (c)) Date: Tue, 22 Nov 2016 07:15:08 -0800 Subject: Bidi: inserting Japanese paragraphs in Arabic/Farsi document In-Reply-To: References: <7be20863-8c28-221f-d240-0cc5e9531352@simon-cozens.org> <83shqm9nkw.fsf@gnu.org> <83h9729kfw.fsf@gnu.org> <834m329fqf.fsf@gnu.org> <31c61d2a-8911-9503-139b-9497137e2dff@ix.netcom.com> Message-ID: On 11/21/2016 5:47 PM, Philippe Verdy wrote: > Look at where the Asian quotes are partially "moved" by the ASCII > quotes in Chrome. How does Chrome enter into this? (What I posted is a screenshot from Thunderbird on Windows 7). It seems to fully match up the the example using the UPPER/lower case convention. A./ > > May be the reason is that Chrome still does not use the new rules. > > You probably use another browser that implement other rules. > > 2016-11-21 23:58 GMT+01:00 Asmus Freytag >: > > On 11/21/2016 2:17 PM, Philippe Verdy wrote: >> >> >> 2016-11-21 23:02 GMT+01:00 Philippe Verdy > >: >> >> You don't need them, I just used lowercase letters for strong >> LTR characters and uppercase for RTL, just like in the >> existing Bidi test page. Use some random Arabic or Japanese >> words if you prefer. >> >> 2016-11-21 22:40 GMT+01:00 Asmus Freytag (c) >> >: >> >> On 11/21/2016 1:17 PM, Philippe Verdy wrote: >>> Examples were in the initial post I sent in this thread, >>> or in other replies. 
>>>
>>> In encoded order, it should be testing this:
>>>
>>> ARABIC-ONE "?japanese1?japanese2: ?english1, ? french1
>>> ?, or? japanese3???" ARABIC-TWO
>>
>> Replacing "japanese" by its translation in Japanese, and
>> translating ARABIC-ONE and TWO into Arabic (Note: japanese3 is
>> been also translated in Arabic):
>>
>> ??????? ????? "????1????2: ?english1, ? french1 ?,
>> or?????????? ??????" ???????-?????
>>
>>
>> The CJK square quote are not mirrored, they are just swapped, but
>> still do not embed their content as pairs...
>> This is an example of where the simple assignement of direction
>> for quotes from the paragraph direction only does not work, and
>> where detecting pairs or quotes would be necessary to fix their
>> enclosure as isolates at inner levels.n
>
> I get
>
>
> where is the problem?
> A./
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: image/png
Size: 1707 bytes
Desc: not available
URL: 
From tom at osg.samsung.com Tue Nov 22 06:07:16 2016
From: tom at osg.samsung.com (Tom Hacohen)
Date: Tue, 22 Nov 2016 12:07:16 +0000
Subject: Potential contradiction between the WordBreak test data and UAX #29
Message-ID: 

Dear,

I recently updated libunibreak[1] according to Unicode 9.0.0. I thought I had implemented it correctly; however, it fails two of the tests in the reference test data:

÷ 200D × 0308 ÷ 2764 ÷ # ÷ [0.2] ZERO WIDTH JOINER (ZWJ_FE) × [4.0] COMBINING DIAERESIS (Extend_FE) ÷ [999.0] HEAVY BLACK HEART (Glue_After_Zwj) ÷ [0.3]

and

÷ 200D × 0308 ÷ 1F466 ÷ # ÷ [0.2] ZERO WIDTH JOINER (ZWJ_FE) × [4.0] COMBINING DIAERESIS (Extend_FE) ÷ [999.0] BOY (EBG) ÷ [0.3]

More specifically, it fails in both cases after the "combining diaeresis". My implementation marks that position as no break, whereas the test data marks it as a break (rule 999). The reference implementation, as expected, agrees with the test data.

However, looking at the test case and the UAX[2], this does not look correct. More specifically, because of rule 4:

ZWJ Extend GAZ -> ZWJ GAZ

And then, according to rule 3c, there should be no break opportunity between them. The reference implementation, however, uses rule 999 here, which I believe is incorrect.

Am I missing anything, or is this an issue with the reference test data and reference implementation?

Thanks,
Tom.

[1]: https://github.com/adah1972/libunibreak
[2]: http://www.unicode.org/reports/tr29/#WB1

From verdy_p at wanadoo.fr Tue Nov 22 20:49:08 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 23 Nov 2016 03:49:08 +0100
Subject: Potential contradiction between the WordBreak test data and UAX #29
In-Reply-To: 
References: 
Message-ID: 

IMHO, the ZWJ should glue with the last symbol following your examples. But the combining diaeresis following the ZWJ extends it (even if in my opinion it is "defective" and would likely display on a dotted circle in renderers, but not defective for the string definition of combining sequences).

So ignore it and test whether the last symbol glues with ZWJ (it should, so there's no break in the reference implementation).

WB4: X (Extend | Format | ZWJ)* → X

Extend: Grapheme_Extend = Yes. This includes:
  General_Category = Nonspacing_Mark (this includes the combining diaeresis)
  General_Category = Enclosing_Mark
  U+200C ZERO WIDTH NON-JOINER
  plus a few General_Category = Spacing_Mark needed for canonical equivalence.

So yes, we have:

ZWJ "COMBINING DIAERESIS" (EBG|Glue_After_Zwj) → ZWJ (EBG|Glue_After_Zwj)

where rule WB4 eliminates the combining mark from the input queue.

But rule WB3c comes first and prohibits it:

WB3c: ZWJ × (Glue_After_Zwj | EBG)

This means that you have first:

ZWJ "COMBINING DIAERESIS" GAZ → ZWJ × "COMBINING DIAERESIS" EBG

and this does not match rule WB4, which does not match for:

X ?
(Extend | Format | ZWJ)*?X (it cannot remove the extenders if there's a no-break before them, it is valid only when the break oppotunity is still unspecified. As soon as a rule as produced a "break here" or "nobreak here" at a given position, you must advance after this position (the rules are based on a small finite state machine). So after : ZWJ "COMBINING DIERESIS" GAZ ? ZWJ ? "COMBINING DIERESIS" EBG it just remains in your input queue: "COMBINING DIERESIS" EBG (because "ZWJ ?" is already processed, and so ZWJ is elminated) Now comes WB4: X (Extend | Format | ZWJ)* ? X There's no more any "X" to match before the combining diaeresis: your input queue starts by the combining diareasis matching "X", the following character (EBG) does not match within "(Extend | Format | ZWJ)*" (which matches an empty string and does not contain the combining diaresis already matched in "X"), rule WB4 has then no replacement effect and preserves the initial "X" (i.e. the combining diaeresis) . 2016-11-22 13:07 GMT+01:00 Tom Hacohen : > Dear, > > I recently updated libunibreak[1] according to unicode 9.0.0. I thought I > implemented it correctly, however it fails against two of the tests in the > reference test data: > > ? 200D ? 0308 ? 2764 ? # ? [0.2] ZERO WIDTH JOINER (ZWJ_FE) ? [4.0] > COMBINING DIAERESIS (Extend_FE) ? [999.0] HEAVY BLACK HEART > (Glue_After_Zwj) ? [0.3] > > and > > ? 200D ? 0308 ? 1F466 ? # ? [0.2] ZERO WIDTH JOINER (ZWJ_FE) ? [4.0] > COMBINING DIAERESIS (Extend_FE) ? [999.0] BOY (EBG) ? [0.3] > > > More specifically, it fails in both after the "combining diaeresis". My > implementation marks it as a break, whereas the test data as not. The > reference implementation, as expected, agrees with the test data. > > > However, looking at the test case and the UAX[2], this does not look > correct. More specifically, because of rule 4: > ZWJ Extended GAZ -> ZWJ GAZ > And then according to rule 3c, there should be no break opportunity > between them. 
The reference implementation, however, uses rule 999 here, > which I believe is incorrect. > > > Am I missing anything, or is this an issue with the reference test data > and reference implementation? > > Thanks, > Tom. > > [1]: https://github.com/adah1972/libunibreak > [2]: http://www.unicode.org/reports/tr29/#WB1 > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Tue Nov 22 20:56:39 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 23 Nov 2016 03:56:39 +0100 Subject: Potential contradiction between the WordBreak test data and UAX #29 In-Reply-To: References: Message-ID: Note also this statement at the begining of the specification: Single boundaries. Each rule has exactly one boundary position. This restriction is more a limitation on the specification methods, because a rule with multiple boundaries could be expressed instead as multiple rules. For example: * ?a b ? c d ? e f? could be broken into two rules ?a b ? c d e f? and ?a b c d ? e f? * ?a b ? c d ? e f? could be broken into two rules ?a b ? c d e f? and ?a b c d ? e f? The rules are not built to allow keeping and processing multiple boundary positions. Only one is considered: once a break or no-break decision is made on a position, everything that is before that position is discarded from the input and will no longer be used in further rule. The engines loops at the first rule, just from that new boundary position to find matching rules, without ever looking backward. 2016-11-23 3:49 GMT+01:00 Philippe Verdy : > IMHO, the ZWJ should glue with the last symbol following your examples. > But the combining diaeresis following the ZWJ extends it (even if in my > opinion it is "defective" and would likely display on a dotted ciurcle in > renderers, but not defective for the string definition of combining > sequences). 
> So ignore it and test whever the last symbols glues with ZWJ (it should, > so there's no break in the reference implementation). > > WB4: X (Extend | Format | ZWJ)*?X > > Extend: [ExtendGrapheme_Extend=Yes] This includes: > General_Category = Nonspacing_Mark (this includes the combining > diaeresis) > General_Category = Enclosing_Mark > U+200C ZERO WIDTH NON-JOINER > plus a few General_Category = Spacing_Mark needed for canonical > equivalence. > > So yes we have: ZWJ "COMBINING DIERESIS" (EBG|Glue_After_Zwj) ? ZWJ (EBG| > Glue_After_Zwj) from rule WB4 eliminate the combining mark from the input > queue > > But rule WB3c comes before and prohibits it: > > WB3c: ZWJ ? (Glue_After_Zwj | EBG) > > This means that you have first: > > ZWJ "COMBINING DIERESIS" GAZ ? ZWJ ? "COMBINING DIERESIS" EBG > > and this does not match the rule WB4 which is not matching for: > > X ? (Extend | Format | ZWJ)*?X > > (it cannot remove the extenders if there's a no-break before them, it is > valid only when the break oppotunity is still unspecified. As soon as a > rule as produced a "break here" or "nobreak here" at a given position, you > must advance after this position (the rules are based on a small finite > state machine). So after : > > ZWJ "COMBINING DIERESIS" GAZ ? ZWJ ? "COMBINING DIERESIS" EBG > > it just remains in your input queue: > > "COMBINING DIERESIS" EBG (because "ZWJ ?" is already processed, and so > ZWJ is elminated) > > Now comes WB4: X (Extend | Format | ZWJ)* ? X > > There's no more any "X" to match before the combining diaeresis: your > input queue starts by the combining diareasis matching "X", the following > character (EBG) does not match within "(Extend | Format | ZWJ)*" (which > matches an empty string and does not contain the combining diaresis already > matched in "X"), rule WB4 has then no replacement effect and preserves the > initial "X" (i.e. the combining diaeresis) > > . 
> > > > > > > 2016-11-22 13:07 GMT+01:00 Tom Hacohen : > >> Dear, >> >> I recently updated libunibreak[1] according to unicode 9.0.0. I thought I >> implemented it correctly, however it fails against two of the tests in the >> reference test data: >> >> ? 200D ? 0308 ? 2764 ? # ? [0.2] ZERO WIDTH JOINER (ZWJ_FE) ? [4.0] >> COMBINING DIAERESIS (Extend_FE) ? [999.0] HEAVY BLACK HEART >> (Glue_After_Zwj) ? [0.3] >> >> and >> >> ? 200D ? 0308 ? 1F466 ? # ? [0.2] ZERO WIDTH JOINER (ZWJ_FE) ? [4.0] >> COMBINING DIAERESIS (Extend_FE) ? [999.0] BOY (EBG) ? [0.3] >> >> >> More specifically, it fails in both after the "combining diaeresis". My >> implementation marks it as a break, whereas the test data as not. The >> reference implementation, as expected, agrees with the test data. >> >> >> However, looking at the test case and the UAX[2], this does not look >> correct. More specifically, because of rule 4: >> ZWJ Extended GAZ -> ZWJ GAZ >> And then according to rule 3c, there should be no break opportunity >> between them. The reference implementation, however, uses rule 999 here, >> which I believe is incorrect. >> >> >> Am I missing anything, or is this an issue with the reference test data >> and reference implementation? >> >> Thanks, >> Tom. >> >> [1]: https://github.com/adah1972/libunibreak >> [2]: http://www.unicode.org/reports/tr29/#WB1 >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Wed Nov 23 03:05:11 2016 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 23 Nov 2016 09:05:11 +0000 Subject: Line-Breaking Hyphenation Message-ID: <20161123090511.1b691ece@JRWUBU2> What is 'line-breaking hyphenation'? In particular, I am trying to determine the meaning of the TUS statement 'There is no line-breaking hyphenation' referring to the Lanna script at the end of TUS Section 16.7. 
One possibility is that it means that visible text does not distinguish line breaks within words from line breaks at word boundaries, which would be a statement about the prevalent style. Another possibility is that it means that automatic line-breaking does not split words. I am not sure if 'opportunities for line breaking are lexical' nevertheless allows for the use of hyphenation dictionaries.

The statement 'Opportunities for line breaking are lexical, but a line break may not be inserted between a base letter and a combining diacritic' confuses me. Is it saying that a clitic may not be separated from a word if so doing would break a vertical stack? Perhaps it is also saying that there is no line break between words if they share a vertical stack, as can happen in Pali.

Richard.

From tom at osg.samsung.com Wed Nov 23 03:13:28 2016
From: tom at osg.samsung.com (Tom Hacohen)
Date: Wed, 23 Nov 2016 09:13:28 +0000
Subject: Potential contradiction between the WordBreak test data and UAX #29
In-Reply-To: 
References: 
Message-ID: 

You said:

> So ignore it and test whever the last symbols glues with ZWJ (it should,
> so there's no break in the reference implementation).

Which makes me think you misread the example I quoted. There is a break in the reference implementation, though I argue (like you just did) that there shouldn't be. So I think you agree with me and also think it's broken.

Otherwise, I'm not sure I fully understand what you are saying, but if what you are saying is correct, then following the same logic, other rules would fail, specifically:

÷ 0061 × 2060 × 0030 ÷ # ÷ [0.2] LATIN SMALL LETTER A (ALetter) × [4.0] WORD JOINER (Format_FE) × [9.0] DIGIT ZERO (Numeric) ÷ [0.3]

After the FE here there's no break because:

ALetter Format Numeric -> ALetter Numeric

which, following rule 9.0, is a no-break. This is exactly rule 4 as described in my previous email, just with a different follow-up rule (9 instead of 3c).

I don't see how rule precedence would matter here, as there is no case for which two rules apply. -- Tom. On 23/11/16 02:49, Philippe Verdy wrote: > IMHO, the ZWJ should glue with the last symbol following your examples. > But the combining diaeresis following the ZWJ extends it (even if in my > opinion it is "defective" and would likely display on a dotted ciurcle > in renderers, but not defective for the string definition of combining > sequences). > So ignore it and test whever the last symbols glues with ZWJ (it should, > so there's no break in the reference implementation). > > WB4: X (Extend | Format | ZWJ)*?X > > Extend: [ExtendGrapheme_Extend=Yes] This includes: > General_Category = Nonspacing_Mark (this includes the combining diaeresis) > General_Category = Enclosing_Mark > U+200C ZERO WIDTH NON-JOINER > plus a few General_Category = Spacing_Mark needed for canonical > equivalence. > > So yes we have: ZWJ "COMBINING DIERESIS" (EBG|Glue_After_Zwj) ? ZWJ > (EBG|Glue_After_Zwj) from rule WB4 eliminate the combining mark from the > input queue > > But rule WB3c comes before and prohibits it: > > WB3c: ZWJ ? (Glue_After_Zwj | EBG) > > This means that you have first: > > ZWJ "COMBINING DIERESIS" GAZ ? ZWJ ? "COMBINING DIERESIS" EBG > > and this does not match the rule WB4 which is not matching for: > > X ? (Extend | Format | ZWJ)*?X > > (it cannot remove the extenders if there's a no-break before them, it is > valid only when the break oppotunity is still unspecified. As soon as a > rule as produced a "break here" or "nobreak here" at a given position, > you must advance after this position (the rules are based on a small > finite state machine). So after : > > ZWJ "COMBINING DIERESIS" GAZ ? ZWJ ? "COMBINING DIERESIS" EBG > > it just remains in your input queue: > > "COMBINING DIERESIS" EBG (because "ZWJ ?" is already processed, and so > ZWJ is elminated) > > Now comes WB4: X (Extend | Format | ZWJ)* ? 
X > > There's no more any "X" to match before the combining diaeresis: your > input queue starts by the combining diareasis matching "X", the > following character (EBG) does not match within "(Extend | Format | > ZWJ)*" (which matches an empty string and does not contain the combining > diaresis already matched in "X"), rule WB4 has then no replacement > effect and preserves the initial "X" (i.e. the combining diaeresis) > > . > > > > > > > > 2016-11-22 13:07 GMT+01:00 Tom Hacohen >: > > Dear, > > I recently updated libunibreak[1] according to unicode 9.0.0. I > thought I implemented it correctly, however it fails against two of > the tests in the reference test data: > > ? 200D ? 0308 ? 2764 ? # ? [0.2] ZERO WIDTH JOINER (ZWJ_FE) ? [4.0] > COMBINING DIAERESIS (Extend_FE) ? [999.0] HEAVY BLACK HEART > (Glue_After_Zwj) ? [0.3] > > and > > ? 200D ? 0308 ? 1F466 ? # ? [0.2] ZERO WIDTH JOINER (ZWJ_FE) ? > [4.0] COMBINING DIAERESIS (Extend_FE) ? [999.0] BOY (EBG) ? [0.3] > > > More specifically, it fails in both after the "combining diaeresis". > My implementation marks it as a break, whereas the test data as not. > The reference implementation, as expected, agrees with the test data. > > > However, looking at the test case and the UAX[2], this does not look > correct. More specifically, because of rule 4: > ZWJ Extended GAZ -> ZWJ GAZ > And then according to rule 3c, there should be no break opportunity > between them. The reference implementation, however, uses rule 999 > here, which I believe is incorrect. > > > Am I missing anything, or is this an issue with the reference test > data and reference implementation? > > Thanks, > Tom. 
>
> [1]: https://github.com/adah1972/libunibreak
>
> [2]: http://www.unicode.org/reports/tr29/#WB1
>
>
>

From daniel.buenzli at erratique.ch Wed Nov 23 04:01:59 2016
From: daniel.buenzli at erratique.ch (Daniel Bünzli)
Date: Wed, 23 Nov 2016 11:01:59 +0100
Subject: Potential contradiction between the WordBreak test data and UAX #29
In-Reply-To: 
References: 
Message-ID: <34DEC7A2F6EC43DD9766B06D8E558CD7@erratique.ch>

On Tuesday 22 November 2016 at 13:07, Tom Hacohen wrote:
> However, looking at the test case and the UAX[2], this does not look
> correct. More specifically, because of rule 4:
> ZWJ Extended GAZ -> ZWJ GAZ
> And then according to rule 3c, there should be no break opportunity
> between them.

I'd say this is not the right operational model. From [1]:

"The rules are processed from top to bottom. As soon as a rule matches and produces a boundary status (boundary or no boundary) for that offset, the process is terminated."

So in this case, between COMBINING DIAERESIS and HEAVY BLACK HEART, rule WB4 kicks in. It does not produce a boundary status; it only changes your offset context to ZWJ GAZ, as you mention. Now you continue applying the rules sequentially with WB6, which does not match, with WB7, which does not match, ... and you'll get to WB999, which matches and produces a boundary status.

After WB4 you do not restart the matching process from the beginning, as you do, leading you to say that WB3c should apply.

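The two readings debated in this thread can be compared side by side. What follows is only a minimal Python sketch under loud assumptions: it models just WB3c, WB4, WB9 and WB999, it takes Word_Break property names directly instead of looking them up per code point, and the function names are hypothetical (they come neither from libunibreak nor from the reference implementation).

```python
# Simplified sketch: only WB3c, WB4, WB9 and WB999 are modelled, and property
# names are passed in directly instead of being looked up per code point.
IGNORED = {"Extend", "Format", "ZWJ"}   # classes skipped by WB4
GLUE = {"Glue_After_Zwj", "EBG"}        # right-hand side of WB3c


def left_context(props, i):
    """Return the property that WB4 leaves to the left of offset i."""
    j = i - 1
    while j > 0 and props[j] in IGNORED:
        j -= 1
    return props[j]


def is_break_single_pass(props, i):
    """Top-to-bottom pass: WB3c sees the raw neighbours; WB4 only affects later rules."""
    if props[i - 1] == "ZWJ" and props[i] in GLUE:    # WB3c on raw context
        return False
    if props[i] in IGNORED:                            # WB4: no break before ignored classes
        return False
    left = left_context(props, i)                      # WB4: context for the rules below
    if left == "ALetter" and props[i] == "Numeric":   # WB9
        return False
    return True                                        # WB999


def is_break_restart(props, i):
    """Alternative reading: reduce the context with WB4 first, then test every rule on it."""
    if props[i] in IGNORED:
        return False
    left = left_context(props, i)
    if left == "ZWJ" and props[i] in GLUE:            # WB3c on reduced context
        return False
    if left == "ALetter" and props[i] == "Numeric":   # WB9
        return False
    return True                                        # WB999


def boundaries(decide, props):
    """Break decisions (True = break) for every interior offset of props."""
    return [decide(props, i) for i in range(1, len(props))]


zwj_case = ["ZWJ", "Extend", "Glue_After_Zwj"]   # 200D 0308 2764
letter_case = ["ALetter", "Format", "Numeric"]   # 0061 2060 0030

print(boundaries(is_break_single_pass, zwj_case))
print(boundaries(is_break_restart, zwj_case))
print(boundaries(is_break_single_pass, letter_case))
```

In this sketch the single-pass reading reports a break before the Glue_After_Zwj character, as the test data does, while the restart reading reports none; both readings agree on the ALetter Format Numeric case, so that case alone cannot tell them apart.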
Best,

Daniel

[1] http://www.unicode.org/reports/tr29/#Notation

From tom at osg.samsung.com Wed Nov 23 04:22:59 2016
From: tom at osg.samsung.com (Tom Hacohen)
Date: Wed, 23 Nov 2016 10:22:59 +0000
Subject: Potential contradiction between the WordBreak test data and UAX #29
In-Reply-To: <34DEC7A2F6EC43DD9766B06D8E558CD7@erratique.ch>
References: <34DEC7A2F6EC43DD9766B06D8E558CD7@erratique.ch>
Message-ID: <11941b77-414c-4831-f02a-179f6582a522@osg.samsung.com>

On 23/11/16 10:01, Daniel Bünzli wrote:
> On Tuesday 22 November 2016 at 13:07, Tom Hacohen wrote:
>> However, looking at the test case and the UAX[2], this does not look
>> correct. More specifically, because of rule 4:
>> ZWJ Extended GAZ -> ZWJ GAZ
>> And then according to rule 3c, there should be no break opportunity
>> between them.
>
> I'd say this is not the right operational model. From [1]:
>
> "The rules are processed from top to bottom. As soon as a rule matches and produces a boundary status (boundary or no boundary) for that offset, the process is terminated."
>
> So in this case between COMBINING DIAERESIS and HEAVY BLACK HEART rule WB4 quicks in. It does not produce a boundary status, it only changes your offset context to ZWJ GAZ, as you mention. Now you continue applying the rules sequentially with WB6 which does not match, with WB7 which does not match,... and you'll get to WB999 which matches and produces a boundary status.
>
> After WB4 you do not restart the matching process from the beginning, as you do, leading you to say that WB3c should apply.

Hey Daniel,

Thank you for your reply, but I don't think the UAX, specifically the line you quoted, implies that. The line you quoted says that the process is terminated when a rule matches and produces a boundary status. In Table 1[1], the right-arrow (which is used in rule 4) is listed as a boundary symbol, so I would argue that one should stop the process and start it again from the start.

Furthermore, in the clarification to rule 4[2] it clearly states: "The main purpose of this rule is to always treat a grapheme cluster as a single character--that is, as if it were simply the first character of the cluster". This again sides with my understanding that:

X Extend Y

should behave exactly the same as

X Y

after the Extend part. Which is exactly what I'm arguing for.

--
Tom

[1] http://www.unicode.org/reports/tr29/#Table_Boundary_Symbols
[2] http://www.unicode.org/reports/tr29/#Grapheme_Cluster_and_Format_Rules

From daniel.buenzli at erratique.ch Wed Nov 23 04:52:56 2016
From: daniel.buenzli at erratique.ch (Daniel Bünzli)
Date: Wed, 23 Nov 2016 11:52:56 +0100
Subject: Potential contradiction between the WordBreak test data and UAX #29
In-Reply-To: <11941b77-414c-4831-f02a-179f6582a522@osg.samsung.com>
References: <34DEC7A2F6EC43DD9766B06D8E558CD7@erratique.ch> <11941b77-414c-4831-f02a-179f6582a522@osg.samsung.com>
Message-ID: <012E41802C7842F386529FBA99969391@erratique.ch>

On Wednesday 23 November 2016 at 11:22, Tom Hacohen wrote:
> Thank you for your reply, but I don't think the UAX, specifically the
> line you quoted implies that. The line you quoted says that the process
> is terminated when a rule matches and produces a boundary status. In
> Table 1[1], the right-arrow (which is used in rule 4) is listed as a
> boundary symbol,

Precisely: rules with this *symbol* do not produce a boundary *status*, which is either boundary or no boundary, as mentioned in parentheses in the line I quoted.

> so I would argue that one should stop the process and start it again from the start.

At least in the current UAX there is no mention of an idea of stopping and restarting the process at all.

Best, Daniel From tom at osg.samsung.com Wed Nov 23 05:00:53 2016 From: tom at osg.samsung.com (Tom Hacohen) Date: Wed, 23 Nov 2016 11:00:53 +0000 Subject: Potential contradiction between the WordBreak test data and UAX #29 In-Reply-To: <012E41802C7842F386529FBA99969391@erratique.ch> References: <34DEC7A2F6EC43DD9766B06D8E558CD7@erratique.ch> <11941b77-414c-4831-f02a-179f6582a522@osg.samsung.com> <012E41802C7842F386529FBA99969391@erratique.ch> Message-ID: On 23/11/16 10:52, Daniel Bünzli wrote: > On Wednesday 23 November 2016 at 11:22, Tom Hacohen wrote: >> Thank you for your reply, but I don't think the UAX, specifically the >> line you quoted implies that. The line you quoted says that the process >> is terminated when a rule matches and produces a boundary status. In >> Table 1[1], the right-arrow (which is used in rule 4) is listed as a >> boundary symbol, > > Precisely, rules with this *symbol* do not produce a boundary *status* which is either boundary or not boundary as mentioned in parens in the line I quoted. This looks like a misstatement rather than a binding rule. > >> so I would argue that one should stop the process and start it again from the start. > > At least in the current UAX there is no mention of an idea of stopping and restarting the process at all. Even if that's true, look at my second statement (which you redacted in your reply): Furthermore, in the clarification to rule 4[2] it clearly states: "The main purpose of this rule is to always treat a grapheme cluster as a single character - that is, as if it were simply the first character of the cluster". This again sides with my understanding that X Extended Y should behave exactly the same as X Y after the extended part, which is exactly what I'm arguing for. Also take another look at http://www.unicode.org/reports/tr29/#Grapheme_Cluster_and_Format_Rules specifically the table that shows another way of writing the ignore rule. This again shows my understanding of rule 4 is correct.
Especially look at the following equivalence: X Y × Z W ⇒ X (Extend | Format)* Y (Extend | Format)* × Z (Extend | Format)* W -- Tom From daniel.buenzli at erratique.ch Wed Nov 23 05:11:55 2016 From: daniel.buenzli at erratique.ch (Daniel Bünzli) Date: Wed, 23 Nov 2016 12:11:55 +0100 Subject: Potential contradiction between the WordBreak test data and UAX #29 In-Reply-To: References: <34DEC7A2F6EC43DD9766B06D8E558CD7@erratique.ch> <11941b77-414c-4831-f02a-179f6582a522@osg.samsung.com> <012E41802C7842F386529FBA99969391@erratique.ch> Message-ID: <1A9F4765A03D446AA6D6DD4B31465072@erratique.ch> On Wednesday 23 November 2016 at 12:00, Tom Hacohen wrote: > This looks like a mistake statement rather than a binding rule. Well, at least to me it's pretty clear that this is not the case. > Even if that's true, look at my second statement (which you redacted in > your reply): I'm not arguing about whether the boundaries produced by this process are good or not. I'm just saying that, to me, the test data is consistent with the operational model and rules of UAX #29 as it exists. Best, Daniel From tom at osg.samsung.com Wed Nov 23 05:14:09 2016 From: tom at osg.samsung.com (Tom Hacohen) Date: Wed, 23 Nov 2016 11:14:09 +0000 Subject: Potential contradiction between the WordBreak test data and UAX #29 In-Reply-To: <1A9F4765A03D446AA6D6DD4B31465072@erratique.ch> References: <34DEC7A2F6EC43DD9766B06D8E558CD7@erratique.ch> <11941b77-414c-4831-f02a-179f6582a522@osg.samsung.com> <012E41802C7842F386529FBA99969391@erratique.ch> <1A9F4765A03D446AA6D6DD4B31465072@erratique.ch> Message-ID: On 23/11/16 11:11, Daniel Bünzli wrote: > > On Wednesday 23 November 2016 at 12:00, Tom Hacohen wrote: >> This looks like a mistake statement rather than a binding rule. > Well at least to me it's pretty clear that this is not the case.
> > >> Even if that's true, look at my second statement (which you redacted in >> your reply): > > I'm not arguing whether the boundaries produced by this process is good or not. I'm just saying that to me, the test data is consistent with the operational model and rules of UAX#29 as it exists. I'm arguing it's not, and I still don't agree with your understanding of the operational model. Again, take a look at what I wrote in my last email: Also take another look at http://www.unicode.org/reports/tr29/#Grapheme_Cluster_and_Format_Rules specifically the table that shows another way of writing the ignore rule. This again shows my understanding of rule 4 is correct. Especially look at the following equivalence: X Y × Z W ⇒ X (Extend | Format)* Y (Extend | Format)* × Z (Extend | Format)* W -- Tom From verdy_p at wanadoo.fr Wed Nov 23 05:14:51 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 23 Nov 2016 12:14:51 +0100 Subject: Potential contradiction between the WordBreak test data and UAX #29 In-Reply-To: References: Message-ID: You say "there's no case where two rules apply". I don't think this is right: rules apply in precedence order as long as none of them has produced a decision ("break here" or "no break here"). This is especially important for rules that generate only a replacement, which are executed in the displayed order, because multiple rules may have their left-hand side match simultaneously. You have to read them as if this was: if (condition1) then (replacement1) else if (condition2) then (replacement2) else if (condition3) then (replacement3) ... else if (conditionN) then (replacementN) The order of conditions (i.e. the order of rules) is significant when several may be true simultaneously. Then when handling the replacement, of course you restart from the beginning. But what happens on the input stream is very different if it contains a "break here" or "no break here" (e.g. rule WB3c), or not (e.g.
rule WB4): in the latter case, the substitution will not advance the input stream, it just transforms it (it changes the internal parser state only); in the former case, the state is transformed but all elements in the input stream before the "break here" or "no break here" are discarded from the input stream, leaving only those after the "break here"/"no break here" position. The input state is a FIFO stack where each element contains: { a text buffer (or equivalently an index pointing to the relative end position in the input stream buffer) accumulating all characters (or bytes) from the input to which the WB class was assigned; a WB class (a small integer) to which this input string was mapped } and the input stream buffer. The automaton processes each rule in the listed order: to see if a rule matches, it just uses the second element (the WB class) of elements starting from those at the bottom of the stack. If there are not enough elements in the FIFO stack to match a rule completely (in "hungry" mode if that matching rule contains "*" or "+"), it will read additional bytes or a character from the input stream, to append to the top of the input buffer until it can assign it a WB class, and that element will just contain that character and that WB class, which will be pushed to the top of the FIFO stack. When a tested rule matches one or more elements starting from the bottom of the FIFO, * the replacement will transform only these elements in the FIFO: all characters in their internal text buffers are combined if needed when the replacement reduces the number of WB class items, otherwise the WB class is just replaced in the relevant element of the FIFO stack, but characters are kept unchanged. * Then if the replacement in that matched rule contains a "break here" or "no break here" item, all characters in the bottom of the FIFO up to that position are output: they are popped out from the FIFO, but other items in the FIFO are kept.
An automaton can optimize this FIFO so that the set of rules (equivalent to an ordered set of regexps) becomes a finite state automaton. But as the set of regexps is ordered, it is possible that from some input some common prefix in multiple regexps will match simultaneously: their order is significant. This is more complex than in the initial specification of word breakers where there were no "hungry" regexps and matching occurred only on pairs of characters, so that you did not need a FIFO (or the FIFO always contained a single element, never more, and the text buffer in that element was reduced to just one character or its encoded bytes): in that case there was still a significant order of rules, so that if multiple ones were potentially matching the input pair, their order in the specification determined their precedence (in that case it was possible to summarize the ordered set of rules with a simple 2D lookup table). But if you look at rule WB4: X (Extend | Format | ZWJ)* → X (which is "hungry" and not bound in length, and which does not pop out any characters from the input FIFO but still accumulates them in the input state until it no longer matches longer inputs with "X (Extend | Format | ZWJ)*"), the simple 2D lookup table approach does not work: it will match partial input at the same time as other concurrent rules, but concurrent rules must be ignored if their precedence is lower (because their rule number is higher). So the automaton cannot be a finite-state automaton whose state is represented only by a single integer in a small bounded set (the set of WB class values). Note also that the input stream is complemented with additional pseudo-characters "sot" and "eot" surrounding it: the automaton will be initialized by pushing a {"", sot} element in the FIFO and when the end of stream is reached, it will push a {"", eot} element to the FIFO. This is needed for rules WB1 and WB2 (that have the highest precedence in the set of regexps to match).
The last rule "WB999: Any ÷ Any" is not "hungry" but is equivalent to a match-all pairs regexp "..", and because it is the last rule, it has the lowest precedence: it will always match simultaneously with other rules matching pairs, but will be ignored unless none of the previous rules match. Not all rules are matching pairs (or longer sequences), notably not rules WB3a and WB3b, which match isolated newlines, but all other rules are matching at least a pair of characters; this means that rules WB3a and WB3b are in fact those that have the highest precedence. These rules not matching pairs are: WB3a: (Newline | CR | LF) ÷ WB3b: ÷ (Newline | CR | LF) They are in compact form but are equivalent to the expanded form showing their replacement: WB3a: (Newline | CR | LF) ÷ → sot Effectively this is the only rule that matches a single character; all other rules are matching pairs. Rule WB999 will match "sot eot" and will discard "sot" from the FIFO, leaving "eot" alone. ("Any eot" is matched in rule WB2). There's an additional final (implicit) rule needed to match "eot" alone: it will terminate the automaton. So all other rules are considering at least one pair and WB999 will match all of them. 2016-11-23 10:13 GMT+01:00 Tom Hacohen: > You said: > > So ignore it and test whether the last symbol glues with ZWJ (it should, > > so there's no break in the reference implementation). > > Which makes me think you misread the example I quoted. There is a break in > the reference implementation, though I argue (like you just did) that there > shouldn't be. So I think you agree with me and also think it's broken. > > Otherwise, I'm not sure I fully understand what you are saying, but if > what you are saying is correct, then following the same logic, other rules > would fail, specifically: > > ÷ 0061 × 2060 × 0030 ÷ # ÷ [0.2] LATIN SMALL LETTER A (ALetter) × [4.0] > WORD JOINER (Format_FE) × [9.0] DIGIT ZERO (Numeric) ÷
[0.3] > > After the FE here there's no BREAK because: > ALetter Format Numeric -> ALetter Numeric > Which then following rule 9.0 is a no-break. > > This is exactly the rule (4) as described in my previous email, just with > a different follow-up rule (9 instead of 3c). I don't see how rule > precedence would matter here, as there is no case for which two rules apply. > > -- > Tom. > > > On 23/11/16 02:49, Philippe Verdy wrote: > >> IMHO, the ZWJ should glue with the last symbol following your examples. >> But the combining diaeresis following the ZWJ extends it (even if in my >> opinion it is "defective" and would likely display on a dotted circle >> in renderers, but not defective for the string definition of combining >> sequences). >> So ignore it and test whether the last symbol glues with ZWJ (it should, >> so there's no break in the reference implementation). >> >> WB4: X (Extend | Format | ZWJ)* → X >> >> Extend: [ExtendGrapheme_Extend=Yes] This includes: >> General_Category = Nonspacing_Mark (this includes the combining >> diaeresis) >> General_Category = Enclosing_Mark >> U+200C ZERO WIDTH NON-JOINER >> plus a few General_Category = Spacing_Mark needed for canonical >> equivalence. >> >> So yes we have: ZWJ "COMBINING DIAERESIS" (EBG|Glue_After_Zwj) → ZWJ >> (EBG|Glue_After_Zwj) from rule WB4 eliminating the combining mark from the >> input queue >> >> But rule WB3c comes before and prohibits it: >> >> WB3c: ZWJ × (Glue_After_Zwj | EBG) >> >> This means that you have first: >> >> ZWJ "COMBINING DIAERESIS" GAZ → ZWJ × "COMBINING DIAERESIS" EBG >> >> and this does not match the rule WB4 which is not matching for: >> >> X × (Extend | Format | ZWJ)* → X >> >> (it cannot remove the extenders if there's a no-break before them, it is >> valid only when the break opportunity is still unspecified.
As soon as a >> rule has produced a "break here" or "no break here" at a given position, >> you must advance after this position (the rules are based on a small >> finite state machine). So after: >> >> ZWJ "COMBINING DIAERESIS" GAZ → ZWJ × "COMBINING DIAERESIS" EBG >> >> it just remains in your input queue: >> >> "COMBINING DIAERESIS" EBG (because "ZWJ ×" is already processed, and so >> ZWJ is eliminated) >> >> Now comes WB4: X (Extend | Format | ZWJ)* → X >> >> There's no longer any "X" to match before the combining diaeresis: your >> input queue starts with the combining diaeresis matching "X", the >> following character (EBG) does not match within "(Extend | Format | >> ZWJ)*" (which matches an empty string and does not contain the combining >> diaeresis already matched in "X"), rule WB4 has then no replacement >> effect and preserves the initial "X" (i.e. the combining diaeresis) >> >> . >> >> >> >> >> >> >> >> 2016-11-22 13:07 GMT+01:00 Tom Hacohen: >> >> >> Dear, >> >> I recently updated libunibreak[1] according to Unicode 9.0.0. I >> thought I implemented it correctly, however it fails against two of >> the tests in the reference test data: >> >> ÷ 200D × 0308 ÷ 2764 ÷ # ÷ [0.2] ZERO WIDTH JOINER (ZWJ_FE) × [4.0] >> COMBINING DIAERESIS (Extend_FE) ÷ [999.0] HEAVY BLACK HEART >> (Glue_After_Zwj) ÷ [0.3] >> >> and >> >> ÷ 200D × 0308 ÷ 1F466 ÷ # ÷ [0.2] ZERO WIDTH JOINER (ZWJ_FE) × >> [4.0] COMBINING DIAERESIS (Extend_FE) ÷ [999.0] BOY (EBG) ÷ [0.3] >> >> >> More specifically, it fails in both after the "combining diaeresis". >> My implementation marks it as no break, whereas the test data marks a break. >> The reference implementation, as expected, agrees with the test data. >> >> >> However, looking at the test case and the UAX[2], this does not look >> correct. More specifically, because of rule 4: >> ZWJ Extended GAZ -> ZWJ GAZ >> And then according to rule 3c, there should be no break opportunity >> between them.
The reference implementation, however, uses rule 999 >> here, which I believe is incorrect. >> >> >> Am I missing anything, or is this an issue with the reference test >> data and reference implementation? >> >> Thanks, >> Tom. >> >> [1]: https://github.com/adah1972/libunibreak >> >> [2]: http://www.unicode.org/reports/tr29/#WB1 >> >> >> >> > From verdy_p at wanadoo.fr Wed Nov 23 05:20:44 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 23 Nov 2016 12:20:44 +0100 Subject: Potential contradiction between the WordBreak test data and UAX #29 In-Reply-To: References: <34DEC7A2F6EC43DD9766B06D8E558CD7@erratique.ch> <11941b77-414c-4831-f02a-179f6582a522@osg.samsung.com> <012E41802C7842F386529FBA99969391@erratique.ch> Message-ID: 2016-11-23 12:00 GMT+01:00 Tom Hacohen: > > Also take another look at http://www.unicode.org/reports/tr29/#Grapheme_Cluster_and_Format_Rules specifically the table that > shows another way of writing the ignore rule. This again shows my > understanding of rule 4 is correct. > > Specially look at the following equivalence: > X Y × Z W ⇒ X (Extend | Format)* Y (Extend | Format)* × Z > (Extend | Format)* W > This expansion does not occur before rule WB4; it cannot be used to transform rules WB1 to WB3c; this is explicitly stated in the algorithm. And because rule WB3c handles your case, you are misinterpreting the spec as if the expansion applied there too...
From tom at osg.samsung.com Wed Nov 23 05:28:41 2016 From: tom at osg.samsung.com (Tom Hacohen) Date: Wed, 23 Nov 2016 11:28:41 +0000 Subject: Potential contradiction between the WordBreak test data and UAX #29 In-Reply-To: References: <34DEC7A2F6EC43DD9766B06D8E558CD7@erratique.ch> <11941b77-414c-4831-f02a-179f6582a522@osg.samsung.com> <012E41802C7842F386529FBA99969391@erratique.ch> Message-ID: <941085bf-5c67-e4d4-9263-2d897fd8915b@osg.samsung.com> On 23/11/16 11:20, Philippe Verdy wrote: > 2016-11-23 12:00 GMT+01:00 Tom Hacohen: > > > Also take another look at > http://www.unicode.org/reports/tr29/#Grapheme_Cluster_and_Format_Rules > > specifically the table that shows another way of writing the ignore > rule. This again shows my understanding of rule 4 is correct. > > Specially look at the following equivalence: > X Y × Z W ⇒ X (Extend | Format)* Y (Extend | Format)* × > Z (Extend | Format)* W > > > This expansion does not occur before rule WB4; it cannot be used to > transform rules WB1 to WB3c; this is explicitly stated in the algorithm. > And because the rule WB3c handles your case, you are misinterpreting the > specs as if it was applying there too... > I took a look at the ICU sources, and they explicitly mention this case, so it seems I was mistaken in interpreting the intention of the UAX. I still find it confusing, but based on this thread, it seems to just be me. Sorry for the noise. The comment from the ICU source code: # Rule 3c ZWJ x (Extended_Pict | EmojiNRK). Precedes WB4, so no intervening Extend chars allowed.
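To make the ordering concrete, the decision at a single offset can be sketched in Python. This is only an illustration under stated assumptions, not the ICU or libunibreak code: the function name `is_break` and the class strings are made up for the example, characters are assumed to be already mapped to their Word_Break classes, and only WB3c, WB4, WB9 (simplified to ALetter × Numeric) and WB999 are modelled. CR, LF and Newline are left out entirely.

```python
# Hypothetical, simplified model of the UAX #29 (Unicode 9.0) word-break
# decision at one offset. Only WB3c, WB4, WB9 and WB999 are modelled.
IGNORED = {"Extend", "Format", "ZWJ"}


def is_break(classes, i):
    """Return True if there is a boundary between classes[i-1] and classes[i]."""
    left, right = classes[i - 1], classes[i]

    # WB3c: ZWJ x (Glue_After_Zwj | EBG). It precedes WB4, so it is tested
    # on the raw neighbours; an intervening Extend defeats it.
    if left == "ZWJ" and right in {"Glue_After_Zwj", "EBG"}:
        return False

    # WB4, first consequence: never break before Extend, Format or ZWJ.
    if right in IGNORED:
        return False

    # WB4, second consequence: for every later rule, the left context is
    # the character X that the ignored run is attached to.
    j = i - 1
    while j > 0 and classes[j] in IGNORED:
        j -= 1
    left = classes[j]

    # WB9 (simplified): ALetter x Numeric.
    if left == "ALetter" and right == "Numeric":
        return False

    # WB999: Any / Any, break everywhere else.
    return True


# ZWJ, COMBINING DIAERESIS, HEAVY BLACK HEART: break before the heart.
print(is_break(["ZWJ", "Extend", "Glue_After_Zwj"], 2))
# LATIN SMALL LETTER A, WORD JOINER, DIGIT ZERO: no break before the digit.
print(is_break(["ALetter", "Format", "Numeric"], 2))
```

The point of the sketch is that the WB3c test happens once, before the ignored characters are skipped, and is never retried afterwards; that is why the two test lines quoted earlier end in rule 999, while the ALetter Format Numeric case still reaches WB9.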
Thanks for your help, Tom From daniel.buenzli at erratique.ch Wed Nov 23 05:45:04 2016 From: daniel.buenzli at erratique.ch (Daniel Bünzli) Date: Wed, 23 Nov 2016 12:45:04 +0100 Subject: Potential contradiction between the WordBreak test data and UAX #29 In-Reply-To: <941085bf-5c67-e4d4-9263-2d897fd8915b@osg.samsung.com> References: <34DEC7A2F6EC43DD9766B06D8E558CD7@erratique.ch> <11941b77-414c-4831-f02a-179f6582a522@osg.samsung.com> <012E41802C7842F386529FBA99969391@erratique.ch> <941085bf-5c67-e4d4-9263-2d897fd8915b@osg.samsung.com> Message-ID: On Wednesday 23 November 2016 at 12:28, Tom Hacohen wrote: > I took a look at the ICU sources, and they explicitly mention this case, > so it seems I was mistaken with interpreting the intention of the UAX. I > still find it confusing, but based on this thread, it seems to just be me. It's not only you, I also sometimes get confused by it (see for example [1] and subsequent messages). Maybe the operational model could be clarified a bit. I also think it would be better if UAX #29 didn't use ignore rules at all, so that going from rules to implementation is more straightforward --- though I understand it may make the spec harder to maintain.
Best, Daniel [1] http://www.unicode.org/mail-arch/unicode-ml/y2016-m06/0088.html From tom at osg.samsung.com Wed Nov 23 06:04:30 2016 From: tom at osg.samsung.com (Tom Hacohen) Date: Wed, 23 Nov 2016 12:04:30 +0000 Subject: Potential contradiction between the WordBreak test data and UAX #29 In-Reply-To: References: <34DEC7A2F6EC43DD9766B06D8E558CD7@erratique.ch> <11941b77-414c-4831-f02a-179f6582a522@osg.samsung.com> <012E41802C7842F386529FBA99969391@erratique.ch> <941085bf-5c67-e4d4-9263-2d897fd8915b@osg.samsung.com> Message-ID: On 23/11/16 11:45, Daniel Bünzli wrote: > On Wednesday 23 November 2016 at 12:28, Tom Hacohen wrote: >> I took a look at the ICU sources, and they explicitly mention this case, >> so it seems I was mistaken with interpreting the intention of the UAX. I >> still find it confusing, but based on this thread, it seems to just be me. > > It's not only you, I also sometimes get confused by it (see for example [1] and subsequent messages). Maybe the operational model could be clarified a bit. The comment I quoted from the ICU sources clarifies the intention. Maybe a comment similar to that one would be helpful in the UAX? Also, thinking about it a bit more, the operational order makes sense when you consider the CR LF case and extended characters; however, it is still not obvious from the wording. Thanks again. -- Tom. From everson at evertype.com Wed Nov 23 07:13:02 2016 From: everson at evertype.com (Michael Everson) Date: Wed, 23 Nov 2016 13:13:02 +0000 Subject: Line-Breaking Hyphenation In-Reply-To: <20161123090511.1b691ece@JRWUBU2> References: <20161123090511.1b691ece@JRWUBU2> Message-ID: On 23 Nov 2016, at 09:05, Richard Wordingham wrote: > > What is 'line-breaking hyphenation'? In particular, I am trying to determine the meaning of the TUS statement 'There is no line-breaking hyphenation' referring to the Lanna script at the end of TUS Section 16.7. "inserting a visible hyphen at a line boundary"
Michael Everson From jameskasskrv at gmail.com Wed Nov 23 09:15:55 2016 From: jameskasskrv at gmail.com (James Kass) Date: Wed, 23 Nov 2016 07:15:55 -0800 Subject: Manatee emoji? Message-ID: http://patch.com/florida/southtampa/petition-drive-aims-raise-manatee-awareness-adorable-way If enough people sign the petition, will Unicode add a manatee emoji? And, how about wolverines and lemmings? Are any petitions underway for them? How many signatures on a petition would be needed before Unicode would consider adding a non-existent character to the repertoire? Best regards, James Kass From Shawn.Steele at microsoft.com Wed Nov 23 10:38:56 2016 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Wed, 23 Nov 2016 16:38:56 +0000 Subject: Manatee emoji? In-Reply-To: References: Message-ID: I'm not sure I've ever heard of a "save the lemmings" campaign. Considering how much effort Florida puts into protecting Manatees and their occurrence on signs, I'm actually sort of surprised there isn't already a Manatee emoji. Had emoji been "invented" in Florida there certainly would've been one already! -Shawn -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of James Kass Sent: Wednesday, November 23, 2016 7:16 AM To: Unicode Public Subject: Manatee emoji? http://patch.com/florida/southtampa/petition-drive-aims-raise-manatee-awareness-adorable-way If enough people sign the petition, will Unicode add a manatee emoji? And, how about wolverines and lemmings? Are any petitions underway for them? How many signatures on a petition would be needed before Unicode would consider adding a non-existent character to the repertoire? Best regards, James Kass From kenwhistler at att.net Wed Nov 23 10:39:49 2016 From: kenwhistler at att.net (Ken Whistler) Date: Wed, 23 Nov 2016 08:39:49 -0800 Subject: Manatee emoji? 
In-Reply-To: References: Message-ID: <4f5a2ef4-f815-fc02-3a39-73e51ac20f8d@att.net> James, On 11/23/2016 7:15 AM, James Kass wrote: > How many signatures on a petition would be needed before > Unicode would consider adding a non-existent character to the > repertoire? I would say somewhat more than zero (which could hardly be considered a petition) and less than 7,466,363,069 (current estimate of the world population). BTW, from the selection factors page: http://www.unicode.org/emoji/selection.html#Selection_Factors_Requested "Petitions are only considered as possible indications of potential frequency of usage, among the other selection factors." BTW, U+1F984 UNICORN FACE was a "non-existent character" for a non-existent animal before it made the selection review cut and was actually encoded as a new emoji. That doesn't mean, a priori, that it was a bad choice to encode. Nor did the existence or non-existence of a petition to encode this particular non-existent animal as an emoji character make much difference, anyway. --Ken From Shawn.Steele at microsoft.com Wed Nov 23 10:59:57 2016 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Wed, 23 Nov 2016 16:59:57 +0000 Subject: Manatee emoji? In-Reply-To: <4f5a2ef4-f815-fc02-3a39-73e51ac20f8d@att.net> References: <4f5a2ef4-f815-fc02-3a39-73e51ac20f8d@att.net> Message-ID: Well, I'd suggest "more than one" as the lower limit since change.org counts the original person as #1 and Unicode'd probably want at least one other person to agree with them ;-) If I knew how to draw a Manatee glyph, I'd propose it for them ;0) However preemptively proposing this emoji wouldn't help address their concern of "raising awareness." Their change.org petition is probably doing at least as much to raise awareness as encoding an emoji without any hubbub would be. To help raise the most awareness, Unicode should probably deny it a few times so that they can raise awareness even more. 
(I'm joking about the last in case that wasn't obvious). But, more seriously, it's a fair point and we shouldn't use their Manatee proposal to try to preemptively encode emoji for other similar scenarios. Let them petition for each one. *I* personally would find a Manatee emoji more useful than many of the other ones that are already encoded. That said, I've never missed having it in the repertoire (until now). Encoding glyphs for all fauna (& flora) obviously can't happen though. I wonder where the line is? Waiting for petitions seems like a reasonable gating factor, at least until that proves problematic somehow. -Shawn -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Ken Whistler Sent: Wednesday, November 23, 2016 8:40 AM To: James Kass Cc: unicode at unicode.org Subject: Re: Manatee emoji? James, On 11/23/2016 7:15 AM, James Kass wrote: > How many signatures on a petition would be needed before Unicode would > consider adding a non-existent character to the repertoire? I would say somewhat more than zero (which could hardly be considered a petition) and less than 7,466,363,069 (current estimate of the world population). BTW, from the selection factors page: http://www.unicode.org/emoji/selection.html#Selection_Factors_Requested "Petitions are only considered as possible indications of potential frequency of usage, among the other selection factors." BTW, U+1F984 UNICORN FACE was a "non-existent character" for a non-existent animal before it made the selection review cut and was actually encoded as a new emoji. That doesn't mean, a priori, that it was a bad choice to encode. Nor did the existence or non-existence of a petition to encode this particular non-existent animal as an emoji character make much difference, anyway. --Ken From andrewcwest at gmail.com Wed Nov 23 12:47:51 2016 From: andrewcwest at gmail.com (Andrew West) Date: Wed, 23 Nov 2016 18:47:51 +0000 Subject: Manatee emoji? 
In-Reply-To: <4f5a2ef4-f815-fc02-3a39-73e51ac20f8d@att.net> References: <4f5a2ef4-f815-fc02-3a39-73e51ac20f8d@att.net> Message-ID: On 23 November 2016 at 16:39, Ken Whistler wrote: > On 11/23/2016 7:15 AM, James Kass wrote: >> >> How many signatures on a petition would be needed before >> Unicode would consider adding a non-existent character to the >> repertoire? > > I would say somewhat more than zero (which could hardly be considered a > petition) and less than 7,466,363,069 (current estimate of the world > population). Well, based on http://www.unicode.org/L2/L2016/16295r-animal-emoji.pdf I would say between 4,737 and 6,941. Andrew From leoboiko at namakajiri.net Wed Nov 23 13:10:03 2016 From: leoboiko at namakajiri.net (Leonardo Boiko) Date: Wed, 23 Nov 2016 17:10:03 -0200 Subject: Manatee emoji? In-Reply-To: References: Message-ID: I support the creation of manatee emoji, but only if it's accompanied by a new modifier for emoji size, coming in the varieties: TINY, SMALL, LARGE, HUGE. This would allow us to say "oh, the [HUGE MANATEE]" in emoji. 2016-11-23 13:15 GMT-02:00 James Kass: > http://patch.com/florida/southtampa/petition-drive-aims-raise-manatee-awareness-adorable-way > > If enough people sign the petition, will Unicode add a manatee emoji? > And, how about wolverines and lemmings? Are any petitions underway > for them? How many signatures on a petition would be needed before > Unicode would consider adding a non-existent character to the > repertoire? > > Best regards, > > James Kass From doug at ewellic.org Wed Nov 23 13:44:58 2016 From: doug at ewellic.org (Doug Ewell) Date: Wed, 23 Nov 2016 12:44:58 -0700 Subject: Manatee emoji? Message-ID: <20161123124458.665a7a7059d7ee80bb4d670165c8327d.7ac8a1b9e0.wbe@email03.godaddy.com> Leonardo Boiko wrote: > I support the creation of manatee emoji, but only if it's accompanied > by a new modifier for emoji size, coming in the varieties: TINY, > SMALL, LARGE, HUGE.
> > This would allow us to say "oh, the [HUGE MANATEE]" in emoji. Leonardo immediately wins the award for best sort-of-Unicode-related pun ever. Just retire the trophy now. But I am expecting a full array of modifiers and ZWJ sequences, to meet the user need for a female factory-worker manatee with dark skin and red hair, or families of manatees with arbitrary combinations of attributes. -- Doug Ewell | Thornton, CO, US | ewellic.org From christoph.paeper at crissov.de Wed Nov 23 15:30:08 2016 From: christoph.paeper at crissov.de (Christoph Päper) Date: Wed, 23 Nov 2016 22:30:08 +0100 Subject: Manatee emoji? In-Reply-To: References: Message-ID: James Kass: > > And, how about [other emoji]? Are any petitions underway for them? For what it's worth, several weeks ago (before UTC149), I collected all emoji petitions I could find online (and that were in languages I can at least somewhat decipher). I'm excluding anything moot added in or before Unicode 9.0 and Emoji 4.0, but am including current candidate emoji in the list below (Markdown format). In some cases, I think, it's at least as valuable to see how many people are proposing some emoji character independently as how many co-sign a single public petition.

Emoticons, Actions, People, Body and Clothing/Fashion Emojis
============================================================

- [Itching](http://www.ipetitions.com/petition/demand-an-itching-emoji)
Emoticon Faces
----------------

- [Grimacing face with smiling eyes](https://www.change.org/p/apple-change-the-grinning-emoji-back-to-how-it-was-in-ios-9)
- [Face wearing makeup](https://www.change.org/p/kik-team-help-kik-com-http-www-kik-com-contact-kik-interactive-inc-this-drawing-needs-to-be-a-real-kik-emoji)

Puke Emoji, Vomit Emoji, Barf Emoji Disgust Emoji, Sick Emoji
-------------------------------------------------------------

[FACE WITH OPEN MOUTH VOMITING](http://unicode.org/emoji/charts-beta/emoji-candidates.html#1f92e)

- [Face vomiting](https://www.change.org/p/you-vomit-emoji)
- [Face vomiting](https://www.change.org/p/facebook-add-puke-or-ill-to-available-reactions-for-posts)
- [Face vomiting](https://www.change.org/p/mark-zuckerberg-add-the-barf-disgust-reaction-to-facebook)
- [Face vomiting](https://www.change.org/p/apple-an-emoji-symbolizing-someone-throwing-up)
- [Face vomiting](https://www.openpetition.de/petition/online/wir-moechten-ein-kotzendes-smiley-bei-whatsapp)
Professions, Roles, Costumes, Clichés Emoji
----------------------------------------------

- [Emo](https://www.change.org/p/emoji-makers-emo-emoji)
- [Ninja](https://www.change.org/p/apple-create-a-ninja-emoji)
- [Pet lover](https://www.change.org/p/mark-davis-queremos-el-petloveremoji-we-want-a-petloveremoji)
- [Alien laughing](https://www.change.org/p/facebook-facebook-needs-a-alien-laughing-emoji)
- [Bachelor, Bachelorette etc.](https://www.change.org/p/apple-lets-make-these-hen-and-stag-emojis-happen)
- [Rabbi](http://www.thepetitionsite.com/148/741/454/make-a-rabbi-emoji/)
- [Viking](http://www.ipetitions.com/petition/apple-needs-to-add-a-viking-emoji)
- [Stoner](http://www.ipetitions.com/petition/stoner-emojis)

[MAGE](http://unicode.org/emoji/charts-beta/emoji-candidates.html#1f9d9)

- [Wizard](https://www.change.org/p/the-get-a-wizard-emoji-for-skype)

### Fandom Emoji

- [Fanboy and Fangirl](https://www.change.org/p/google-fangirl-fanboy-emoji)
- [Fangirl](http://www.ipetitions.com/petition/fangirl-emoji)
- [Fandom](http://www.ipetitions.com/petition/fandom-emoji-added-to-apples-emojis)
Hair and Skin Colors, Ethnicity ---------------------------------- - [Curly hair](https://www.change.org/p/unicode-consortium-there-should-be-curly-people-emojis) ### Redhead or Ginger and Freckles Emoji - [Red hair](https://www.change.org/p/apple-redheads-should-have-emoji-too) - [Red hair](https://www.change.org/p/apple-redhead-emoji-ae5c74fe-1429-4e72-a835-2508d189132c) - [Red hair](https://www.change.org/p/apple-redhead-emoji-684224ac-e260-4f2d-92ff-8653323d5675) - [Red hair](https://www.change.org/p/apple-a-red-head-emoji) - [Red hair](https://www.change.org/p/apple-a-ginger-hair-emoji-girl) - [Red hair](https://www.change.org/p/apple-make-apple-create-an-emoji-of-a-person-with-ginger-hair) - [Red hair](https://www.change.org/p/apple-make-red-haired-emoji-s-happen) - [Red hair](https://www.change.org/p/apple-redhead-representation-f8628a12-1d26-475c-bb8d-20ee166854c1) - [Red hair](https://www.change.org/p/apple-fighting-for-red-headed-emojis) - [Red hair](https://www.change.org/p/apple-justice-for-gingers-with-a-ginger-emoji) - [Red hair](https://www.change.org/p/us-redhead-emojis) - [Red hair](http://www.thepetitionsite.com/258/258/593/apple-needs-a-ginger-emoji/) - [Red hair](http://www.gopetition.com/petitions/redhead-emoji-needed.html) - [Red hair](http://www.ipetitions.com/petition/red-haired-emoji) - [Red hair](http://www.ipetitions.com/petition/redheads-should-have-emojitoo) - [Red hair](http://www.ipetitions.com/petition/things-the-world-needs) - [Red hair](http://www.petitions24.com/we_want_redhead_emojis) ?? 
Body Part Emojis ------------------- - [Leg](https://www.change.org/p/apple-to-make-a-leg-emoji) - [Kidney](https://www.change.org/p/apple-samsung-android-kidney-emoji) - [Vagina](http://www.ipetitions.com/petition/add-a-vagina-emoji) (seems to have been deleted from site) [BRAIN](http://unicode.org/emoji/charts-beta/emoji-candidates.html#1f9e0) - [Brain](https://www.change.org/p/whatsapp-volem-l-emoticona-d-un-cervell-a-whatsapp) Beard and Mustache Emoji ------------------------ [BEARDED PERSON](http://unicode.org/emoji/charts-beta/emoji-candidates.html#1f9d4) - [Beard](https://www.change.org/p/unicode-please-release-a-beardemoji) - [Beard](https://www.change.org/p/the-unicode-consortium-apple-create-a-beard-emoji) - [not mustache](https://www.change.org/p/apple-new-beard-emoji) - [Beards](https://www.change.org/p/unicode-apple-blackberry-google-microsoft-emojis-need-beards-too) - [Bearded person and bald person](https://www.change.org/p/whatsapp-queremos-un-emoji-de-whatsapp-calvo-y-barbudo) - [Beard](http://www.beardemoji.com) - [Bearded person](http://www.ipetitions.com/petition/we-want-a-bearded-emoji) Health and Illness Emoji ------------------------ - [Tumor](http://www.ipetitions.com/petition/facebook-should-add-a-tumor-emoji-to-emojis) Headwear and Hats Emoji -------------------------- - [Person wearing sombrero](https://www.change.org/p/google-inc-mexican-wearing-sombrero-emoji) ### Fedora Emoji - [Fedora](https://www.change.org/p/my-friend-justin-approve-this-fedora-emoji) - [Fedora](https://www.change.org/p/president-of-the-united-states-obtain-a-fedora-emoji-for-all-mobile-devices) ### Headscarf or Hijab Emoji [PERSON WITH HEADSCARF](http://unicode.org/emoji/charts-beta/emoji-candidates.html#1f9d5) - [Hijab](https://www.change.org/p/apple-add-a-hijab-emoji-for-muslims) - [Hijab](https://www.change.org/p/unicode-consortium-mark-davis-rachel-martin-i-want-the-hijab-emoji-i-want-diversity) - [Hijab](http://www.ipetitions.com/petition/hijabi-emojis)
Footwear and Shoe Emoji -------------------------- - [Ballet shoes](https://www.change.org/p/whats-app-create-a-ballerina-emoji-shoe-emojis-are-either-red-high-heels-sex-heavy-boots-masculine-or-frumpy-sandals) ### Socks and Stockings [SOCKS](http://unicode.org/emoji/charts-beta/emoji-candidates.html#1f9e6) - [Sock](https://www.change.org/p/apple-or-android-or-whoever-create-a-sock-emoji) - [Sock](https://www.change.org/p/national-rifle-association-add-a-sock-emoji-to-iphone) Waistwear and Belts ------------------- - [Championship belt](https://www.change.org/p/apple-a-championship-belt-emoji-should-be-a-standard-emoji) Clothing Emoji -------------- - [Cardigan](https://www.change.org/p/tim-cook-apple-needs-a-party-cardigan-emoji) Gestures and Poses Emoji ======================== - [Left-handed](https://www.change.org/p/apple-apple-make-a-left-handed-emoji) ? ?? U+1F58E (not an emoji yet) ?? Greeting or Salute Emoji --------------------------- - [Tip of the hat](https://www.change.org/p/apple-add-a-tip-of-the-hat-emoji) ?? 
Two Fingers Emoji ------------------- ### Finger Gun Emoji Hand with Thumb and Index Finger Extended, Pointing Sidewards - [Finger gun](https://www.change.org/p/all-of-those-who-support-awkward-finger-guns-as-answers-to-all-questions-there-needs-to-be-a-finger-guns-emoji) - [Finger gun](https://www.change.org/p/skype-make-a-finger-guns-emoji-on-skype) - [Finger gun](http://www.ipetitions.com/petition/we-need-a-finger-guns-emoji) - [Finger gun](http://www.petitions24.com/apple_give_us_a_finger_gun_emoji) ### Shaka Emoji Hand with Thumb and Pinky Finger Extended, Pointing Sidewards - [Shaka](https://www.change.org/p/the-unicode-consortium-let-shaka-be-in-emoji) - [Shaka](https://www.change.org/p/apple-add-the-shaka-emoji) - [Shaka](https://www.change.org/p/apple-apple-to-put-out-a-thumb-and-pinky-emoji) - [Shaka](https://www.change.org/p/all-the-shakka-people-steph-saves-shakkas) - [Shaka](https://www.change.org/p/apple-shakas-emoji-for-ios) - [Shaka](http://www.ipetitions.com/petition/shaka-brah-emoji) ?? Three Fingers Emoji ---------------------- - [Scout sign: Index, Middle and Ring Fingers Extended](https://www.change.org/p/whatsapp-queremos-un-emoji-para-la-se?a-scout-i-we-want-an-emoji-of-the-scout-signal - [Shocker: Index, Middle and Pinky Fingers Extended](https://www.change.org/p/apple-make-the-shocker-hand-an-emoji) [I LOVE YOU HAND SIGN](http://unicode.org/emoji/charts-beta/emoji-candidates.html#1f91f) - [ASL ILY: Thumb, Index and Pinky Fingers extended](https://www.change.org/p/unicode-consortium-we-want-the-i-love-you-asl-handshape-emoji) Phan, Ladders Gesture Emoji --------------------------- ? - [Ladders hand](https://www.change.org/p/apple-phan-ladders-emoji) - [Ladders hand](https://www.change.org/p/apple-make-ladders-pinof-7-an-emoji) ?? Poses Emoji -------------- - [Person with Hand under Chin](https://www.change.org/p/emoji-people-made-a-hand-under-chin-pose-emoji) ### Dab Emoji Dab is probably more appropriately filed under Fad, though. 
- [Dab](https://www.change.org/p/apple-dab-emote-for-whatsapp) - [Dab](https://www.change.org/p/google-make-a-dab-emoji-46f1e448-7cfd-4359-bcad-78630bfbf55f) - [Dab](https://www.change.org/p/emoji-a-dab-emoji-must-be-added-to-the-emoji-keyboard) - [Dab](https://www.change.org/p/mark-zuckerberg-pour-la-cr?ation-d-un-emoji-qui-dab) - [Dab](https://www.change.org/p/whatsapp-whatsapp-incluya-el-dab-smiley) - [Dab](http://www.ipetitions.com/petition/make-a-dab-emoji) ?? Food Emojis ============== - [Dip](https://www.change.org/p/apple-inc-make-an-onion-dip-emoji-2) - [Brunch](https://www.change.org/p/lovers-join-the-campaign-to-make-a-brunching-emoji-happen) - [Cheese curd](https://www.change.org/p/apple-people-for-an-iphone-cheese-curd-emoji) - [Soup](https://www.change.org/p/apple-for-apple-co-to-make-a-soup-emoji) ? ?? U+1F372 / ?? U+1F35C - [Jam](https://www.change.org/p/the-creator-of-emojis-the-creation-of-the-jar-of-jam-emoji) - [Corndog](https://www.change.org/p/apple-inc-we-need-corn-dog-emojis) - [Lasagna](https://www.change.org/p/whatsapp-mark-zuckerberg-queremos-emoji-de-lasanha-no-whatsapp) - [Sausage](http://www.ipetitions.com/petition/sausage-emoji) ? ?? U+1F32D - [Dolma: grape/wine leaves](http://www.ipetitions.com/petition/add-a-dolma-emoji-to-ios) ?? 
Chicken Nuggets or Wings Emoji --------------------------------- - [Chicken Nugget](https://www.change.org/p/to-provide-society-with-a-long-awaited-chicken-nugget-emoji) - [Chicken Nugget](https://www.change.org/p/emoji-create-a-chicken-nugget-emoji) - [Chicken Nugget](https://www.change.org/p/apple-apple-please-produce-a-chicken-nugget-emoji) - [Chicken Nugget](https://www.change.org/p/apple-chicken-dildos-as-emojis-now) - [Chicken Nugget](https://www.change.org/p/me-apple-to-make-a-chicken-nugget-emoji) - [Chicken Nugget](http://www.ipetitions.com/petition/petition-for-a-chicken-nugget-emoji) Waffle Emoji ------------ - [Waffle](https://www.change.org/p/apple-add-a-waffle-emoji) - [Waffle](https://www.change.org/p/waffle-emoji-help-us-create-a-waffle-emoji-b3a4aef9-9508-4562-954b-98135c404320) ?? Pastries Emoji ----------------- - [Dough](https://www.change.org/p/steve-jobs-apple-dough-quality) - [Chocolate cake](https://www.change.org/p/apple-add-a-chocolate-cake-emoji) [PIE](http://unicode.org/emoji/charts-beta/emoji-candidates.html#1f967) - [Pie](https://www.change.org/p/unicode-consortium-create-a-pie-emoji-seriously) ### Muffin and Cupcake Emoji - [Cupcake](https://www.change.org/p/unicode-consortium-we-need-a-cupcake-emoji-stat) - [Cupcake](https://www.change.org/p/apple-cupcake-emoji) - [Muffin](http://www.ipetitions.com/petition/we-want-a-muffin-emoji) - [Muffin](http://www.ipetitions.com/petition/muffin-emoji-2) - [Muffin](http://www.ipetitions.com/petition/muffin-emoji-3) - [Muffin](http://www.ipetitions.com/petition/fabio-needs-a-muffin-emoji) ?? Bread Emoji -------------- ### Bagel Emoji - [Bagel](http://www.ipetitions.com/petition/we-need-a-bagel-emoji) - [Bacon Bagel](http://www.ipetitions.com/petition/bakel-emoji-needed) ? U+1F953 Bacon ?? 
### Garlic Bread Emoji - [Garlic Bread](https://www.change.org/p/apple-garlic-bread-emoji) - [Garlic Bread](https://www.change.org/p/garlic-bread-fanatics-garlic-bread-emoji) - [Garlic Bread](https://www.change.org/p/google-make-garlic-bread-an-emoji-on-all-platforms) ### Sandwich or Sub Emoji [SANDWICH](http://unicode.org/emoji/charts-beta/emoji-candidates.html#1f96a) - [Smore](https://www.change.org/p/apple-create-a-s-more-emoji) - [Meatball sub](https://www.change.org/p/apple-meatball-sub-emoji) - [Croque Monsieur](https://www.change.org/p/emoji-pour-un-emoji-croque-monsieur) Pretzel Emoji ------------- [PRETZEL](http://unicode.org/emoji/charts-beta/emoji-candidates.html#1f968) - [Pretzel](https://www.change.org/p/unicode-consortium-pretzel-emoji-the-perfect-twist) Dumpling Emoji and similar Stuffed Pasta ---------------------------------------- [DUMPLING](http://unicode.org/emoji/charts-beta/emoji-candidates.html#1f95f) - [Samosa](https://www.change.org/p/apple-samosa-emoji) - [Dumpling](https://www.change.org/p/unicode-consortium-we-need-a-dumpling-emoji) - [Ravioli](https://www.change.org/p/everyone-a-ravioli-emoji) - [Pizza roll](https://www.change.org/p/dani-add-a-pizza-roll-emoji) - [Dumpling](https://www.change.org/p/the-peopo-have-apple-add-a-dumpring-emoji) - [Empanada](https://www.change.org/p/apple-a?adan-el-emoji-de-empanada) ?? Edible Fruit, Plants and Seeds --------------------------------- - [Zucchini](https://www.change.org/p/apple-create-the-courgette-emoji) - [Papaya](https://www.change.org/p/apple-make-a-papaya-emoji) - [Guava](https://www.change.org/p/people-who-work-for-the-emoji-company-there-should-be-a-guava-emoji) - [Grapefruit](http://www.ipetitions.com/petition/grapefruit-emoji-4-messenger) - [Macadamia nut (and peanut)](http://www.thepetitionsite.com/204/444/591/demand-peanut-and-macadamia-nut-emoji-now/) ? U+1F95C Peanuts ?? 
[COCONUT](http://unicode.org/emoji/charts-beta/emoji-candidates.html#1f965) - [Coconut](http://www.ipetitions.com/petition/coconut-emoji) ### Broccoli Emoji [BROCCOLI](http://unicode.org/emoji/charts-beta/emoji-candidates.html#1f966) - [Broccoli](https://www.change.org/p/apple-give-broccoli-the-emoji-it-deserves) - [Broccoli](http://www.ipetitions.com/petition/officially-add-broccoli-as-an-emoji) ### Mango Emoji - [Mango](https://www.change.org/p/apple-create-a-mango-emoji) - [Mango](https://www.change.org/p/jamal-geeemawl-create-a-mango-emoji) ### (Baked) Bean Emoji - [Bean](https://www.change.org/p/apple-bean-emoji-bae32ca4-3545-494d-a9f5-284d876286a9) - [Bean](https://www.change.org/p/apple-give-us-a-bean-emoji) - [Bean](https://www.change.org/p/apple-get-apple-to-create-a-bean-emoji) - [Baked beans](http://www.ipetitions.com/petition/petition-for-baked-bean-emoji) ### Garlic or Onion Emoji - [Garlic](https://www.change.org/p/unicode-add-a-garlic-emoji-to-the-emoji-library-to-help-us-better-express-our-culinary-lives) ### Blueberry - [Blueberry](https://www.change.org/p/donald-trump-blueberry-emoji) - [Blueberry](http://www.ipetitions.com/petition/blueberry-emoji) ?? Beverages and Drinks ----------------------- - [Mate](https://www.change.org/p/unicode-unicode-consortium-mate-chimarr?o-emoji) - [White wine](https://www.change.org/p/unicode-create-a-white-wine-emoji) ? U+1F377 Wine Glass ??, U+1F347 Grapes ?? - [Porr?](https://www.change.org/p/jan-koum-volem-que-el-porr?-sigui-una-emoticona-de-whatsapp-porr?emoji) (Catalan wine glass) - [Gin Tonic](https://www.change.org/p/jan-koum-queremos-el-emoji-de-gin-tonic-en-whatsapp-we-want-gin-tonic-emoji-in-whatsapp) - [Raki (shot) glass](https://www.change.org/p/rak?-barda??-emojisi-istiyoruz-unicode) Seasoning Emoji --------------- - [Pepper and Carrot](http://www.ipetitions.com/petition/no-pepper-or-carrot-emoji) ? U+1F955 Carrot ?? 
- [Salt shaker, Mustard and Ketchup](https://www.change.org/p/you-petition-to-apple-to-add-salt-shaker-mustard-and-ketchup-emoji) ### Salt (Shaker) Emoji - [Salt shaker](https://www.change.org/p/apple-i-want-to-bring-a-salt-emoji-to-the-emoji-keyboard-on-apple-devices) - [Salt shaker](https://www.change.org/p/apple-help-us-get-apple-to-give-us-a-salt-shaker-emoji) - [Salt](http://www.ipetitions.com/petition/make-a-salt-emoji) Cooking Emoji ------------- - [Kettle](http://www.ipetitions.com/petition/introduce-a-kettle-emoji-on-apple-phones) Sports, Hobbies and Activities ============================== ?? Sport Emoji -------------- - [Hula hoop](https://www.change.org/p/apple-make-a-hula-hooper-emoji) - [Marching band](https://www.change.org/p/apple-recognition-for-marching-band-emoji) - [Skateboard](https://www.change.org/p/apple-unicode-emoji-a-skateboard-skateboarder-emoji-should-be-included-amongst-the-many-other-sports) - [Roller skates](https://www.change.org/p/unicode-consortium-unicode-consortium-give-us-roller-skates) - [Australian Football](http://www.ipetitions.com/petition/bring-the-afl-football-emoji-to-life) ### Lacrosse Emoji - [Lacrosse](https://www.change.org/p/apple-lacrosse-emoji-845dd849-b7b2-4615-add0-c965a4021923) - [Lacrosse stick](https://www.change.org/p/apple-apple-needs-to-add-a-lacrosse-stick) - [Lacrosse](http://www.ipetitions.com/petition/lacrosse-emoji) ### Frisbee Emoji - [Frisbee](https://www.change.org/p/apple-apple-add-a-frisbee-emoji) - [Frisbee](https://www.change.org/p/apple-please-make-a-frisbee-emoji) ### Softball ? U+26BE Baseball ?? - [Softball](https://www.change.org/p/apple-softball-needs-an-emoji-before-2020) - [Softball](http://www.ipetitions.com/petition/softball-emoji-like-now) ### Gym Emoji - [Ergometer](https://www.change.org/p/make-apple-have-an-erg-emoji) ?? 
Dance, Song and Music Emoji ----------------------------- - [Vinyl record, LP](https://www.change.org/p/unicode-create-a-vinyl-emoji-for-music-lovers) - [Bellydancer](https://www.change.org/p/snapchat-snapchat-bellydancer-emoji) - [Ballet](https://www.change.org/p/whats-app-create-a-ballerina-emoji-shoe-emojis-are-either-red-high-heels-sex-heavy-boots-masculine-or-frumpy-sandals) ?? Activity ----------- ### Breastfeeding Emoji [BREAST-FEEDING](http://unicode.org/emoji/charts-beta/emoji-candidates.html#1f931) - [Breastfeeding](https://www.change.org/p/for-more-breastfeeding-on-the-world-we-wish-an-emoji-a-mum-breastfeeding-her-baby-por-un-emotic?n-pro-lactancia-materna) - [Breastfeeding](https://www.change.org/p/apple-where-is-the-breastfeeding-emoji) ?? Machines, Tools and Objects ============================== - [Passport](https://www.change.org/p/steve-dowling-vice-president-of-communications-apple-inc-create-a-passport-emoji) - [Noose](https://www.change.org/p/mark-zuckerberg-noose-emoji-on-facebook) - [Typewriter](https://www.change.org/p/apple-typewriter-emoji-for-the-ios) - [Spork](https://www.change.org/p/apple-we-as-a-union-ad-people-need-a-spork-emoji-now) - [Treasure chest](http://www.thepetitionsite.com/741/585/286/we-want-a-treasure-chest-emoji/) - [Bucket](http://www.ipetitions.com/petition/bucket-emoji) ?? Gavel vs. Hammer Emoji ------------------------- - [Gavel](https://www.change.org/p/apple-bring-back-the-gavel-emoji) - [Gavel](https://www.change.org/p/apple-bring-back-the-gavel-emoji-3e238579-4d95-44ed-9e7b-74453ac2f56e) ?? Crafts Emoji --------------- - [Sewing machine](https://www.change.org/p/http-www-emojifoundation-com-sewing-machine-emoji-emoji-machine-?-coudre) - [Sewing](https://www.change.org/p/apple-create-craft-emoji-scissors-only-sew-unfair) ?? 
Musical Instrument Emoji --------------------------- - [Euphonium](https://www.change.org/p/apple-samsung-make-a-euphonium-emoji) ### Flute Emoji - [Flute](http://www.ipetitions.com/petition/flute-emoji) - [Flute](http://www.petitions24.com/the_flute_emoji) ? Weapons Emoji ---------------- - [Lightsaber](https://www.change.org/p/the-unicode-consortium-facebook-apple-google-inc-google-htc-lightsaber-emojis-we-would-love-them-let-s-make-it-happen) ?? Vehicle Emoji ---------------- - [Caravan](https://www.change.org/p/apple-make-a-caravan-emoji) - [Tank](https://www.change.org/p/tim-cook-necessitem-l-emojitanc-necesitamos-el-emojitanc-we-need-the-emojitanc-69e2bf66-62e6-4b17-a2a6-f7d83d4539bc) ?? Furniture Emoji ------------------ - [Magic Carpet](https://www.change.org/p/apple-help-motivate-apple-to-design-a-magic-carpet-emoji-think-of-the-possibilities) - [Stool](https://www.change.org/p/android-pour-un-emoji-tabouret) - [Pillow](https://www.change.org/p/a-internet-el-emoji-de-almohada) Animal Emoji ============ - [Elephant with tusks](https://www.change.org/p/the-perfect-world-foundation-give-the-emoji-elephant-back-its-tusks) - [Chameleon](http://www.ipetitions.com/petition/I-need-a-chameleon-emoji) ? ?? U+1F98E Lizard ?? Cat Emoji ------------ - [More cats](https://www.change.org/p/apple-more-cat-emoji-s) - [Black cat](https://www.emojirequest.com/r/BlackCatEmoji) ?? Insects and Bugs Emoji ------------------------- [CRICKET](http://unicode.org/emoji/charts-beta/emoji-candidates.html#1f997) - [Crickets](https://www.change.org/p/facebook-new-crickets-emoji) ?? 
Birds Emoji -------------- - [Seagull](https://www.change.org/p/apple-get-a-seagull-emoji) - [Crying vulture](https://www.change.org/p/emoji-companies-there-should-be-a-crying-vulture-emoji) - [Parakeet](https://www.change.org/p/computer-people-make-a-parakeet-emoji-for-a-girl-i-m-interested-in) ### Flamingo Emoji - [Flamingo](https://www.change.org/p/apple-flamingo-emoji) - [Flamingo](https://www.change.org/p/apple-add-a-flamingo-emoji) - [Flamingo](https://www.sophiawebster.com/flamingo-emoji-petition) - [Flamingo](https://www.emojirequest.com/r/FlamingoEmoji) ### Ostrich Emoji - [Ostrich](https://www.change.org/p/instagram-create-an-ostrich-emoji) ### Swan Emoji - [Swan and Goose](https://www.change.org/p/michelle-obama-i-wany-a-goose-emoji) - [Swan](https://www.change.org/p/unicode-consortium-unicode-consortium-p-o-box-391476-mountain-view-ca-94039-1476-u-s-a-whattsapp-hinzuf?gen-des-schwan-emojis-including-the-swan-emoji) Dog Breeds Emoji ---------------- ### Pug - [Pug](https://www.change.org/p/everyone-make-a-sloth-and-pug-emoji) - [Pug](http://www.ipetitions.com/petition/pugs-and-emojis) ### Shiba - [Shibs](https://www.change.org/p/facebook-shibs-4-messenger) - [Shiba and Husky](https://www.change.org/p/mark-zuckerberg-emojis-shiba-et-huksy-sur-facebook) Ferret and Weasel Emoji ----------------------- - [Ferret](https://www.change.org/p/shigetaka-kurita-add-ferret-emoji) - [Ferret](https://www.change.org/p/make-a-ferret-emoji-happen) - [Ferret](https://www.change.org/p/apple-ferret-emoji-please-unicode) ?? Lobster Emoji ---------------- ? 
U+1F980 - [Lobster](https://www.change.org/p/unicode-a-lobster-emoji) - [Lobster](https://www.change.org/p/we-have-a-crab-emoji-now-it-s-time-for-a-lobster-emoji) Giraffe Emoji ------------- [GIRAFFE FACE](http://unicode.org/emoji/charts-beta/emoji-candidates.html#1f992) - [Giraffe](https://www.change.org/p/apple-microsoft-facebook-giraffe-emoji) - [Giraffe](https://www.change.org/p/apple-apple-to-create-a-giraffe-emoji) - [Giraffe](https://www.change.org/p/tim-cook-a-giraffe-emoji) - [Giraffe](https://www.change.org/p/apple-make-a-giraffe-emoji-a5a0e90f-45c5-44dc-aa52-674a884dda45) - [Giraffe](https://www.change.org/p/apple-apple-make-a-giraffe-emoji) - [Giraffe](https://www.change.org/p/apple-petition-to-have-a-giraffe-emoji) - [Giraffe](https://www.change.org/p/apple-get-apple-to-make-a-giraffe-emoji) - [Giraffe](https://www.change.org/p/unicode-consortium-there-should-be-a-giraffe-emoji) - [Giraffe](https://www.change.org/p/emoji-there-needs-to-be-a-giraffe-emoji-who-s-with-me) - [Giraffe](https://www.change.org/p/whatsapp-inc-para-que-whatsapp-inlcuya-un-emoji-de-jirafa-en-la-secci?n-de-animales) - [Giraffe](https://www.change.org/p/apple-we-want-giraffe-emoji-s) - [Giraffe](http://www.thepetitionsite.com/701/400/453/make-a-giraffe-emoji-apple/) - [Giraffe](http://www.ipetitions.com/petition/giraffe-emoji-2) - [Giraffe](http://www.ipetitions.com/petition/giraffe-emoji-3) - [Giraffe](http://www.ipetitions.com/petition/giraffe-emoji-4) - [Giraffe](http://www.ipetitions.com/petition/help-create-the-giraffe-emoji) - [Giraffe](http://www.ipetitions.com/petition/we-need-a-giraffe-emoji) - [Giraffe Face](https://www.emojirequest.com/r/GiraffeFaceEmoji) Hedgehog Emoji -------------- [HEDGEHOG](http://unicode.org/emoji/charts-beta/emoji-candidates.html#1f994) - [Hedgehog](http://www.ipetitions.com/petition/hedgehog-emoji) - [Hedgehog](https://www.emojirequest.com/r/HedgehogFaceEmoji) Zebra Emoji ----------- [ZEBRA 
FACE](http://unicode.org/emoji/charts-beta/emoji-candidates.html#1f993) - [ZEBRA FACE](https://www.emojirequest.com/r/ZebraFaceEmoji) ?? Dinosaurs Emoji ------------------ [SAUROPOD](http://unicode.org/emoji/charts-beta/emoji-candidates.html#1f995) [T-REX](http://unicode.org/emoji/charts-beta/emoji-candidates.html#1f996) - [Dinosaur](https://www.change.org/p/apple-dinosaur-emoji) - [Dinosaur](https://www.change.org/p/apple-have-a-dinosaur-emoji) - [Dinosaur](https://www.change.org/p/apple-let-s-get-a-dinosaur-emoji) - [Dinosaur](https://www.change.org/p/apple-help-us-get-a-velociraptor-emoji) - [Dinosaur](https://www.change.org/p/apple-apple-where-tf-is-the-dinosaur-emoji-tho) - [Dinosaur](https://www.change.org/p/apple-let-s-get-a-dinosaur-emoji-on-apple-devices) - [Dinosaur](https://www.change.org/p/apple-get-apple-to-give-us-a-dinosaur-emoji-06440e0b-d456-464c-a1e7-4d48e44d9f19) - [Dinosaur](http://www.ipetitions.com/petition/dino-emoji) - [Dinosaur](https://www.emojirequest.com/r/DinosaurEmoji) Llama or Alpaka Emoji --------------------- - [Llama](https://www.change.org/p/apple-llama-emoji-335d92b3-baf3-4447-99a9-a24ce3853db0) - [Llama](https://www.change.org/p/apple-we-demand-apple-to-make-a-llama-emoji) - [Alpaca](https://www.change.org/p/unicode-wir-fordern-einen-alpaka-emoji) - [Llama](https://www.change.org/p/whatsapp-emoji-de-una-llama) - [Llama](http://www.ipetitions.com/petition/llama-emoji) - [Llama](http://www.ipetitions.com/petition/llama-emoji-2) - [Alpaca](http://www.ipetitions.com/petition/alpaca-emoji-for-whatsapp) Otter Emoji ----------- - [Otter](https://www.change.org/p/apple-unicode-consortium-we-need-an-otter-emoji) - [Otter](https://www.change.org/p/apple-let-s-get-otters-their-own-emoji) - [Otter](https://www.change.org/p/whatsapp-bring-an-otter-to-whatsapp-emojis) Manatee or Walrus Emoji ----------------------- - [Manatee](https://www.change.org/p/apple-com-add-a-manatee-emoji) - 
[Manatee](https://www.change.org/p/anyone-donald-trump-manatee-emoji) ?? Whale and Dolphin Emoji -------------------------- ? ?? U+1F40B, ?? U+1F433, ?? U+1F42C - [Orca](https://www.change.org/p/apple-make-a-killer-whale-emoji-in-apple-s-emoji-board) ### Narwhal Emoji - [Narwhal](https://www.change.org/p/unicode-consortium-a-petition-to-create-a-narwhal-emoji-proposal-to-unicode-consortrium) - [Narwhal](https://www.change.org/p/michellebyang111-gmail-com-can-we-add-a-narwhal-emoji-in-the-new-ios) Hippopotamus Emoji ------------------ - [Hippo](http://www.ipetitions.com/petition/HippoEmoji) Kangaroo Emoji -------------- - [Kangaroo](https://www.change.org/p/apple-add-a-kangaroo-emoji-to-the-iphone) - [Kangaroo](https://www.change.org/p/where-s-our-kangaroo-emoji) ?? Polar Bear Emoji ------------------- - [Polar bear](https://www.change.org/p/apple-inc-make-a-polar-bear-emoji) - [Polar bear](http://www.ipetitions.com/petition/polar-bear-emoji) ?? Squirrel or Rodent Emoji --------------------------- ? 
U+1F43F - [Squirrel](https://www.change.org/p/unicode-give-us-the-squirrel-emoji-now) Opossum Emoji ------------- - [Opossum](https://www.change.org/p/apple-possum-emoticon) Raccoon Emoji ------------- - [Raccoon](https://www.change.org/p/apple-have-apple-make-a-raccoon-emoji) - [Raccoon](https://www.change.org/p/unicode-consortium-enough-is-enough-raccoons-need-equal-representation-in-the-emoji-keyboard-now) - [Raccoon](http://www.ipetitions.com/petition/racoon-emojis-to-make-a-difference) - [Raccoon](http://www.ipetitions.com/petition/raccoon-emojis) - [Raccoon](http://www.ipetitions.com/petition/raccoon-emojis-for-freedom) - [Raccoon](https://www.gopetition.com/petitions/raccoon-emoji.html) (Honey) Badger Emoji -------------------- - [(Hufflepuff) Badger](https://www.change.org/p/facebook-we-want-a-badger-emoji-for-hufflepuff-facebook-group-chats-and-we-want-it-now) - [Honey Badger](http://www.gopetition.com/petitions/petition-to-make-honey-badger-emoji-on-fb-messenger.html) Sloth Emoji ----------- - [Sloth](https://www.change.org/p/apple-verizon-sprint-apple-should-make-a-sloth-emoji) - [Sloth](https://www.change.org/p/steve-jobs-apple-apple-inc-make-a-sloth-emoji-for-the-ios-devices) - [Sloth](https://www.change.org/p/everyone-make-a-sloth-and-pug-emoji) ?? Plants and Flowers ===================== not for eating Poppy ----- - [Poppy](https://www.change.org/p/make-a-poppy-emoji-for-rememberance-day-petition) - [Poppy](https://www.change.org/p/a-poppy-emoji-for-remembrance-day) Recreational Drugs ------------------ - [Weed](https://www.change.org/p/apple-inc-add-a-weed-emoji-to-iphone-and-android-devices) - [Blunt](https://www.change.org/p/apple-make-a-blunt-emoji-cafd1ce2-f01a-4f89-b0ea-5af7da25a018) - [Marijuana](http://www.ipetitions.com/petition/create-a-marijuana-emoji) - [Stoner](http://www.ipetitions.com/petition/stoner-emojis) ?? Flags ======== ???? Countries of United Kingdom ------------------------------- ### Scotland/Alba or Saltire or St. 
Andrew Cross Flag - [Scotland](https://www.change.org/p/nicola-sturgeon-saltire-flag-emoji) - [Scotland](https://www.change.org/p/facebook-create-a-scottish-saltire-flag-emoticon-standrewsday) - [Scotland](https://www.change.org/p/international-organisation-we-want-a-scotland-emoji-flag) - [Scotland](http://www.ipetitions.com/petition/change-the-icon-on-facebook-for-burns-supper-from) - [Scotland and Northern Ireland](http://www.petitions24.com/scotland_and_northern_ireland_to_get_an_emoji) ### Wales/Cymru or Dragon Flag - [Wales](https://www.change.org/p/apple-get-apple-to-add-a-welsh-flag-emoji) - [Wales](https://www.change.org/p/the-unicode-consortium-welsh-flag-emoji-appeal) - [Wales](https://www.change.org/p/apple-i-really-want-apple-to-aknowledge-wales-as-a-country-and-give-us-our-flag-emoji) - [Wales](https://www.change.org/p/kane-to-add-the-welsh-flag-to-emoji) - [Wales](http://www.gopetition.com/petitions/get-the-welsh-flag-on-the-emoji-lists.html) - [Wales](http://www.ipetitions.com/petition/welsh-flag-emoji-for-apple-emoji-keyboard) - [Wales](http://www.ipetitions.com/petition/facebook-welsh-flag-emoji) ### Northern Ireland Flag - [Northern Ireland](https://www.change.org/p/apple-why-isn-t-there-an-northern-ireland-flag-emoji-let-s-ensure-apple-knows-we-exist - [Northern Ireland and Scotland](http://www.petitions24.com/scotland_and_northern_ireland_to_get_an_emoji ### England or St. George Cross Flag - [England](https://www.gopetition.com/petitions/england-flag-emoji.html) ???? 
US States Flags -------------------- - [Texas](https://www.change.org/p/ted-cruz-need-texas-emoji) - [Confederate States of America](http://www.ipetitions.com/petition/have-apple-make-a-confederate-flag-emoji) Natives and Aboriginals Flags ----------------------------- - [Torres Strait Islander](https://www.change.org/p/shigetaka-kurita-the-aboriginal-torres-strait-islander-flag-emojis) - [Aboriginal](https://www.change.org/p/apple-an-aboriginal-flag-emoji-needs-to-be-released) Independence Movements Transnational ------------------------------------ - [No Israel](https://www.change.org/p/tim-cook-ceo-of-apple-justify-the-addition-of-the-israeli-flag-on-the-new-ios-emoji-keyboard) - [Kurdistan](http://www.thepetitionsite.com/689/689/556/kurdish-flag-emoji/) Independence Movements Intranational ------------------------------------ - [Oromo](https://www.change.org/p/apple-i-want-to-have-an-oromo-flag-emoji) - [South Vietnam](https://www.change.org/p/apple-unicode-representation-vietnamese-heritage-and-freedom-flag-emoji) - [Aramea](https://www.change.org/p/apple-inc-we-want-apple-to-add-the-syriac-aramean-flag-in-ios) - [Sicily](http://www.thepetitionsite.com/781/027/921/making-the-sicilian-flag-an-emoji/) Equality, Diversity, Sexuality and Gender =========================================== - [Transgender flag](https://www.change.org/p/unicode-add-transgender-pride-flag-emoji-) [](https://www.change.org/p/obama-ya-mama-stop-emoji-racism) Gender Equality (New Versions of Existing Emojis) ------------------------------------------------- - [Gender equality](https://www.change.org/p/apple-gender-equality-emojis-9a6b29ca-98f3-4bb9-bbb7-14ee6a50ff65) - [Gender equality](https://www.change.org/p/emojiquality) - [Runner](http://www.ipetitions.com/petition/athletic-running-woman-emoji) - [Runner](http://www.ipetitions.com/petition/petition-for-female-runnig-emoji) - [King](http://www.ipetitions.com/petition/King-Emoji) ? 🤴
U+1F934 Prince - [](https://www.change.org/p/petici?n-para-por-el-derecho-de-las-mujeres-a-ir-solas-por-el-mundo) Femojis ------- - https://www.change.org/p/femojis-uk-2 - https://www.change.org/p/femojis-uk - https://www.change.org/p/femojis-fr - https://www.change.org/p/femojis-it Complexion and Ethnicity ------------------------ - [African-American](http://www.thepetitionsite.com/450/195/279/demand-apple-to-include-african-american-emoji/) - [African-American](http://www.ipetitions.com/petition/african-american-emojis) - [Ethnicity](http://www.ipetitions.com/petition/represent-all-ethnicities-as-emojis) - [](http://www.ipetitions.com/petition/love-doesnt-see-color-it-just-knows-how-to-mix) Faith, Religion, Belief ------------------------- - [Illuminati](http://www.ipetitions.com/petition/illuminati-emoji) - [Pentagram](http://www.ipetitions.com/petition/make-a-pentagram-emoji) ### Khanda ? ☬ U+262C (Adi Shakti; Sikh symbol) - [Khanda](https://www.change.org/p/apple-create-the-khanda-emoji) - [Khanda](http://www.ipetitions.com/petition/khandaemoji) Symbols, Signs and Icons ======================== - [Sponsored Content "#Ad"](http://corp.izea.com/emoji/) - [Planets (not symbols)](http://www.petitions24.com/planet_emojis) Hearts Emoji -------------- ### Orange Heart Emoji [ORANGE HEART](http://unicode.org/emoji/charts-beta/emoji-candidates.html#1f9e1) - [Orange heart](https://www.change.org/p/apple-make-an-orange-heart-emoji) - [Orange heart](https://www.change.org/p/apple-make-apple-and-android-create-an-orange-heart) - [Orange heart](http://www.ipetitions.com/petition/add-an-orange-heart-emoji-to-apples-emoji) - [Orange heart](http://www.ipetitions.com/petition/heart-rainbow)
Money Emoji -------------- - [Bitcoin](http://www.ipetitions.com/petition/bitcoin-emoji) # New and unsorted as of 14 Nov 2016 - https://www.change.org/p/door-hinge-memes-petition-to-have-facebook-font-makers-add-a-door-hinge-emoji-to-their-pictograph-library - https://www.change.org/p/green-balloon-emoji-for-glenn - https://www.change.org/p/apple-get-apple-to-make-a-match-emoji - https://www.change.org/p/zoe-taylor-make-the-eggplant-emoji-respectable-again - https://www.change.org/p/unicode-dat-boi-emoji-added-to-uni-code-character-set - https://www.change.org/p/facebook-add-praying-emoticon-to-facebook-like From mark at kli.org Wed Nov 23 17:59:17 2016 From: mark at kli.org (Mark E. Shoulson) Date: Wed, 23 Nov 2016 18:59:17 -0500 Subject: Manatee emoji? In-Reply-To: References: Message-ID: <9b7010bf-ccf9-69a3-4d06-64db9b8faad1@kli.org> On 11/23/2016 10:15 AM, James Kass wrote: > http://patch.com/florida/southtampa/petition-drive-aims-raise-manatee-awareness-adorable-way > > If enough people sign the petition, will Unicode add a manatee emoji? > And, how about wolverines and lemmings? Are any petitions underway > for them? How many signatures on a petition would be needed before > Unicode would consider adding a non-existent character to the > repertoire? Aren't many emoji "non-existent[sic]" characters prior to their adoption? ~mark From Shawn.Steele at microsoft.com Wed Nov 23 18:13:07 2016 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Thu, 24 Nov 2016 00:13:07 +0000 Subject: Manatee emoji? In-Reply-To: <9b7010bf-ccf9-69a3-4d06-64db9b8faad1@kli.org> References: <9b7010bf-ccf9-69a3-4d06-64db9b8faad1@kli.org> Message-ID: Short answer: not really :) Most of (particularly the initial batch) of emoji were used in other contexts before Unicode. Most notably the Japanese mobile telephone companies added them. They also differentiated between carriers by types and features of supported characters. 
That led to incompatibilities between the companies and an incentive to standardize them in Unicode. Since then, there are other systems that provide emoji outside of Unicode or other mechanisms. (Like gifs or special codes in messaging software). So they keep evolving. I'm probably skipping other ways these shapes get created that evolve into Unicode emoji. -Shawn -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Mark E. Shoulson Sent: Wednesday, November 23, 2016 3:59 PM To: unicode at unicode.org Subject: Re: Manatee emoji? On 11/23/2016 10:15 AM, James Kass wrote: > http://patch.com/florida/southtampa/petition-drive-aims-raise-manatee- > awareness-adorable-way > > If enough people sign the petition, will Unicode add a manatee emoji? > And, how about wolverines and lemmings? Are any petitions underway > for them? How many signatures on a petition would be needed before > Unicode would consider adding a non-existent character to the > repertoire? Aren't many emoji "non-existent[sic]" characters prior to their adoption? ~mark From zelpahd at gmail.com Thu Nov 24 02:23:51 2016 From: zelpahd at gmail.com (zelpa) Date: Thu, 24 Nov 2016 19:23:51 +1100 Subject: Manatee emoji? In-Reply-To: References: Message-ID: On Thu, Nov 24, 2016 at 8:30 AM, Christoph Päper < christoph.paeper at crissov.de> wrote: > James Kass : > > > > And, how about [other emoji]? Are any petitions underway for them? > > For what it's worth, several weeks ago (before UTC149), I collected all > emoji petitions I could find online (and that were in languages I can at > least somewhat decipher). I'm excluding anything moot added in or before > Unicode 9.0 and Emoji 4.0, but am including current candidate emoji in the > list below (Markdown format). In some cases, I think, it's at least as > valuable to see how many people are proposing some emoji character > independently than how many co-sign a single public petition. 
> > ### Finger Gun Emoji > > Hand with Thumb and Index Finger Extended, Pointing Sidewards > > - [Finger gun](https://www.change.org/p/all-of-those-who-support- > awkward-finger-guns-as-answers-to-all-questions- > there-needs-to-be-a-finger-guns-emoji) > - [Finger gun](https://www.change.org/p/skype-make-a-finger-guns- > emoji-on-skype) > - [Finger gun](http://www.ipetitions.com/petition/we-need-a-finger- > guns-emoji) > - [Finger gun](http://www.petitions24.com/apple_give_us_a_finger_gun_emoji > ) > Wow I don't think I realised how much I've wanted a finger gun emoji until this point. Might consider writing up a proper proposal for it. -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Thu Nov 24 05:59:20 2016 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Thu, 24 Nov 2016 11:59:20 +0000 (GMT) Subject: Manatee emoji? In-Reply-To: References: Message-ID: <6371660.22581.1479988760419.JavaMail.defaultUser@defaultHost> Leonardo Boiko wrote: > I support the creation of manatee emoji, but only if it?s accompanied by a new modifier for emoji size, coming in the varieties: TINY, SMALL, LARGE, HUGE. > This would allow us to say "oh, the [HUGE MANATEE]" in emoji. I have produced some designs for tiny, small, large and huge and also for medium size. The designs and some notes about how I produced them are in the following web page. http://www.users.globalnet.co.uk/~ngo/abstract_emoji.htm The web page is listed in the form of a diary and the designs are within some text headed Thursday 24 November 2016. I have attached the designs to this post as well so that they will be conserved in the archive. The design with the most purple in the upper right quadrant is for the adjective huge and the design with the most purple in the lower right quadrant is for the adjective tiny. In use, the emoji of the adjective would be after the emoji of the noun in a piece of text. 
William Overington Thursday 24 November 2016 -------------- next part -------------- A non-text attachment was scrubbed... Name: emoji_tiny.png Type: image/png Size: 3032 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: emoji_small.png Type: image/png Size: 3029 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: emoji_medium_size.png Type: image/png Size: 3064 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: emoji_large.png Type: image/png Size: 3065 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: emoji_huge.png Type: image/png Size: 3094 bytes Desc: not available URL: From richard.wordingham at ntlworld.com Thu Nov 24 23:49:51 2016 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 25 Nov 2016 05:49:51 +0000 Subject: Bidi: inserting Japanese paragraphs in Arabic/Farsi document In-Reply-To: References: <7be20863-8c28-221f-d240-0cc5e9531352@simon-cozens.org> <83shqm9nkw.fsf@gnu.org> <83h9729kfw.fsf@gnu.org> <834m329fqf.fsf@gnu.org> <31c61d2a-8911-9503-139b-9497137e2dff@ix.netcom.com> Message-ID: <20161125054951.6a4825b6@JRWUBU2> On Tue, 22 Nov 2016 02:47:10 +0100 Philippe Verdy wrote: > Look at where the Asian quotes are partially "moved" by the ASCII > quotes in Chrome. I presume this is referring to the attached file 00000007.fhmbobjniphfamjk.png. There are two problems with using this example. (1) The closing curved quote U+201D appears to have gone missing. (2) The paragraph is a LTR paragraph. Remember that the overall directionality of a paragraph can be determined by a "higher level protocol" rather than the content. In the text shown in the attachment, a higher level protocol is specifying LTR - the leftmost text is 'ARABIC-ONE'. Richard. 
-------------- next part -------------- A non-text attachment was scrubbed... Name: 00000007.fhmbobjniphfamjk.png Type: image/png Size: 1707 bytes Desc: not available URL: From jsbien at mimuw.edu.pl Fri Nov 25 08:38:44 2016 From: jsbien at mimuw.edu.pl (Janusz S. Bień) Date: Fri, 25 Nov 2016 15:38:44 +0100 Subject: The usage of Z WITH STROKE Message-ID: <86wpfrzksr.fsf@mimuw.edu.pl> Hi! There are two comments to the character(s) in the U0180 chart: 1. Pan-Turkic Latin orthography 2. handwritten variant of Latin “z” Ad 1. Do I understand correctly that the Pan-Turkic Latin orthography refers to the initiative described in the post to the Linguist list: https://linguistlist.org/issues/4/4-187.html If so, where can I find more information about it? I have already found another post to the Linguist list https://linguistlist.org/issues/5/5-739.html but it contains only very general information. Ad 2. I'm curious how widespread, in time and space, is/was this convention. Can you suggest to me where to search for this information? Best regards Janusz -- , Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ From jknappen at web.de Fri Nov 25 09:05:50 2016 From: jknappen at web.de ("Jörg Knappen") Date: Fri, 25 Nov 2016 16:05:50 +0100 Subject: Aw: The usage of Z WITH STROKE In-Reply-To: <86wpfrzksr.fsf@mimuw.edu.pl> References: <86wpfrzksr.fsf@mimuw.edu.pl> Message-ID: An HTML attachment was scrubbed... URL: From frederic.grosshans at gmail.com Fri Nov 25 10:18:37 2016 From: frederic.grosshans at gmail.com (Frédéric Grosshans) Date: Fri, 25 Nov 2016 17:18:37 +0100 Subject: The usage of Z WITH STROKE In-Reply-To: <86wpfrzksr.fsf@mimuw.edu.pl> References: <86wpfrzksr.fsf@mimuw.edu.pl> Message-ID: On 25/11/2016 at 15:38, Janusz S. Bień 
wrote: > Hi! > > There are two comments to the character(s) in the U0180 chart: > > 1. Pan-Turkic Latin orthography > 2. handwritten variant of Latin “z” > > Ad 1. > > Do I understand correctly that the Pan-Turkic Latin orthography > refers to the initiative described in the post to the Linguist list: > > https://linguistlist.org/issues/4/4-187.html > > If so, where can I find more information about it? I have already found another > post to the Linguist list > > https://linguistlist.org/issues/5/5-739.html > > but it contains only very general information. The use of Latin (vs Arabic or Cyrillic) alphabets in Turkic languages has been a heavily political subject for the whole 20th century. You can find a lot of information on the pre-1991 situation in Mark Dickens' article “Soviet Language Policy in Central Asia” http://www.oxuscom.com/lang-policy.htm#alphabet . The end of the USSR in 1991 was the occasion for new reforms, but some were cancelled, as for Tatar, since the only official alphabet allowed in Russia is Cyrillic (see https://en.wikipedia.org/wiki/Tatar_alphabet). However, the modern (1990s) Turkic alphabets do not contain ƶ https://en.wikipedia.org/wiki/Common_Turkic_Alphabet . It was used for what is now written with j in the 1930s USSR's uniform Turkic alphabet, aka Jaꞑalif https://en.wikipedia.org/wiki/Ya%C3%B1alif. The Wikipedia pages on the Azerbaijani, Turkmen, Crimean Tatar and Uzbek alphabets mention this historical use https://en.wikipedia.org/wiki/Azerbaijani_alphabet , https://en.wikipedia.org/wiki/Turkmen_alphabet , https://en.wikipedia.org/wiki/Crimean_Tatar_alphabet , https://en.wikipedia.org/wiki/Uzbek_alphabet . This letter was also used for other orthographies: the 1931–41 Latin Mongolian orthography (https://en.wikipedia.org/wiki/Mongolian_Latin_alphabet), and a 1992 Latin orthography used by secessionist Chechens. > > Ad 2. > > I'm curious how widespread, in time and space, is/was this > convention. 
Can you suggest to me where to search for this information? I was told in elementary (French) school to write Z this way. I guess you should look at elementary schoolbooks for various languages, or, since it's a handwritten convention, at references about calligraphy and/or paleography. From verdy_p at wanadoo.fr Fri Nov 25 10:35:53 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 25 Nov 2016 17:35:53 +0100 Subject: The usage of Z WITH STROKE In-Reply-To: References: <86wpfrzksr.fsf@mimuw.edu.pl> Message-ID: And the cursive form of uppercase Z also has a stroke to distinguish it from the cursive form of uppercase L... So this is not just for maths. 2016-11-25 16:05 GMT+01:00 "Jörg Knappen" : > Some anecdotal evidence: > > I was taught by my math teacher (Germany, 1970s) to stroke all z's (upper > or lowercase) in order to > distinguish them from the digit "2" > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Fri Nov 25 16:34:07 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 25 Nov 2016 23:34:07 +0100 Subject: Bidi: inserting Japanese paragraphs in Arabic/Farsi document In-Reply-To: <20161125054951.6a4825b6@JRWUBU2> References: <7be20863-8c28-221f-d240-0cc5e9531352@simon-cozens.org> <83shqm9nkw.fsf@gnu.org> <83h9729kfw.fsf@gnu.org> <834m329fqf.fsf@gnu.org> <31c61d2a-8911-9503-139b-9497137e2dff@ix.netcom.com> <20161125054951.6a4825b6@JRWUBU2> Message-ID: Initially my thread was really about Japanese in Arabic documents (or Arabic paragraph), where Asian quotation marks were swapped (but not mirrored), and where other Arabic contents had their own quotation marks misplaced. The result was unreadable, including pairs of Arabic quotes with empty content. The Japanese citation was broken, as well as the overall Arabic one containing it. 
And once again you're testing it in Firefox (which apparently uses its own higher protocol): I said the problem occured in Chrome (which apparently still does not use the updated Bidi algorithm). This also brings a question about Asian quotes, that are not mirrorable but still swapped by Bidi ! If they are not mirrorable, they should have a strong LTR direction (like other kana or kanji characters). 2016-11-25 6:49 GMT+01:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > On Tue, 22 Nov 2016 02:47:10 +0100 > Philippe Verdy wrote: > > > Look at where the Asian quotes are partially "moved" by the ASCII > > quotes in Chrome. > > I presume this is referring to the attached file > 00000007.fhmbobjniphfamjk.png. There are two problems with using this > example. > (1) The closing curved quote U+201D appears to have gone missing. > (2) The paragraph is a LTR paragraph. > > Remember that the overall directionality of a paragraph can be determined > by a "higher level protocol" rather than the content. In the text shown in > the attachment, a higher level protocol is specifying LTR - the leftmost > text is 'ARABIC-ONE'. > > Richard. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jsbien at mimuw.edu.pl Sat Nov 26 00:20:55 2016 From: jsbien at mimuw.edu.pl (Janusz S. =?utf-8?Q?Bie=C5=84?=) Date: Sat, 26 Nov 2016 07:20:55 +0100 Subject: The usage of Z WITH STROKE In-Reply-To: <86wpfrzksr.fsf@mimuw.edu.pl> ("Janusz S. =?utf-8?Q?Bie=C5=84?= =?utf-8?Q?=22's?= message of "Fri, 25 Nov 2016 15:38:44 +0100") References: <86wpfrzksr.fsf@mimuw.edu.pl> Message-ID: <86polizrqw.fsf@mimuw.edu.pl> Thanks for all the interesting asnwers. I will focus now on my first question. On Fri, Nov 25 2016 at 15:38 CET, jsbien at mimuw.edu.pl writes: > Hi! > > There are two comments to the character(s) in the U0180 chart: > > 1. Pan-Turkic Latin orthography > 2. handwritten variant of Latin ?z? > > Ad 1. 
> > Do I understand correctly that the Pan-Turkic Latin ortography > refers to the initiative described in the post to the Linguist list: > > https://linguistlist.org/issues/4/4-187.html [...] The initiative was made in March 1993, the character appeared already in Unicode 1.1.0 in June 1993. Do you think it is possible and/or probable that the comment refers to the very initiative? On Fri, Nov 25 2016 at 16:05 CET, jknappen at web.de writes: [...] > P.S. What pan-turkic orthography is concerned, there were also a lot > of pan-turkic Latin alphabets in revolutionary > Soviet Union (1920s) before Cyrillic alphabets were introduced in the > Stalin era. > P.P.S. You are certainly aware of this article: > https://en.wikipedia.org/wiki/Z_with_stroke On Fri, Nov 25 2016 at 17:18 CET, frederic.grosshans at gmail.com writes: > The use of Latin (vs Arabic or Cyrillic) alphabets in Turkic > languages has been a heavily political subject for the whole 20th > century. You can find a lots of information of the pre-1991 situation > in Mark Dickens? article ?Soviet Language Policy in Central Asia? > http://www.oxuscom.com/lang-policy.htm#alphabet . The end of USSR in > 1991 was the occasion of new reform, but some were cancelled, like for > Tatar, since the only official alphabet allowed in Russia is Cyrillic > (see https://en.wikipedia.org/wiki/Tatar_alphabet). > > However, the modern (1990?s) turkic alphabets do not contain ? > https://en.wikipedia.org/wiki/Common_Turkic_Alphabet . It was used for > waht is know written with j in the 1930?s USSR?s uniform Turkic > alphabet aka Ja?alif https://en.wikipedia.org/wiki/Ya%C3%B1alif. > The Wikipedia pages of Azerbaijani, Turkman, Crieman Tatar anad Usbek > alphabets mention this historical use > https://en.wikipedia.org/wiki/Azerbaijani_alphabet , > https://en.wikipedia.org/wiki/Turkmen_alphabet , > https://en.wikipedia.org/wiki/Crimean_Tatar_alphabet , > https://en.wikipedia.org/wiki/Uzbek_alphabet . 
> > This letter was also used for other orthographies : The 1931?41 Latin > Mongolian orthography > (https://en.wikipedia.org/wiki/Mongolian_Latin_alphabet), and a 1992 > Latin orthography used by secessionist Chechens Thanks for all the information and the links (I was familiar with some of them, but not all). Now there is a follow-up question: why the character was included in Unicode 1.1.0? And there are also two other related questions: 1. Is there an easy way to check whether the character existed already in pre-Unicode character sets? I'm aware about a difficult way, i.e. browsing International Register of Coded Character Sets to be Used with Escape Sequences. 2. Which characters codes were included in the Unicode round-trip test? Was the list ever published somewhere? There used to be available the files containing mappings from some legacy codes to Unicode, I can't find them now. Perhaps the mappings where prepared just for the round-trip codes? Best regards Janusz -- , Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ From eliz at gnu.org Sat Nov 26 01:10:14 2016 From: eliz at gnu.org (Eli Zaretskii) Date: Sat, 26 Nov 2016 09:10:14 +0200 Subject: Bidi: inserting Japanese paragraphs in Arabic/Farsi document In-Reply-To: (message from Philippe Verdy on Fri, 25 Nov 2016 23:34:07 +0100) References: <7be20863-8c28-221f-d240-0cc5e9531352@simon-cozens.org> <83shqm9nkw.fsf@gnu.org> <83h9729kfw.fsf@gnu.org> <834m329fqf.fsf@gnu.org> <31c61d2a-8911-9503-139b-9497137e2dff@ix.netcom.com> <20161125054951.6a4825b6@JRWUBU2> Message-ID: <83poli3eeh.fsf@gnu.org> > From: Philippe Verdy > Date: Fri, 25 Nov 2016 23:34:07 +0100 > Cc: unicode Unicode Discussion > > This also brings a question about Asian quotes, that are not mirrorable but still swapped by Bidi ! 
If they are > not mirrorable, they should have a strong LTR direction (like other kana or kanji characters). That's not how this stuff works in RTL locales. It works by changing the character produced by the keyboard keys assigned to these characters, when the keyboard is configured for an RTL language. E.g., a key labeled ? should produce ? when the current language is RTL. That's how this works with mirrored characters as well, because when you type in an RTL language, you will press ) when you want an opening parenthesis, since that's what you expect to see on display. With a suitably configured keyboard (or input method, for that matter), the problem you mention doesn't exist, and therefore there's no relation between whether characters are swapped and whether they are mirrored. From verdy_p at wanadoo.fr Sat Nov 26 02:25:16 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 26 Nov 2016 09:25:16 +0100 Subject: Bidi: inserting Japanese paragraphs in Arabic/Farsi document In-Reply-To: <83poli3eeh.fsf@gnu.org> References: <7be20863-8c28-221f-d240-0cc5e9531352@simon-cozens.org> <83shqm9nkw.fsf@gnu.org> <83h9729kfw.fsf@gnu.org> <834m329fqf.fsf@gnu.org> <31c61d2a-8911-9503-139b-9497137e2dff@ix.netcom.com> <20161125054951.6a4825b6@JRWUBU2> <83poli3eeh.fsf@gnu.org> Message-ID: No, I was speaking at the encoding level. Even if your Arabic keyboard displays a ")", and you type it, it will output/encode an open parenthesis "(", that will then be mirrored to display a ")" glyph, matching your key input. The Bidi algorithm will still render it RTL (i.e. it will reorder it/"swap it" so that it will render to the right of Arabic characters entered after it. That encoded open parenthesis character is then both reordered and rendered mirrored. However with Asian parentheses in this context, they are also reordered... 
but not mirrored when in fact they should be treated as strong LTR, and not reordered (and not mirrored at all) For Asian parentheses this is less a problem (you do not see the difference if the two parentheses are already symetric) than with Asian square-angle quotation marks: the effect of the absence of mirroring when swapping them becomes evidently wrong: but they are still reordered ("swapped" visually) as if they were Bidi-neutral, but as they are not symetric and not mirrored, they are oriented the wrong way. 2016-11-26 8:10 GMT+01:00 Eli Zaretskii : > > From: Philippe Verdy > > Date: Fri, 25 Nov 2016 23:34:07 +0100 > > Cc: unicode Unicode Discussion > > > > This also brings a question about Asian quotes, that are not mirrorable > but still swapped by Bidi ! If they are > > not mirrorable, they should have a strong LTR direction (like other kana > or kanji characters). > > That's not how this stuff works in RTL locales. It works by changing > the character produced by the keyboard keys assigned to these > characters, when the keyboard is configured for an RTL language. > E.g., a key labeled ? should produce ? when the current language is > RTL. That's how this works with mirrored characters as well, because > when you type in an RTL language, you will press ) when you want an > opening parenthesis, since that's what you expect to see on display. > > With a suitably configured keyboard (or input method, for that > matter), the problem you mention doesn't exist, and therefore there's > no relation between whether characters are swapped and whether they > are mirrored. > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From eliz at gnu.org Sat Nov 26 02:57:29 2016 From: eliz at gnu.org (Eli Zaretskii) Date: Sat, 26 Nov 2016 10:57:29 +0200 Subject: Bidi: inserting Japanese paragraphs in Arabic/Farsi document In-Reply-To: (message from Philippe Verdy on Sat, 26 Nov 2016 09:25:16 +0100) References: <7be20863-8c28-221f-d240-0cc5e9531352@simon-cozens.org> <83shqm9nkw.fsf@gnu.org> <83h9729kfw.fsf@gnu.org> <834m329fqf.fsf@gnu.org> <31c61d2a-8911-9503-139b-9497137e2dff@ix.netcom.com> <20161125054951.6a4825b6@JRWUBU2> <83poli3eeh.fsf@gnu.org> Message-ID: <837f7q39fq.fsf@gnu.org> > From: Philippe Verdy > Date: Sat, 26 Nov 2016 09:25:16 +0100 > Cc: Richard Wordingham , > unicode Unicode Discussion > > No, I was speaking at the encoding level. Even if your Arabic keyboard displays a ")", and you type it, it will > output/encode an open parenthesis "(", that will then be mirrored to display a ")" glyph, matching your key > input. Yes. > The Bidi algorithm will still render it RTL (i.e. it will reorder it/"swap it" so that it will render to the right of Arabic > characters entered after it. That encoded open parenthesis character is then both reordered and rendered > mirrored. > However with Asian parentheses in this context, they are also reordered... but not mirrored when in fact they > should be treated as strong LTR, and not reordered (and not mirrored at all) You were originally talking about quotes, not parentheses. Which one is it? I responded to the quotes issue. > For Asian parentheses this is less a problem (you do not see the difference if the two parentheses are already > symetric) than with Asian square-angle quotation marks: the effect of the absence of mirroring when > swapping them becomes evidently wrong: but they are still reordered ("swapped" visually) as if they were > Bidi-neutral, but as they are not symetric and not mirrored, they are oriented the wrong way. They will be effectively "mirrored" by the keyboard, as I described. 
From richard.wordingham at ntlworld.com Sun Nov 27 08:09:17 2016 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 27 Nov 2016 14:09:17 +0000 Subject: Bidi: inserting Japanese paragraphs in Arabic/Farsi document In-Reply-To: <837f7q39fq.fsf@gnu.org> References: <83shqm9nkw.fsf@gnu.org> <83h9729kfw.fsf@gnu.org> <834m329fqf.fsf@gnu.org> <31c61d2a-8911-9503-139b-9497137e2dff@ix.netcom.com> <20161125054951.6a4825b6@JRWUBU2> <83poli3eeh.fsf@gnu.org> <837f7q39fq.fsf@gnu.org> Message-ID: <20161127140917.5cee547d@JRWUBU2> On Sat, 26 Nov 2016 10:57:29 +0200 Eli Zaretskii wrote: > > From: Philippe Verdy > > Date: Sat, 26 Nov 2016 09:25:16 +0100 > > For Asian parentheses this is less a problem (you do not see the > > difference if the two parentheses are already symetric) than with > > Asian square-angle quotation marks: the effect of the absence of > > mirroring when swapping them becomes evidently wrong: but they are > > still reordered ("swapped" visually) as if they were Bidi-neutral, > > but as they are not symetric and not mirrored, they are oriented > > the wrong way. They (U+300C LEFT CORNER BRACKET and U+300D RIGHT CORNER BRACKET) are bidi-neutral (bidi class ON) and have bidi-mirroring, as you should see from the nonsense string ب「ةذظظ」ة (0628 300C 0629 0630 0638 0638 300D 0629), whichever the paragraph-level embedding. Whether a top-left corner (「) should be mirrored to a bottom-right corner (」) is a matter of taste, which will probably not bother those who think that bidi-mirroring is a matter of character substitution. They are listed as a pair in both BidiBrackets.txt and BidiMirroring.txt. > They will be effectively "mirrored" by the keyboard, as I described. Except that a visual keyboard for an RTL writing system is highly unlikely to have U+300C and U+300D. Richard. 
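The property values cited in the message above can be checked against the Unicode Character Database. What follows is only a minimal sketch, assuming Python's standard `unicodedata` module, whose data reflects whichever Unicode version the interpreter ships with. The names `bidi_profile` and `SAMPLES` are illustrative and do not come from the thread. The sketch reports character properties only; it does not run the bidirectional algorithm, so it says nothing about how a given browser reorders or mirrors the text.

```python
import unicodedata


def bidi_profile(ch):
    """Return (name, Bidi_Class, Bidi_Mirrored) for a single character."""
    return (
        unicodedata.name(ch),
        unicodedata.bidirectional(ch),
        bool(unicodedata.mirrored(ch)),
    )


# Illustrative sample: an Arabic letter, the two corner brackets under
# discussion, a hiragana letter, an ASCII parenthesis, and the curved
# closing quote U+201D mentioned earlier in the thread.
SAMPLES = "\u0628\u300C\u300D\u3042(\u201D"

if __name__ == "__main__":
    for ch in SAMPLES:
        name, bidi_class, mirrored = bidi_profile(ch)
        print(f"U+{ord(ch):04X}  {bidi_class:<3} mirrored={mirrored!s:<5} {name}")
```

If the shipped data agrees with the message, U+300C and U+300D should report class ON with the mirrored flag set, unlike kana (class L) and unlike U+201D (class ON but not mirrored).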
From verdy_p at wanadoo.fr Sun Nov 27 10:33:12 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 27 Nov 2016 17:33:12 +0100 Subject: Bidi: inserting Japanese paragraphs in Arabic/Farsi document In-Reply-To: <20161127140917.5cee547d@JRWUBU2> References: <83shqm9nkw.fsf@gnu.org> <83h9729kfw.fsf@gnu.org> <834m329fqf.fsf@gnu.org> <31c61d2a-8911-9503-139b-9497137e2dff@ix.netcom.com> <20161125054951.6a4825b6@JRWUBU2> <83poli3eeh.fsf@gnu.org> <837f7q39fq.fsf@gnu.org> <20161127140917.5cee547d@JRWUBU2> Message-ID: 2016-11-27 15:09 GMT+01:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > > > They will be effectively "mirrored" by the keyboard, as I described. > > Except that a visual keyboard for an RTL writing system is highly unlikely > to > have U+300C and U+300D. > I spoke about multilingual documents where you'll mix Japanese into Arabic (or the reverse). The keyboard capability does not matter at all because a "keyboard for an RTL writing system" will also not support any one of the characters needed for a Japanese citation (this is not just these two punctuation characters). -------------- next part -------------- An HTML attachment was scrubbed... URL: From kojiishi at gmail.com Mon Nov 28 07:45:36 2016 From: kojiishi at gmail.com (Koji Ishii) Date: Mon, 28 Nov 2016 22:45:36 +0900 Subject: Bidi: inserting Japanese paragraphs in Arabic/Farsi document In-Reply-To: References: <83shqm9nkw.fsf@gnu.org> <83h9729kfw.fsf@gnu.org> <834m329fqf.fsf@gnu.org> <31c61d2a-8911-9503-139b-9497137e2dff@ix.netcom.com> <20161125054951.6a4825b6@JRWUBU2> <83poli3eeh.fsf@gnu.org> <837f7q39fq.fsf@gnu.org> <20161127140917.5cee547d@JRWUBU2> Message-ID: Hi, I work on Chrome. I have to acknowledge that our implementation of UBA 6.3 is not complete yet, nor is our support for paired brackets. We're working on improving it. 
It's not very clear to me whether this thread is discussing on paired brackets (BD14-16) or mirrored glyphs (L4), I tried to reproduce but the steps and expectations are not very clear to me, since my understanding on UBA is still not high enough. But as long as it's an implementation issue, we're happy to investigate further. It'd be great if you could provide reproducing HTML at . /koji 2016-11-28 1:33 GMT+09:00 Philippe Verdy : > > 2016-11-27 15:09 GMT+01:00 Richard Wordingham < > richard.wordingham at ntlworld.com>: > >> >> > They will be effectively "mirrored" by the keyboard, as I described. >> >> Except that a visual keyboard for an RTL writing system is highly >> unlikely to >> have U+300C and U+300D. >> > > I spoke about multilingula documents where you'll mix Japanese into Arabic > (or the reverse). > > The keyboard capability does not matter at all because a "keyboard for an > RTL writing system" will also not support any one of the characters needed > for a Japanese citation (this is not just these two punctuation characters). > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenwhistler at att.net Mon Nov 28 09:48:44 2016 From: kenwhistler at att.net (Ken Whistler) Date: Mon, 28 Nov 2016 07:48:44 -0800 Subject: The usage of Z WITH STROKE In-Reply-To: <86polizrqw.fsf@mimuw.edu.pl> References: <86wpfrzksr.fsf@mimuw.edu.pl> <86polizrqw.fsf@mimuw.edu.pl> Message-ID: On 11/25/2016 10:20 PM, Janusz S. Bie? wrote: > Now there is a follow-up question: why the character was included in > Unicode 1.1.0? Well, it was included in Unicode 1.1 because it was published in Unicode 1.0 already. So that is the proximate reason. That inevitably will raise the question, "Why was it included in Unicode 1.0?" Well, the proximate cause for that was the presence of z with stroke in the XCCS character set, which was the source for a lot of the early Unicode 1.0 repertoire. 
More precisely: XCCS (= Xerox Character Code Standard) 1990 contained: 0x23 0x48 Azerbaijani capital letter Z 0x23 0x68 Azerbaijani small letter Z So that also answers the next question, "Why was it included in XCCS?" Note that XCCS 1990 is the 2.0 version. The 1.0 version of XCCS was dated 1980. I don't have access to that one, so cannot tell for sure whether it contained the "character set 43_8 " content (i.e. the 0x23 .. character block) or not. At any rate, see here: https://en.wikipedia.org/wiki/Azerbaijani_alphabet The additions from the XCCS "character set 43_8 " included the schwa, the gha, and the z-stroke from the old Azerbaijani Latin alphabet, documented there as in use from 1929 until 1939. And from XCCS, all of them made it into Unicode 1.0. So that should pretty definitively answer the origin question for z with stroke. > And there are also two other related questions: > > 1. Is there an easy way to check whether the character existed already > in pre-Unicode character sets? I'm aware about a difficult way, > i.e. browsing International Register of Coded Character Sets to be Used > with Escape Sequences. The International Register is *not* a particularly fruitful source. Much more of the Unicode 1.0 material actually came from corporate sets, including, but not limited to XCCS and the large collection of IBM code pages. > > 2. Which characters codes were included in the Unicode round-trip test? > Was the list ever published somewhere? There used to be available the > files containing mappings from some legacy codes to Unicode, I can't > find them now. Perhaps the mappings where prepared just for the > round-trip codes? Currently maintained mappings (and some historic materials) are posted at: http://www.unicode.org/Public/MAPPINGS/ For the really old mapping pertinent to the original decisions about inclusion in Unicode 1.0, the mapping data for East Asian were distributed in a 3.5" floppy diskette on request. 
Probably very hard to locate (or read) one of those now. But you can refer to the *scanned* version of Chapter 6 of Unicode 1.0, which is available online. That was a printed copy of many of the cross-mapping tables to external standards. See: http://www.unicode.org/versions/Unicode1.0.0/ch06.pdf For the cross-mapping of the Unicode 1.0, Volume 2 unified CJK, that is also scanned and available online: http://www.unicode.org/versions/Unicode1.0.0/HanCharts2.pdf That table is known to have errors in it, so for CJK it should not be considered currently definitive in any meaningful way -- it is of historic interest. --Ken From asmusf at ix.netcom.com Mon Nov 28 10:30:22 2016 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Mon, 28 Nov 2016 08:30:22 -0800 Subject: Manatee emoji? In-Reply-To: <20161123124458.665a7a7059d7ee80bb4d670165c8327d.7ac8a1b9e0.wbe@email03.godaddy.com> References: <20161123124458.665a7a7059d7ee80bb4d670165c8327d.7ac8a1b9e0.wbe@email03.godaddy.com> Message-ID: <1f211e28-4dc2-02c5-bf8d-732197dfecf1@ix.netcom.com> On 11/23/2016 11:44 AM, Doug Ewell wrote: > Leonardo Boiko wrote: > >> I support the creation of manatee emoji, but only if it's accompanied >> by a new modifier for emoji size, coming in the varieties: TINY, >> SMALL, LARGE, HUGE. >> >> This would allow us to say "oh, the [HUGE MANATEE]" in emoji. > Leonardo immediately wins the award for best sort-of-Unicode-related pun > ever. Just retire the trophy now. > > But I am expecting a full array of modifiers and ZWJ sequences, to meet > the user need for a female factory-worker manatee with dark skin and red > hair, or families of manatees with arbitrary combinations of attributes. > > Manatee families are where it's at. A./ From asmusf at ix.netcom.com Mon Nov 28 10:32:55 2016 From: asmusf at ix.netcom.com (Asmus Freytag (c)) Date: Mon, 28 Nov 2016 08:32:55 -0800 Subject: Manatee emoji?
In-Reply-To: <20161123124458.665a7a7059d7ee80bb4d670165c8327d.7ac8a1b9e0.wbe@email03.godaddy.com> References: <20161123124458.665a7a7059d7ee80bb4d670165c8327d.7ac8a1b9e0.wbe@email03.godaddy.com> Message-ID: On 11/23/2016 11:44 AM, Doug Ewell wrote: > Leonardo Boiko wrote: > >> I support the creation of manatee emoji, but only if it's accompanied >> by a new modifier for emoji size, coming in the varieties: TINY, >> SMALL, LARGE, HUGE. >> >> This would allow us to say "oh, the [HUGE MANATEE]" in emoji. > Leonardo immediately wins the award for best sort-of-Unicode-related pun > ever. Just retire the trophy now. > > But I am expecting a full array of modifiers and ZWJ sequences, to meet > the user need for a female factory-worker manatee with dark skin and red > hair, or families of manatees with arbitrary combinations of attributes. > > Manatee families are where it's at. A./ PS: "experts agree that the manatee is more developed than any other marine mammal in the world" (from: http://www.manatee-world.com/manatee-social-structure/) From jsbien at mimuw.edu.pl Tue Nov 29 05:57:32 2016 From: jsbien at mimuw.edu.pl (Janusz S. Bień) Date: Tue, 29 Nov 2016 12:57:32 +0100 Subject: The usage of Z WITH STROKE In-Reply-To: <86wpfrzksr.fsf@mimuw.edu.pl> ("Janusz S. Bień"'s message of "Fri, 25 Nov 2016 15:38:44 +0100") References: <86wpfrzksr.fsf@mimuw.edu.pl> Message-ID: <86zikifqhf.fsf@mimuw.edu.pl> On Fri, Nov 25 2016 at 15:38 CET, jsbien at mimuw.edu.pl writes: > Hi! > > There are two comments to the character(s) in the U0180 chart: > > 1. Pan-Turkic Latin orthography [...] On Mon, Nov 28 2016 at 16:48 CET, kenwhistler at att.net writes: > On 11/25/2016 10:20 PM, Janusz S. Bień wrote: > > Now there is a follow-up question: why the character was included in > Unicode 1.1.0? Thank you very much for the detailed answer! [...]
> Well, the proximate cause for that was the presence of z with stroke > in the XCCS character set, which was the source for a lot of the early > Unicode 1.0 repertoire. More precisely: > > XCCS (= Xerox Character Code Standard) 1990 contained: > > 0x23 0x48 Azerbaijani capital letter Z > 0x23 0x68 Azerbaijani small letter Z [...] > https://en.wikipedia.org/wiki/Azerbaijani_alphabet So "Pan-Turkic Latin orthography" in the comment should be understood as the Uniform Turkic Alphabet (https://en.wikipedia.org/wiki/Common_Turkic_Alphabet#In_the_USSR)? Best regards Janusz -- , Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
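On the earlier question in this thread about checking whether a character already existed in a pre-Unicode character set: one possible approach is to scan the mapping files posted under http://www.unicode.org/Public/MAPPINGS/ for the code point of interest. The following is only a rough sketch, not an official tool. It assumes the common layout of those files (legacy code in hex, Unicode code point in hex, then a "#" comment); the SAMPLE text is a hand-typed excerpt in that style rather than a downloaded file, and the helper names parse_mapping, legacy_codes_for and round_trips are made up for illustration.

```python
import unicodedata

# Hand-typed excerpt in the style of a MAPPINGS file (assumption: this is
# illustrative data, not a copy of any official file).
SAMPLE = """\
# legacy code <TAB> Unicode code point <TAB> # character name
0xA1\t0x0104\t#\tLATIN CAPITAL LETTER A WITH OGONEK
0xA3\t0x0141\t#\tLATIN CAPITAL LETTER L WITH STROKE
0xAF\t0x017B\t#\tLATIN CAPITAL LETTER Z WITH DOT ABOVE
"""


def parse_mapping(text):
    """Return a dict {legacy code: Unicode code point} from mapping-file text."""
    table = {}
    for line in text.splitlines():
        # Drop the trailing comment, then surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # a legacy code listed without a Unicode mapping
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table


def legacy_codes_for(table, code_point):
    """List the legacy codes that map to the given Unicode code point."""
    return sorted(legacy for legacy, cp in table.items() if cp == code_point)


def round_trips(table):
    """True if no two legacy codes share a Unicode code point."""
    return len(set(table.values())) == len(table)


table = parse_mapping(SAMPLE)
for cp in (0x0141, 0x01B5):
    codes = [hex(c) for c in legacy_codes_for(table, cp)]
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}: {codes or 'not in this set'}")
```

Run against a real file from the MAPPINGS directory, the same loop would show whether a given legacy set had a code for U+01B5; XCCS itself may not be among the posted mappings, so a negative result there would not contradict the XCCS origin described above.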