From charupdate at orange.fr Mon Jan 2 12:19:04 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Mon, 2 Jan 2017 19:19:04 +0100 (CET)
Subject: Marking up hexadecimal numbers (was: Re: a character for an unknown character)
In-Reply-To: <551789226.12010.1483218242564.JavaMail.www@wwinf1p10>
References: <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi> <20161227000522.7bb95f3e@JRWUBU2> <230214984.10599.1482854625210.JavaMail.www@wwinf1p10> <20161229014759.5a51c747@JRWUBU2> <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12> <20161230123727.52a00633@JRWUBU2> <5CF85CCB-4321-4620-BE57-C7914C84BCC7@crissov.de> <551789226.12010.1483218242564.JavaMail.www@wwinf1p10>
Message-ID: <2114913271.21841.1483381144279.JavaMail.www@wwinf1p19>

On Sat, 31 Dec 2016 22:04:02 +0100 (CET), I wrote:
> On Sat, 31 Dec 2016 11:01:16 +0100, Christoph Päper wrote:
> >
> > Richard Wordingham :
> > >
> > >> Perhaps the letters for hexadecimal digits should have been encoded
> > >> separately?
> > >
> > > The idea has been rejected several times.
> >
> > It has indeed. That’s why two different technologies have to be used to get
> > typographically harmonic hexadecimal numbers, e.g. in CSS:
> >
> > .hex {font-variant-numeric: oldstyle-nums; text-transform: lowercase;}
> > .hex {font-variant-numeric: lining-nums; text-transform: uppercase;}
> >
> > This works well enough for “01ef” or “01EF”, but will fail for conventions like
> > “0x01ef” and “01EFh”. Hence:
> >
> > .hex::before {content: "0x"; text-transform: none;}
> > .hex::after {content: "h"; text-transform: none;}
> > .hex::after {content: "?";}
> > .hex::after {content: "16"; vertical-align: sub; font-size: smaller; line-height: normal;}
> > .hex::after {content: "16"; font-variant-position: sub;}
> > .hex::after {content: "₁₆";}
>
> Thank you for the code. I didn’t know this, so I’ve tried and found that
> the automatic prefixes/suffixes cannot be copied from the web page.
> That seems to me a disadvantage.
>
> Among the possibilities, you include Unicode subscripts. Is this current
> practice? That seems to me very interesting to follow up, as it documents
> that the stable representation scheme is already adopted. I’m curious to
> what extent it is so.
> […]
>
> I note that the "U+" prefix is missing in the list, obviously because it
> denotes more than just a hexadecimal number, and is to be hard-coded.

[…]

Alternatively, the CSS style derived from the above could be:

.unicode {font-variant-numeric: lining-nums; text-transform: uppercase;}
.unicode::before {content: "U+"; text-transform: none;}

But again, when the reader copies such a scalar value, he gets it without 'U+'. Hence the idea that the '[[H]H]HHHH' could be parsed to add the prefix after the open-tag, so as to be able to skip the second line above. Similarly, the 'HHHH' can be complemented with '₁₆', or with '0x' or '\x' or whatever, as hard-coded additions by a parser. This has IMO two advantages:

1) When the user copies hex numbers from the browser, hex numbers stay prefixed or suffixed as such.
2) When the user pastes hex numbers into a text editor, they’re not messed up (applies to the '₁₆' suffix, vs '_{16}' suffix). Otherwise, a hex number like '1A19₁₆' is turned to '1A1916'.

The current policy is certainly based on the classification of hexadecimal numbers (and numbers in other non-decimal numeral systems) as mathematical notation, rather than technical notation. On a broad reading of TUS, all measurement units are granted the use of superscript digits '²' and '³'. Could this policy be extended to include subscript '₁' and '₆'? This may seem an odd question, and answering it positively would eventually throw the door open to wider use of Latin superscripts, in historical data first ('Vᵉ s.'), in more general data next.

As the upside I see content stability and streamlined input (provided that the input interface is up-to-date). Disparity in display may be considered a downside, since only fonts that have reduced capitals (Consolas, Lucida Console, Courier) render modifier letters accurately like superscripts / ordinal indicators. I’ve got into the habit of using modifier letters in abbreviations, and I find they look good in other fonts too. Right now, all that remains is to put them on the keyboard and tell the user “please use them if you are comfortable with them; original encoding for phonetics does not preclude re-use and diversification of usage conventions.”

Some explanation needs to be delivered, because people who know something about Unicode typically object, in a sometimes passionate refrain, that these characters are for use in phonetics only. By the current wording of the relevant parts of the Unicode Standard, Unicode is fueling its own misperception. Some hints in the opposite direction, ideally in TUS 10.0 to be published this year 2017, would (in my opinion) be highly appreciated. Though of course that is not enough to make people really happy.

Marcel

From charupdate at orange.fr Mon Jan 2 14:57:46 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Mon, 2 Jan 2017 21:57:46 +0100 (CET)
Subject: Marking up hexadecimal numbers (was: Re: a character for an unknown character)
In-Reply-To: <2114913271.21841.1483381144279.JavaMail.www@wwinf1p19>
References: <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi> <20161227000522.7bb95f3e@JRWUBU2> <230214984.10599.1482854625210.JavaMail.www@wwinf1p10> <20161229014759.5a51c747@JRWUBU2> <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12> <20161230123727.52a00633@JRWUBU2> <5CF85CCB-4321-4620-BE57-C7914C84BCC7@crissov.de> <551789226.12010.1483218242564.JavaMail.www@wwinf1p10> <2114913271.21841.1483381144279.JavaMail.www@wwinf1p19>
Message-ID: <383840487.25884.1483390666160.JavaMail.www@wwinf1p19>

I’ve messed up my e-mail by not converting HTML to text. Please disregard.
The webmail I use applies HTML tags and deletes all unknown ones. Sorry.

From christoph.paeper at crissov.de Tue Jan 3 02:31:42 2017
From: christoph.paeper at crissov.de (Christoph Päper)
Date: Tue, 3 Jan 2017 09:31:42 +0100
Subject: a character for an unknown character
In-Reply-To: <551789226.12010.1483218242564.JavaMail.www@wwinf1p10>
References: <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi> <20161227000522.7bb95f3e@JRWUBU2> <230214984.10599.1482854625210.JavaMail.www@wwinf1p10> <20161229014759.5a51c747@JRWUBU2> <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12> <20161230123727.52a00633@JRWUBU2> <5CF85CCB-4321-4620-BE57-C7914C84BCC7@crissov.de> <551789226.12010.1483218242564.JavaMail.www@wwinf1p10>
Message-ID:

Marcel Schneider :
> On Sat, 31 Dec 2016 11:01:16 +0100, Christoph Päper wrote:
>>
>> It has indeed. That’s why two different technologies have to be used to get
>> typographically harmonic hexadecimal numbers, e.g. in CSS: …
>
> Thank you for the code. I didn’t know this,

Well, case-insensitivity was intended as *an* argument in favor of encoding digits A–F/a–f, although I know that there are also good arguments against it. (There are certainly also arguments in favor of encoding 0–9 another time just for hexadecimal numbers.)

> so I’ve tried and found that
> the automatic prefixes/suffixes cannot be copied from the web page.

Browsers are still disagreeing about that, but yes, since the affix is generated content by CSS it is considered style and is likely to not get pasted into plain text environments. One could also argue that CSS should be able to render numbers in different styles and bases, but that’s currently neither supported nor planned.
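
To make the trade-off concrete, here is a rough sketch of the other approach raised earlier in the thread: writing the affix into the text itself, so that it is ordinary content and not CSS-generated. Python is used purely for illustration. The `<span class="unicode">` markup, the helper name `hard_code_prefix` and the 4-to-6-digit pattern are assumptions made for this example only; nothing here is something CSS or a browser provides.

```python
import re

# Assumed markup: <span class="unicode">HHHH</span> with 4 to 6 hex digits inside.
_UNICODE_SPAN = re.compile(r'(<span class="unicode">)([0-9A-Fa-f]{4,6})(</span>)')


def hard_code_prefix(html: str, prefix: str = "U+") -> str:
    """Insert `prefix` right after the open tag and uppercase the hex digits.

    Spans that already carry the prefix no longer match the pattern,
    so running the function twice does not double the prefix.
    """
    def add_prefix(match: re.Match) -> str:
        open_tag, digits, close_tag = match.groups()
        return open_tag + prefix + digits.upper() + close_tag

    return _UNICODE_SPAN.sub(add_prefix, html)


print(hard_code_prefix('<span class="unicode">1f600</span>'))
# -> <span class="unicode">U+1F600</span>
```

Because the prefix is then part of the document text, it should survive copying to plain text, at the price of no longer being switchable from the style sheet.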

> Among the possibilities, you include Unicode subscripts.

Just for the sake of completeness.

> The font-variant-numeric: oldstyle-nums seems not to work with any font.

Browser and font support is required and limited, but not as much as a few years ago.

> I note that the "U+" prefix is missing in the list, obviously because it
> denotes more than just a hexadecimal number, and is to be hard-coded.

Yes, I was talking about hexadecimal numbers in general, not limited to Unicode code points.

From drott at google.com Tue Jan 3 07:14:26 2017
From: drott at google.com (Dominik Röttsches)
Date: Tue, 3 Jan 2017 15:14:26 +0200
Subject: Leading ZWJ in Emoji sequences page
Message-ID:

Hi Mark, others,

in http://unicode.org/emoji/charts/emoji-zwj-sequences.html as well as in the beta 5.0 version of this page, some of the "Browser" fields have a leading ZWJ.

Compare copying the full cell contents to the URL bar after "codepoints.net/" for example and it shows the leading ZWJ.

I suggest removing those, as this can lead to unexpected text selection behavior in browsers, for example.

Regards,

Dominik

From mark at macchiato.com Tue Jan 3 07:25:52 2017
From: mark at macchiato.com (Mark Davis ☕️)
Date: Tue, 3 Jan 2017 14:25:52 +0100
Subject: Leading ZWJ in Emoji sequences page
In-Reply-To:
References:
Message-ID:

Thanks for catching this!

Mark

On Tue, Jan 3, 2017 at 2:14 PM, Dominik Röttsches wrote:

> Hi Mark, others,
>
> in http://unicode.org/emoji/charts/emoji-zwj-sequences.html as well as in
> the beta 5.0 version of this page, some of the "Browser" fields have a
> leading ZWJ.
>
> Compare copying the full cell contents to the URL bar after "
> codepoints.net/" for example and it shows the leading ZWJ.
>
> I suggest to remove those as this can lead to unexpected text selection
> behavior in browsers for example.
>
> Regards,
>
> Dominik
>

From charupdate at orange.fr Tue Jan 3 18:24:52 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Wed, 4 Jan 2017 01:24:52 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use (was: Re: a character for an unknown character)
In-Reply-To:
References: <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi> <20161227000522.7bb95f3e@JRWUBU2> <230214984.10599.1482854625210.JavaMail.www@wwinf1p10> <20161229014759.5a51c747@JRWUBU2> <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12> <20161230123727.52a00633@JRWUBU2> <5CF85CCB-4321-4620-BE57-C7914C84BCC7@crissov.de> <551789226.12010.1483218242564.JavaMail.www@wwinf1p10>
Message-ID: <842947588.29513.1483489492387.JavaMail.www@wwinf1p19>

On Tue, 3 Jan 2017 09:31:42 +0100, Christoph Päper wrote:
> > Among the possibilities, you include Unicode subscripts.
>
> Just for the sake of completeness.

This tends to conclude that preformatted subscripts are really an option here.

The TUS snippets [1][2] and common practice show that whatever characters are on the keyboard are used or re-used for superscripts, such as the degree sign as superscript o, and the feminine ordinal indicator as superscript a. Layouts are bafflingly inconsistent across countries; the Belgian AZERTY layout has SUPERSCRIPT THREE where its French (France) counterpart has an empty shift state, while SUPERSCRIPT ONE is missing on both, despite the AltGr shift state being only partially used, and all three being part of Latin-1. Thus, awareness of the usefulness of a given character does not always bear a tight relation to its presence on the keyboard.
In the Unicode era, this may expand into the insight that the availability of an almost complete range of superscripts, and a set of subscripts, including Latin letters, creates the need to add them to national keyboard layouts, to cater for the demand of increasingly important user groups and communities. Supporting this does not necessarily require the Unicode Standard to be reworded, because TUS mainly reflects encoding principles and usage recommendations, without being a typography manual. TUS 9.0, §22.4, p. 786, explains that the recommendation not to use preformatted characters outside phonetics is a mere application of a design principle, regardless of the practical usefulness of the scheme.

I note that in the snippet quoted below, the number “DC0016” is already messed up by copy-pasting it to plain text. By contrast, copying it from Adobe Reader to Microsoft Word brings the font size difference with it, but not the vertical alignment, presumably because the original specifies a custom subscript style that has no generic subscripting information and is not cross-platform compatible. This example highlights a serious downside of the markup-based representation scheme.

As demonstrated with the apostrophe, a recommendation may be changed according to common practice, and reconsidered in the light of differently weighted rules and principles, in line with what Asmus Freytag pointed out on December 28ᵗʰ, 2016, in reply to Richard Wordingham:

> > > > Ideal solutions can also be defeated by limited keyboard layouts. As a
> > > > result, I have no idea whether the singular of "fithp" (one of Larry
> > > > Niven's alien species) should be spelt with U+02BC or U+2019, though in
> > > > ASCII I can just write "fi'".
> > >
> > > The only place where "uni" doesn't apply in Unicode is that there's never
> > > just a single principle that applies, but always multiple ones that are
> > > in tension --- and in the edge cases, the tension can be felt keenly.
> > >

As seen in another example, from a 2015 thread on plain text custom fractions, the English Microsoft Community website hosts recommendations on how to insert fractions made of superscripts, subscripts and the fraction slash U+2044, using a list of autocorrections in Word. To test, I’ve added to the autocorrect list four items converting '.s.' to 'ˢᵗ', '.n.' to 'ⁿᵈ', '.r.' to 'ʳᵈ', '.t.' to 'ᵗʰ'. The result looks fine in Cambria, bad in incomplete fonts mixed with a fallback font, while Arial has the superscript 'n' in a non-standard way, as a legacy remainder, despite TUS specifying that all those characters should be harmonized. It’s up to the user to choose the best fitting option depending on usage and environment. As already discussed, formatting is a working solution on condition that plain text will never be a requirement.

I hope that this lengthy contribution may help to clear the way for users to feel free to use superscript and subscript characters the way they prefer.

Marcel

[1] TUS 9.0, §22.4, p. 786:
|
| In general, the Unicode Standard does not attempt to describe the positioning
| of a character above or below the baseline in typographical layout.
| Therefore, the preferred means to encode superscripted letters or digits,
| such as “1st” or “DC0016”, is by style or markup in rich text. […]
| In addition, superscript digits are used to indicate tone in transliteration
| of many languages. The use of superscript two and superscript three is common
| legacy practice when referring to units of area and volume in general texts.
| http://www.unicode.org/versions/Unicode9.0.0/ch22.pdf#G42931

[2] TUS 9.0, §7.8, p. 327:
|
| The superscript forms of the i and n letters can be found in the
| Superscripts and Subscripts block (U+2070..U+209F). The fact that the latter
| two letters contain the word “superscript” in their names instead of “modifier
| letter” is an historical artifact of original sources for the characters, and
| is not intended to convey a functional distinction in the use of these
| characters in the Unicode Standard.
|
| Superscript modifier letters are intended for cases where the letters carry
| a specific meaning, as in phonetic transcription systems, and are not
| a substitute for generic styling mechanisms for superscripting of text,
| as for footnotes, mathematical and chemical expressions, and the like.
| http://www.unicode.org/versions/Unicode9.0.0/ch07.pdf#G24762

From asmusf at ix.netcom.com Tue Jan 3 21:20:42 2017
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Tue, 3 Jan 2017 19:20:42 -0800
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <842947588.29513.1483489492387.JavaMail.www@wwinf1p19>
References: <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi> <20161227000522.7bb95f3e@JRWUBU2> <230214984.10599.1482854625210.JavaMail.www@wwinf1p10> <20161229014759.5a51c747@JRWUBU2> <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12> <20161230123727.52a00633@JRWUBU2> <5CF85CCB-4321-4620-BE57-C7914C84BCC7@crissov.de> <551789226.12010.1483218242564.JavaMail.www@wwinf1p10> <842947588.29513.1483489492387.JavaMail.www@wwinf1p19>
Message-ID:

On 1/3/2017 4:24 PM, Marcel Schneider wrote:
> On Tue, 3 Jan 2017 09:31:42 +0100, Christoph Päper wrote:
>
>>> Among the possibilities, you include Unicode subscripts.
>> Just for the sake of completeness.
> This tends to conclude that preformatted subscripts are really an option here.

Not so. You yourself quote this statement:

| Superscript modifier letters are intended for cases where the letters carry
| a specific meaning, as in phonetic transcription systems, and are not
| a substitute for generic styling mechanisms for superscripting of text,
| as for footnotes, mathematical and chemical expressions, and the like.

It is clear that the uses that you advocate go against this intent.

Therefore, your conclusion that this is "an option" is nothing more than a very personal opinion on your part (and one that many people here would consider misguided if presented as general recommendation).

A./

From john.w.kennedy at gmail.com Tue Jan 3 23:36:38 2017
From: john.w.kennedy at gmail.com (John W Kennedy)
Date: Wed, 4 Jan 2017 00:36:38 -0500
Subject: Superscript and Subscript Characters in General Use
In-Reply-To:
References: <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi> <20161227000522.7bb95f3e@JRWUBU2> <230214984.10599.1482854625210.JavaMail.www@wwinf1p10> <20161229014759.5a51c747@JRWUBU2> <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12> <20161230123727.52a00633@JRWUBU2> <5CF85CCB-4321-4620-BE57-C7914C84BCC7@crissov.de> <551789226.12010.1483218242564.JavaMail.www@wwinf1p10> <842947588.29513.1483489492387.JavaMail.www@wwinf1p19>
Message-ID:

> On Jan 3, 2017, at 10:20 PM, Asmus Freytag wrote:
>
> On 1/3/2017 4:24 PM, Marcel Schneider wrote:
>> On Tue, 3 Jan 2017 09:31:42 +0100, Christoph Päper wrote:
>>
>>>> Among the possibilities, you include Unicode subscripts.
>>> Just for the sake of completeness.
>> This tends to conclude that preformatted subscripts are really an option here.
>
> Not so. You yourself quote this statement:
>
> | Superscript modifier letters are intended for cases where the letters carry
> | a specific meaning, as in phonetic transcription systems, and are not
> | a substitute for generic styling mechanisms for superscripting of text,
> | as for footnotes, mathematical and chemical expressions, and the like.
>
> It is clear that the uses that you advocate go against this intent.
>
> Therefore, your conclusion that this is "an option" is nothing more than a very personal
> opinion on your part (and one that many people here would consider misguided if
> presented as general recommendation).
>
> A./

As long as this is being discussed, what about the historic practice of using M‘ (nowadays often seen as M’ instead) in Scottish names—e.g., M‘Donald—as a typographic substitute for M(superscript c)?

--
John W Kennedy
Having switched to a Mac in disgust at Microsoft's combination of incompetence and criminality.

From asmusf at ix.netcom.com Wed Jan 4 00:48:09 2017
From: asmusf at ix.netcom.com (Asmus Freytag (c))
Date: Tue, 3 Jan 2017 22:48:09 -0800
Subject: Superscript and Subscript Characters in General Use
In-Reply-To:
References: <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi> <20161227000522.7bb95f3e@JRWUBU2> <230214984.10599.1482854625210.JavaMail.www@wwinf1p10> <20161229014759.5a51c747@JRWUBU2> <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12> <20161230123727.52a00633@JRWUBU2> <5CF85CCB-4321-4620-BE57-C7914C84BCC7@crissov.de> <551789226.12010.1483218242564.JavaMail.www@wwinf1p10> <842947588.29513.1483489492387.JavaMail.www@wwinf1p19>
Message-ID:

On 1/3/2017 9:36 PM, John W Kennedy wrote:
> As long as this is being discussed, what about the historic practice of using M‘ (nowadays often seen as M’ instead) in Scottish names—e.g., M‘Donald—as a typographic substitute for M(superscript c)?

What about it?

There are dozens, perhaps hundreds of fallbacks that have been used over time, both in hot metal typography and with typewriters or digital systems.

Some practices may have started in ways similar to a fallback, but have now evolved into standard practice. Other ones remain fallbacks or went out of fashion.

It's an interesting example, but what kind of discussion did you have in mind?

A./

From duerst at it.aoyama.ac.jp Wed Jan 4 02:12:00 2017
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Wed, 4 Jan 2017 17:12:00 +0900
Subject: IdnaTest.txt and RFC 5893
In-Reply-To:
References:
Message-ID:

Hello Alastair,

On 2016/12/06 20:51, Alastair Houghton wrote:
> Hi all,
>
> I must be missing something; in IdnaTest.txt, in the BIDI TESTS section, there are examples like (line 74)

Can you tell us where you got IdnaTest.txt from?

> B; 0à.\u05D0; ; xn--0-sfa.xn--4db # 0à.א
>
> which the file alleges are valid, but I cannot for the life of me see why. First, “0à.א” is clearly a “Bidi domain name” since it has at least one RTL label, “א”. As such, the Bidi Rule (RFC 5893 section 2) should be applied to its labels, and the label “0à” fails [B1], since the first character has Bidi property EN, not L, R or AL.

On first sight, it looks to me as if you're correct.

For the exact interpretation of RFC 5893, you'd better write to the mailing list of the former IDNA(bis) WG at idna-update at alvestrand.no.

Regards, Martin.

> Similarly (line 93)
>
> B; àˇ.\u05D0; ; xn--0ca88g.xn--4db # àˇ.א
>
> Again, “àˇ.א” is clearly a “Bidi domain name”, but “àˇ” fails [B6], because “ˇ” has Bidi property ON, not L, EN or NSM.
>
> Have I misunderstood something fundamental here? Could someone explain why those examples are valid, in spite of RFC 5893?
>
> Kind regards,
>
> Alastair.
>
> --
> http://alastairs-place.net
>
>
> .
>

--
Prof. Dr.sc. Martin J. Dürst
Department of Intelligent Information Technology
College of Science and Engineering
Aoyama Gakuin University
Fuchinobe 5-1-10, Chuo-ku, Sagamihara
252-5258 Japan

From verdy_p at wanadoo.fr Wed Jan 4 02:12:43 2017
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 4 Jan 2017 09:12:43 +0100
Subject: Superscript and Subscript Characters in General Use
In-Reply-To:
References: <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi> <20161227000522.7bb95f3e@JRWUBU2> <230214984.10599.1482854625210.JavaMail.www@wwinf1p10> <20161229014759.5a51c747@JRWUBU2> <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12> <20161230123727.52a00633@JRWUBU2> <5CF85CCB-4321-4620-BE57-C7914C84BCC7@crissov.de> <551789226.12010.1483218242564.JavaMail.www@wwinf1p10> <842947588.29513.1483489492387.JavaMail.www@wwinf1p19>
Message-ID:

This is the traditional use of the apostrophe to mark an elision at the end of words. Nothing new.

2017-01-04 6:36 GMT+01:00 John W Kennedy :

>
> > On Jan 3, 2017, at 10:20 PM, Asmus Freytag wrote:
> >
> > On 1/3/2017 4:24 PM, Marcel Schneider wrote:
> >> On Tue, 3 Jan 2017 09:31:42 +0100, Christoph Päper wrote:
> >>
> >>>> Among the possibilities, you include Unicode subscripts.
> >>> Just for the sake of completeness.
> >> This tends to conclude that preformatted subscripts are really an
> option here.
> >
> > Not so. You yourself quote this statement:
> >
> > | Superscript modifier letters are intended for cases where the letters
> carry
> > | a specific meaning, as in phonetic transcription systems, and are not
> > | a substitute for generic styling mechanisms for superscripting of text,
> > | as for footnotes, mathematical and chemical expressions, and the like.
> >
> > It is clear that the uses that you advocate go against this intent.
> >
> > Therefore, your conclusion that this is "an option" is nothing more than
> a very personal
> > opinion on your part (and one that many people here would consider
> misguided if
> > presented as general recommendation).
> >
> > A./
>
> As long as this is being discussed, what about the historic practice of
> using M‘ (nowadays often seen as M’ instead) in Scottish names—e.g.,
> M‘Donald—as a typographic substitute for M(superscript c)?
>
> --
> John W Kennedy
> Having switched to a Mac in disgust at Microsoft's combination of
> incompetence and criminality.
>
>
>

From alastair at alastairs-place.net Wed Jan 4 04:28:38 2017
From: alastair at alastairs-place.net (Alastair Houghton)
Date: Wed, 4 Jan 2017 10:28:38 +0000
Subject: IdnaTest.txt and RFC 5893
In-Reply-To:
References:
Message-ID: <44B684E5-3EC7-43DE-8BFE-19935FEC8946@alastairs-place.net>

On 4 Jan 2017, at 08:12, Martin J. Dürst wrote:
>
> Hello Alastair,
>
> On 2016/12/06 20:51, Alastair Houghton wrote:
>> Hi all,
>>
>> I must be missing something; in IdnaTest.txt, in the BIDI TESTS section, there are examples like (line 74)
>
> Can you tell us where you got IdnaTest.txt from?

Yes, sorry, I should have included that information. It’s here, with the IDNA mapping table

http://www.unicode.org/Public/idna/9.0.0/

which I arrived at from UTS #46 ().

>> B; 0à.\u05D0; ; xn--0-sfa.xn--4db # 0à.א
>>
>> which the file alleges are valid, but I cannot for the life of me see why. First, “0à.א” is clearly a “Bidi domain name” since it has at least one RTL label, “א”. As such, the Bidi Rule (RFC 5893 section 2) should be applied to its labels, and the label “0à” fails [B1], since the first character has Bidi property EN, not L, R or AL.
>
> On first sight, it looks to me as if you're correct.
>
> For the exact interpretation of RFC 5893, you'd better write to the mailing list of the former IDNA(bis) WG at idna-update at alvestrand.no.
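
For anyone wanting to reproduce the two reported failures, a rough sketch of the six conditions of the RFC 5893 section 2 Bidi Rule follows, using the Bidi classes from Python's bundled `unicodedata` module and the Punycode forms given in the quoted test lines. The helper names `decode_a_label` and `failed_bidi_conditions` are made up for this example. It checks one label at a time, assumes the caller has already established that the name is a Bidi domain name, and is not a full IDNA or UTS #46 implementation.

```python
import unicodedata

# Bidi classes permitted in RTL and LTR labels (RFC 5893 section 2, conditions 2 and 5).
RTL_ALLOWED = {"R", "AL", "AN", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}
LTR_ALLOWED = {"L", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}


def decode_a_label(a_label: str) -> str:
    """Decode an 'xn--' A-label to Unicode with Python's punycode codec."""
    return a_label[len("xn--"):].encode("ascii").decode("punycode")


def failed_bidi_conditions(label: str) -> list:
    """Return the numbers (1..6) of the Bidi Rule conditions a non-empty label violates."""
    classes = [unicodedata.bidirectional(ch) for ch in label]
    first = classes[0]
    if first not in ("L", "R", "AL"):
        # Condition 1 fails; the label direction is undefined, so stop here.
        return [1]

    # Conditions 3 and 6 look at the last character before any trailing NSMs.
    trimmed = list(classes)
    while len(trimmed) > 1 and trimmed[-1] == "NSM":
        trimmed.pop()
    last = trimmed[-1]

    failed = []
    if first in ("R", "AL"):  # RTL label
        if not set(classes) <= RTL_ALLOWED:
            failed.append(2)
        if last not in ("R", "AL", "EN", "AN"):
            failed.append(3)
        if "EN" in classes and "AN" in classes:
            failed.append(4)
    else:  # LTR label
        if not set(classes) <= LTR_ALLOWED:
            failed.append(5)
        if last not in ("L", "EN"):
            failed.append(6)
    return failed


for a_label in ("xn--0-sfa", "xn--0ca88g", "xn--4db"):
    print(a_label, failed_bidi_conditions(decode_a_label(a_label)))
```

With the Unicode data shipped in current Python versions, this should report condition 1 for the first label and condition 6 for the second, and no failure for the Hebrew label.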
RFC 5893 seems pretty clear to me, and the problem really is that the test vectors (which come from unicode.org) seem (to me) to be incorrect. I think the Unicode list is, therefore, the right place to raise this issue, but you?re right that it might attract attention from the right people if I also fire off a mail to the IDNA WG list. >> Similarly (line 93) >> >> B; ??.\u05D0; ; xn--0ca88g.xn--4db # ??.? >> >> Again, ???.?? is clearly a ?Bidi domain name?, but ???? fails [B6], because ??? has Bidi property ON, not L, EN or NSM. >> >> Have I misunderstood something fundamental here? Could someone explain why those examples are valid, in spite of RFC 5893? As an additional data point, ICU?s IDNA demo web page appears to think these names are OK. Kind regards, Alastair. -- http://alastairs-place.net From john.w.kennedy at gmail.com Wed Jan 4 05:44:14 2017 From: john.w.kennedy at gmail.com (John W Kennedy) Date: Wed, 4 Jan 2017 06:44:14 -0500 Subject: Superscript and Subscript Characters in General Use In-Reply-To: References: <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi> <20161227000522.7bb95f3e@JRWUBU2> <230214984.10599.1482854625210.JavaMail.www@wwinf1p10> <20161229014759.5a51c747@JRWUBU2> <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12> <20161230123727.52a00633@JRWUBU2> <5CF85CCB-4321-4620-BE57-C7914C84BCC7@crissov.de> <551789226.12010.1483218242564.JavaMail.www@wwinf1p10> <842947588.29513.1483489492387.JavaMail.www@wwinf1p19> Message-ID: <3ADB5847-528D-45B0-A963-F0CACC7A69E9@gmail.com> No it isn?t. It isn?t an apostrophe; it?s a left single quote, although some modern printers mistakenly suppose it to be an apostrophe, and substitute one. And it isn?t an elision; it?s meant as a substitute glyph for a superscript c. (I confess that, not being from Scotland, I thought it to be an elision for over fifty years, but when I was preparing a transcription of William Dunlap?s ?Andr?: a Tragedy in Five Acts? 
[New York, 1798], in which a character named “M‘Donald” plays a major role, I looked into the matter, and was surprised to learn the truth.)

> On Jan 4, 2017, at 3:12 AM, Philippe Verdy wrote:
>
> This is the traditional use of the apostrophe to be used to mark an elision at end of words. Nothing new.
>
> 2017-01-04 6:36 GMT+01:00 John W Kennedy :
>>
>> > On Jan 3, 2017, at 10:20 PM, Asmus Freytag wrote:
>> >
>> > On 1/3/2017 4:24 PM, Marcel Schneider wrote:
>> >> On Tue, 3 Jan 2017 09:31:42 +0100, Christoph Päper wrote:
>> >>
>> >>>> Among the possibilities, you include Unicode subscripts.
>> >>> Just for the sake of completeness.
>> >> This tends to conclude that preformatted subscripts are really an option here.
>> >
>> > Not so. You yourself quote this statement:
>> >
>> > | Superscript modifier letters are intended for cases where the letters carry
>> > | a specific meaning, as in phonetic transcription systems, and are not
>> > | a substitute for generic styling mechanisms for superscripting of text,
>> > | as for footnotes, mathematical and chemical expressions, and the like.
>> >
>> > It is clear that the uses that you advocate go against this intent.
>> >
>> > Therefore, your conclusion that this is "an option" is nothing more than a very personal
>> > opinion on your part (and one that many people here would consider misguided if
>> > presented as general recommendation).
>> >
>> > A./
>>
>> As long as this is being discussed, what about the historic practice of using M‘ (nowadays often seen as M’ instead) in Scottish names, e.g., M‘Donald, as a typographic substitute for M(superscript c)?
>>
>> --
>> John W Kennedy
>> Having switched to a Mac in disgust at Microsoft's combination of incompetence and criminality.

From verdy_p at wanadoo.fr Wed Jan 4 06:43:50 2017
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 4 Jan 2017 13:43:50 +0100
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <3ADB5847-528D-45B0-A963-F0CACC7A69E9@gmail.com>
References: <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi> <20161227000522.7bb95f3e@JRWUBU2> <230214984.10599.1482854625210.JavaMail.www@wwinf1p10> <20161229014759.5a51c747@JRWUBU2> <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12> <20161230123727.52a00633@JRWUBU2> <5CF85CCB-4321-4620-BE57-C7914C84BCC7@crissov.de> <551789226.12010.1483218242564.JavaMail.www@wwinf1p10> <842947588.29513.1483489492387.JavaMail.www@wwinf1p19> <3ADB5847-528D-45B0-A963-F0CACC7A69E9@gmail.com>
Message-ID:

Linguistically, it is an apostrophe, even if it's represented by a single quote (same as in French), because the "letter apostrophe" is not used (that letter apostrophe was encoded in Unicode very late and there's no desire to change the mappings in French or Scottish). If you think it is a substitute only because of the very superficial appearance of that superscript c, I think it is just a hack used by some old printer that did not have that letter in their case box. In 1798 printing a book was expensive and metal fonts were also costly, and writers accepted some minor transforms of their manuscript by the printer (and frequent typos as well). Later re-editions frequently correct these typos.

Note that in French the right single quote is normally not used at all as a quotation mark, and when it appears between two letters it is unambiguously an apostrophe. I think the letter apostrophe was added later in Unicode only for English to allow distinctions. But I've rarely seen it used. Later it was used as a substitute for a glottal stop in some Polynesian/Melanesian languages but the actual character was encoded and is preferable (its glyph is distinctive).

From moyogo at gmail.com Wed Jan 4 07:30:12 2017
From: moyogo at gmail.com (Denis Jacquerye)
Date: Wed, 04 Jan 2017 13:30:12 +0000
Subject: Superscript and Subscript Characters in General Use
In-Reply-To:
References: <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi> <20161227000522.7bb95f3e@JRWUBU2> <230214984.10599.1482854625210.JavaMail.www@wwinf1p10> <20161229014759.5a51c747@JRWUBU2> <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12> <20161230123727.52a00633@JRWUBU2> <5CF85CCB-4321-4620-BE57-C7914C84BCC7@crissov.de> <551789226.12010.1483218242564.JavaMail.www@wwinf1p10> <842947588.29513.1483489492387.JavaMail.www@wwinf1p19> <3ADB5847-528D-45B0-A963-F0CACC7A69E9@gmail.com>
Message-ID:

Philippe, you are talking about U+0027 APOSTROPHE, U+2019 RIGHT SINGLE QUOTATION MARK and U+02BC MODIFIER LETTER APOSTROPHE. John is clearly talking about U+2018 LEFT SINGLE QUOTATION MARK (or, if you want to stretch it, U+02BB MODIFIER LETTER TURNED COMMA) being used as a substitute for superscript c. They all look alike at small size or in some fonts, which explains your misunderstanding even if John was explicit about it being a left single quote.
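
[Editorial illustration] The five characters Denis Jacquerye names look alike in many fonts, but the Unicode Character Database keeps them apart. The following minimal Python sketch prints their general categories and names; the helper name `describe` is hypothetical, and the values reported depend on the Unicode data bundled with the local Python.

```python
import unicodedata

# Code points named in the message above.
CANDIDATES = [0x0027, 0x2019, 0x02BC, 0x2018, 0x02BB]


def describe(cp):
    """Return (U+XXXX, general category, character name) for a code point."""
    ch = chr(cp)
    return (f"U+{cp:04X}", unicodedata.category(ch), unicodedata.name(ch))


for row in map(describe, CANDIDATES):
    print(*row)
```

The two modifier letters report category Lm (a letter), while the quotation marks and the ASCII apostrophe are punctuation, which is the distinction the thread turns on.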

From charupdate at orange.fr Wed Jan 4 08:20:40 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Wed, 4 Jan 2017 15:20:40 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To:
References: <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi> <20161227000522.7bb95f3e@JRWUBU2> <230214984.10599.1482854625210.JavaMail.www@wwinf1p10> <20161229014759.5a51c747@JRWUBU2> <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12> <20161230123727.52a00633@JRWUBU2> <5CF85CCB-4321-4620-BE57-C7914C84BCC7@crissov.de> <551789226.12010.1483218242564.JavaMail.www@wwinf1p10> <842947588.29513.1483489492387.JavaMail.www@wwinf1p19>
Message-ID: <106567611.8574.1483539640797.JavaMail.www@wwinf1k14>

On Wed, 4 Jan 2017 00:36:38 -0500, Asmus Freytag wrote:
>
> On 1/3/2017 4:24 PM, Marcel Schneider wrote:
> > On Tue, 3 Jan 2017 09:31:42 +0100, Christoph Päper wrote:
> >
> >>> Among the possibilities, you include Unicode subscripts.
> >> Just for the sake of completeness.
> > This tends to conclude that preformatted subscripts are really an option here.
>
> Not so. You yourself quote this statement:
>
> | Superscript modifier letters are intended for cases where the letters carry
> | a specific meaning, as in phonetic transcription systems, and are not
> | a substitute for generic styling mechanisms for superscripting of text,
> | as for footnotes, mathematical and chemical expressions, and the like.
>
> It is clear that the uses that you advocate go against this intent.

This is because even complemented with UAXes and TRs, the Core Specifications cannot cover the whole practice. It seems that to stay inside reasonable limits, a significant number of usage cases have been left out, e.g. the mentioned use of plain text for styled custom vulgar fractions is a recognized practice, but stays persistently excluded from TUS. However, since the inclusion of this could consist in adding three lines to the text, there is more to it.
Out of technical as well as ethical considerations, Unicode is unable to promote the discussed usages, but without strongly discouraging them. The snippet above [1] would be less harsh at the expense of some redundancy:

| Superscript modifier letters are intended for cases where the letters carry
| a specific meaning, as in phonetic transcription systems, and are not INTENDED
| AS a substitute for generic styling mechanisms for superscripting of text,
| as for footnotes, mathematical and chemical expressions, and the like.

This resolves to the meaning that super-/subscripting in more or less ordinary text is outside the design principles of the Unicode Standard, because the boundary between the feasible and the unfeasible would be hard to draw, as shown with the recent example of the plain text database for chemical formulas. So to protect itself against the temptation of drawing that boundary (drawing it at risk of being subsequently compelled to move it further), Unicode *declares* those characters as being *intended for* special contexts, according to their very encoding history.

Trying to understand to what extent this principle is applicable, I note that the three cited examples currently imply much more formatting than superscripting. This is the case of structural formulae in _chemistry_, complex _mathematical_ expressions, and _footnote_ management and layout.

By contrast, when it’s only about super- or subscripting a few digits or Latin letters, markup and use of rich text may be considered overkill. And in the case of content that the reader may wish to copy-paste, things like the “16” affix of hex numbers should remain distinct.

Hence, styling is only “the preferred means”, not the mandatory way to represent superscript letters or digits.[2] And this is tied to a /design/ principle of the Standard. I believe that /usage/ principles may diverge.
>
> Therefore, your conclusion that this is "an option" is nothing more than
> a very personal opinion on your part (and one that many people here would
> consider misguided if presented as general recommendation).

Presenting this as general recommendation was indeed what I intended when starting the first thread of this discussion. Thanks to your and other subscribers’ replies, I’ve come to the insight that this cannot be recommended throughout, not in a general way.

However, this not being "an option" remains still very unclear to me. As a result of prior discussions, we know that other list participants do use e.g. superscript characters in a more extensive way.

I think there are two levels of action:

(1) to encode new preformatted characters;
(2) to encourage re-use of already existing ones.

I understand that Unicode is consistently reluctant in both, while ISO/IEC is able to do more in (1) given that they sometimes add (or remove) characters to(/from) the new repertoire, and National Bodies are in a position to do (2) through usage recommendations of their own. Let alone all the other people who may use or not use available preformatted characters for any purpose, eventually sharing the hint and, in the best case, the means to input them efficiently.

Or am I missing something?

Given that the WG of the French standard keyboard is actually interested in getting encoded a new ordinal indicator (kind of '?'), I feel the more urged to stay tuned, and to comment on subsequent e-mails, too.

Marcel

[1] TUS 9.0, §7.8, p. 327.
http://www.unicode.org/versions/Unicode9.0.0/ch07.pdf#G24762

[2] TUS 9.0, §22.4, p. 786.
http://www.unicode.org/versions/Unicode9.0.0/ch22.pdf#G42931

From alastair at alastairs-place.net Wed Jan 4 09:13:36 2017
From: alastair at alastairs-place.net (Alastair Houghton)
Date: Wed, 4 Jan 2017 15:13:36 +0000
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <106567611.8574.1483539640797.JavaMail.www@wwinf1k14>
References: <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi> <20161227000522.7bb95f3e@JRWUBU2> <230214984.10599.1482854625210.JavaMail.www@wwinf1p10> <20161229014759.5a51c747@JRWUBU2> <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12> <20161230123727.52a00633@JRWUBU2> <5CF85CCB-4321-4620-BE57-C7914C84BCC7@crissov.de> <551789226.12010.1483218242564.JavaMail.www@wwinf1p10> <842947588.29513.1483489492387.JavaMail.www@wwinf1p19> <106567611.8574.1483539640797.JavaMail.www@wwinf1k14>
Message-ID: <0A3C029B-3C76-43E8-B6A5-4EF96093B044@alastairs-place.net>

On 4 Jan 2017, at 14:20, Marcel Schneider wrote:
> As a result of prior discussions, we know that other list participants do use e.g.
> superscript characters in a more extensive way.
>
> I think there are two levels of action:
>
> (1) to encode new preformatted characters;
> (2) to encourage re-use of already existing ones.
>
> I understand that Unicode is consistently reluctant in both, while ISO/IEC is able
> to do more in (1) given that they sometimes add (or remove) characters to(/from)
> the new repertoire, and National Bodies are in a position to do (2) through usage
> recommendations of their own. Let alone all the other people who may use or not
> use available preformatted characters for any purpose, eventually sharing the hint
> and, in the best case, the means to input them efficiently.
>
> Or am I missing something?
>
> Given that the WG of the French standard keyboard is actually interested in getting
> encoded a new ordinal indicator (kind of '?'), I feel the more urged to stay tuned,
> and to comment on subsequent e-mails, too.

I can understand the desire to encode the new ordinal indicator.

Perhaps another option worth contemplating might be to standardise some control code points, to provide a mechanism for “plain text” to include the necessary minimum of formatting information without additional markup. The advantage of this approach is that it would make it explicitly obvious that Unicode wasn’t going to include further super or subscript forms, while providing everyone that wants them with access to a full set of super or subscripts subject to system (or font) support.

A simple form of this might be to encode the new zero-width modifier code points SUBSCRIPT and SUPERSCRIPT that work somewhat like the variation selectors, so e.g.

  U+0032 DIGIT TWO
  U+???? SUPERSCRIPT
  U+0033 DIGIT THREE
  U+???? SUBSCRIPT

would display as ²₃ on fonts that supported the new modifiers. The advantage of taking this very simplistic approach is that it can be dealt with in the OpenType (or AAT) tables in modern fonts, rather than necessitating changes to rendering code. It is also obviously not an attempt to replace markup, but will cope with most common “plain text” uses.

Kind regards,

Alastair.

--
http://alastairs-place.net

From doug at ewellic.org Wed Jan 4 13:20:14 2017
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 04 Jan 2017 12:20:14 -0700
Subject: Superscript and Subscript Characters in General Use
Message-ID: <20170104122014.665a7a7059d7ee80bb4d670165c8327d.20efb7fc52.wbe@email03.godaddy.com>

Marcel Schneider wrote:

> This is because even complemented with UAXes and TRs, the Core
> Specifications cannot cover the whole practice. It seems that to stay
> inside reasonable limits, a significant number of usage cases have
> been left out, e.g. the mentioned use of plain text for styled custom
> vulgar fractions is a recognized practice, but stays persistently
> excluded from TUS.

I don't understand the relevance to vulgar fractions.

Much of this thread has dealt with Basic Latin characters that have no superscript or subscript clones, and how their absence prevents certain passages from being representable in plain text. This is your basic debate over what constitutes plain text.

As explained in the July 2015 thread about vulgar fractions, TUS sections 6.2 and 22.3 thoroughly explain the use of U+2044 FRACTION SLASH with normal "Nd" digits. If I want to write "ninety-nine and forty-four one-hundredths," with the non-precomposed vulgar fraction, I can write "99 44⁄100" and be fully compliant with the Standard. This has nothing to do with what is and isn't plain text.

The fact that many current rendering systems can't render this correctly is an implementation matter, though a hard-to-fix one. (Note that the fallback display is perfectly readable and correct, unless you see a box for U+2009.)

The fact that TUS doesn't sanction the use of U+2044 with superscript and subscript digits, which I imagine Marcel was alluding to, is irrelevant. TUS is a character encoding standard, not a glyph encoding standard.

If Marcel is talking about distinguishing between horizontal and diagonal slashes in vulgar fractions, this is still not a question of plain text. However, in the emoji era, this type of presentation variation has become something that Unicode cares about, and so it might be handled in some way in the future, such as with a variation selector. I suspect this mechanism has been "excluded from TUS" because it doesn't yet exist.
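
[Editorial illustration] The plain-text spelling described above, ordinary decimal digits around U+2044 with U+2009 separating the whole number, can be built and inspected with a minimal Python sketch. The function name `mixed_fraction` and the ASCII fallback mapping are assumptions made for this example; neither comes from the Standard or from the thread.

```python
import unicodedata

THIN_SPACE = "\u2009"      # separates the whole number from the fraction
FRACTION_SLASH = "\u2044"  # requests fraction rendering between digit runs


def mixed_fraction(whole, numerator, denominator):
    """Spell a mixed number with ordinary digits and U+2044."""
    return f"{whole}{THIN_SPACE}{numerator}{FRACTION_SLASH}{denominator}"


text = mixed_fraction(99, 44, 100)

# List each character with its general category and name.
for ch in text:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

# A crude ASCII fallback for systems that cannot show U+2009 or U+2044.
fallback = text.replace(THIN_SPACE, " ").replace(FRACTION_SLASH, "/")
print(fallback)
```

Every digit in the string keeps general category Nd, so searching, sorting and numeric parsing of the digits are unaffected; whether the fraction is drawn as a built-up glyph is left to the font and renderer.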

--
Doug Ewell | Thornton, CO, US | ewellic.org

From nobody_uses at outlook.com Wed Jan 4 15:18:32 2017
From: nobody_uses at outlook.com (eduardo marin)
Date: Wed, 4 Jan 2017 21:18:32 +0000
Subject: Soyombo empty letter frame
Message-ID:

The Soyombo proposal is beautiful, but it is missing a very important character in my opinion:

http://www.unicode.org/L2/L2015/15004-soyombo.pdf

Encoding an empty letter frame will allow for its proper description in plain text (as is clear in the proposal itself). It could also be used as a stylized cursor in text processors, and we could make ZWJ sequences such that combining it with consonants renders only the nucleus.

From charupdate at orange.fr Wed Jan 4 15:48:29 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Wed, 4 Jan 2017 22:48:29 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <0A3C029B-3C76-43E8-B6A5-4EF96093B044@alastairs-place.net>
References: <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi> <20161227000522.7bb95f3e@JRWUBU2> <230214984.10599.1482854625210.JavaMail.www@wwinf1p10> <20161229014759.5a51c747@JRWUBU2> <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12> <20161230123727.52a00633@JRWUBU2> <5CF85CCB-4321-4620-BE57-C7914C84BCC7@crissov.de> <551789226.12010.1483218242564.JavaMail.www@wwinf1p10> <842947588.29513.1483489492387.JavaMail.www@wwinf1p19> <106567611.8574.1483539640797.JavaMail.www@wwinf1k14> <0A3C029B-3C76-43E8-B6A5-4EF96093B044@alastairs-place.net>
Message-ID: <57352104.17641.1483566509921.JavaMail.www@wwinf1k14>

On Wed, 4 Jan 2017 15:13:36 +0000, Alastair Houghton wrote:
>
> > Given that the WG of the French standard keyboard is actually interested in getting
> > encoded a new ordinal indicator (kind of '?'), I feel the more urged to stay tuned,
> > and to comment on subsequent e-mails, too.
>
> I can understand the desire to encode the new ordinal indicator.
>
> Perhaps another option worth contemplating might be to standardise some control
> code points, to provide a mechanism for “plain text” to include the necessary
> minimum of formatting information without additional markup. The advantage of
> this approach is that it would make it explicitly obvious that Unicode wasn’t
> going to include further super or subscript forms, while providing everyone that
> wants them with access to a full set of super or subscripts subject to system
> (or font) support.
>
> A simple form of this might be to encode the new zero-width modifier code points
> SUBSCRIPT and SUPERSCRIPT that work somewhat like the variation selectors, so e.g.
>
> U+0032 DIGIT TWO
> U+???? SUPERSCRIPT
> U+0033 DIGIT THREE
> U+???? SUBSCRIPT
>
> would display as ²₃ on fonts that supported the new modifiers. The advantage of
> taking this very simplistic approach is that it can be dealt with in the OpenType
> (or AAT) tables in modern fonts, rather than necessitating changes to rendering
> code. It is also obviously not an attempt to replace markup, but will cope with
> most common “plain text” uses.

This would indeed make for stable plain text representations that convey the necessary vertical alignment. However its encoding would imply that the design principle of “not attempt[ing] to describe the positioning of a character above or below the baseline in typographical layout” is superseded in this particular case, that provides a universal mechanism for a basic formatting parameter.

Consistently this would call for some extensions catering for other formatting parameters. The expense in code points would be very low, the scheme would meet user expectations, and the Standard would become even more performative and thus, even more attractive through its enhancing the plain text environment. Eventually, the display of text editors, that actually is internally directed (for syntactic highlighting), would become text-guided. This is not far from rich-text.
It all tends to the conclusion that the French demand is based upon: modifier letters that are superscript forms are not real superscripts; they don’t fit the expectations of people regarding superscripts and abbreviations. I already expressed my point of view in this discussion. But the real concern could be to emulate the Spanish ordinal indicators, arguing that their being a part of Unicode justifies similar facilities for other languages.

Here the Unicode position is that the Spanish ordinal indicators are backcompat code points for roundtrip compatibility with ISO/IEC 8859-1. This clearly results from the Code Charts at U+00AA, U+00BA. There has been a deadline, that diligence made to precede. Let alone that a complete set of ordinal indicators for French necessitates four letters, which probably exceeds the framework of 8-bit charsets common to several countries.

As far as the discussion grew until now, I feel that French must live with the existing infrastructure. Hence the idea of re-using four modifier letters for that purpose. If I’m wrong with this idea, that could be good or bad news. Good news if the generic SUPERSCRIPT and SUBSCRIPT variant selectors (or alternatively, new ordinal indicators) are actually encoded. Bad news if that as well as the re-use of modifier letters is discarded. In between, I see the out-of-the-box modifier letter solution as a kind of second-best choice. Better than nothing at all. In certain circumstances, better than markup and formatting.

Kind regards,

Marcel

From richard.wordingham at ntlworld.com Wed Jan 4 16:12:00 2017
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 4 Jan 2017 22:12:00 +0000
Subject: Superscript and Subscript Characters in General Use
In-Reply-To:
References: <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi> <20161227000522.7bb95f3e@JRWUBU2> <230214984.10599.1482854625210.JavaMail.www@wwinf1p10> <20161229014759.5a51c747@JRWUBU2> <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12> <20161230123727.52a00633@JRWUBU2> <5CF85CCB-4321-4620-BE57-C7914C84BCC7@crissov.de> <551789226.12010.1483218242564.JavaMail.www@wwinf1p10> <842947588.29513.1483489492387.JavaMail.www@wwinf1p19> <3ADB5847-528D-45B0-A963-F0CACC7A69E9@gmail.com>
Message-ID: <20170104221200.2a04ba12@JRWUBU2>

On Wed, 4 Jan 2017 13:43:50 +0100
Philippe Verdy wrote:

> Note that in French the right single quote is normally not used at
> all as a quotation mark, and when it appears between two letters it
> is unambiguously an apostrophe. I think the letter apostrophe was
> added later in Unicode only for English to allow distinctions. But
> I've rarely seen used. Later it was used as a substitute for a
> glottal stop in some Polynesian/Melanesian languages but the actual
> character was encoded and is preferable (its glyph is distinctive).

As consonants, what we have are spacing clones (U+02BC and U+02BE) of the smooth breathing, usually used for glottal stops, and spacing clones of the rough breathing (U+02BD and U+02BF). We also have the modifier modifications of the IPA letters U+02C0 and U+02C1. These usages only fit English well when representing the glottalisation (or even total loss) of /t/ after vowels.

> 2017-01-04 12:44 GMT+01:00 John W Kennedy :
>
> > No it isn’t. It isn’t an apostrophe; it’s a left single quote,
> > although some modern printers mistakenly suppose it to be an
> > apostrophe, and substitute one.
> > And it isn’t an elision; it’s meant
> > as a substitute glyph for a superscript c.

For which I would suggest U+02BF MODIFIER LETTER LEFT HALF RING would be the best modern representative of the substitute character! Of course, that would further increase confusion of those who initially read U+02BF as a superscript 'c', and only later, if ever, realise that it's actually a rough breathing carefully distinguished from the similar punctuation marks.

Richard.

From charupdate at orange.fr Wed Jan 4 17:36:49 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Thu, 5 Jan 2017 00:36:49 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <20170104122014.665a7a7059d7ee80bb4d670165c8327d.20efb7fc52.wbe@email03.godaddy.com>
References: <20170104122014.665a7a7059d7ee80bb4d670165c8327d.20efb7fc52.wbe@email03.godaddy.com>
Message-ID: <1956590416.18185.1483573009321.JavaMail.www@wwinf1k14>

On Wed, 04 Jan 2017 12:20:14 -0700, Doug Ewell wrote:
>
> Marcel Schneider wrote:
>
> > This is because even complemented with UAXes and TRs, the Core
> > Specifications cannot cover the whole practice. It seems that to stay
> > inside reasonable limits, a significant number of usage cases have
> > been left out, e.g. the mentioned use of plain text for styled custom
> > vulgar fractions is a recognized practice, but stays persistently
> > excluded from TUS.
>
> I don't understand the relevance to vulgar fractions.

Vulgar fractions represented using super- and subscript digits around the FRACTION SLASH U+2044, that kerns, are one example illustrating superscript and subscript characters in general use.
It is cited because it is the subject of a Microsoft Community wiki that is well referenced on the web:

https://answers.microsoft.com/en-us/msoffice/wiki/msoffice_word-mso_other/styled-fractions-in-windows/4a07d5fa-2484-4e39-b1f3-70bb3eb0c332

I recall again that when I launched the related 2015 thread, I was unaware of this page until close to the end of the thread, when I found and shared the link.

Vulgar fractions rather than mathematical fractions due to the slant of the fraction slash. (Though the so-called VULGAR FRACTIONs can be displayed with a horizontal bar, as TUS and Doug state (below).)

>
> Much of this thread has dealt with Basic Latin characters that have no
> superscript or subscript clones, and how their absence prevents certain
> passages from being representable in plain text. This is your basic
> debate over what constitutes plain text.

There was indeed a concern about what capabilities to recognize in plain text. But that had been settled to the extent that Unicode does not sustain attempts to fully represent styled mathematical expressions, but that a set of preformatted alphabets should be completed: superscripts lowercase (q) and uppercase, subscripts lowercase, and small caps (that take the place of subscript capitals).

Now I’m advocating the recognition of the re-use of existing modifier letters instead of new or newly modified superscripts, as well as the demand for ordinal indicators in French.

>
> As explained in the July 2015 thread about vulgar fractions, TUS
> sections 6.2 and 22.3 thoroughly explain the use of U+2044 FRACTION
> SLASH with normal "Nd" digits. If I want to write "ninety-nine and
> forty-four one-hundredths," with the non-precomposed vulgar fraction, I
> can write "99 44⁄100" and be fully compliant with the Standard. This
> has nothing to do with what is and isn't plain text.

This and the spelling with SOLIDUS are referred to as fallback.
What I complain of as not mentioned in the Standard, is that U+2044
can be used with superscript and subscript digits, rather than ASCII
digits. The kerning of the FRACTION SLASH makes it fit for this use case,
and in certain high-end fonts, especially Arial Unicode MS, the result is
fully identical to precomposed fractions. This all is plain text. What
isn’t, is the use of U+2044 as a format control, as specified in that
part of the Standard. High-end software is meant to automatically apply
fraction styling when U+2044 is detected between digits.

>
> The fact that many current rendering systems can't render this correctly
> is an implementation matter, though a hard-to-fix one. (Note that the
> fallback display is perfectly readable and correct, unless you see a box
> for U+2009.)

Agreed. Here the use of superscript and subscript digits is not
indispensable to the readability. In this case, their availability
constitutes a facility for better representation, even in plain text.

>
> The fact that TUS doesn't sanction the use of U+2044 with superscript
> and subscript digits, which I imagine Marcel was alluding to, is
> irrelevant. TUS is a character encoding standard, not a glyph encoding
> standard.

The distinction between baseline digits and superscript/subscript digits
is in my opinion not a glyphic issue, since in Unicode they all are
available as distinct characters.

>
> If Marcel is talking about distinguishing between horizontal and
> diagonal slashes in vulgar fractions, this is still not a question of
> plain text. However, in the emoji era, this type of presentation
> variation has become something that Unicode cares about, and so it might
> be handled in some way in the future, such as with a variation selector.
> I suspect this mechanism has been "excluded from TUS" because it doesn't
> yet exist.

I’m not talking about this, and I don’t miss it in Unicode. Some fonts
might have horizontal fraction bars. However, such a variation selector
could be handy.
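
To make the two notations discussed above concrete, here is a minimal Python sketch that tells them apart. It assumes only the standard `re` and `unicodedata` modules; the names `STANDARD_FRACTION`, `STYLED_FRACTION` and `classify` are illustrative and come neither from the Standard nor from any library. The "standard" pattern follows the TUS §6.2 wording (Nd digits, FRACTION SLASH, Nd digits); the "styled" pattern is only the superscript/subscript convention debated in this thread.

```python
import re
import unicodedata

FRACTION_SLASH = "\u2044"

# TUS 6.2 "standard form": decimal digits (gc=Nd), FRACTION SLASH, decimal digits.
# In Python 3 str patterns, \d matches characters of General_Category Nd only.
STANDARD_FRACTION = re.compile(rf"\d+{FRACTION_SLASH}\d+")

# The convention debated in this thread (not sanctioned by TUS):
# superscript digits, FRACTION SLASH, subscript digits.
SUPERSCRIPTS = "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079"
SUBSCRIPTS = "".join(chr(0x2080 + d) for d in range(10))
STYLED_FRACTION = re.compile(rf"[{SUPERSCRIPTS}]+{FRACTION_SLASH}[{SUBSCRIPTS}]+")


def classify(fraction):
    """Return 'standard', 'styled' or 'other' for a candidate fraction string."""
    if STANDARD_FRACTION.fullmatch(fraction):
        return "standard"
    if STYLED_FRACTION.fullmatch(fraction):
        return "styled"
    return "other"


if __name__ == "__main__":
    for sample in ("44\u2044100", "\u2074\u2074\u2044\u2081\u2080\u2080", "44/100"):
        categories = [unicodedata.category(ch) for ch in sample]
        print(classify(sample), categories)
```

The superscript and subscript digits report General_Category `No` rather than `Nd`, which is why the first pattern does not match them and why software that only implements the §6.2 form would leave the styled sequence unformatted.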
The plain text custom fractions are IMO a good example of the re-use of
superscript and subscript characters. Moreover, I thought that the
fraction slash had been encoded to work with them, until I learned in TUS
that this was not intended. The 2015 thread brought up that the observed
synergy is due to an initiative of the font designer(s). The fact that
this happened in a font that claims conformity to the Standard seems to
me non-trivial.

Marcel

From markus.icu at gmail.com Wed Jan 4 17:40:15 2017
From: markus.icu at gmail.com (Markus Scherer)
Date: Wed, 4 Jan 2017 15:40:15 -0800
Subject: IdnaTest.txt and RFC 5893
In-Reply-To: <44B684E5-3EC7-43DE-8BFE-19935FEC8946@alastairs-place.net>
References: <44B684E5-3EC7-43DE-8BFE-19935FEC8946@alastairs-place.net>
Message-ID:

On Wed, Jan 4, 2017 at 2:28 AM, Alastair Houghton <alastair at alastairs-place.net> wrote:

> RFC 5893 seems pretty clear to me, and the problem really is that the test
> vectors (which come from unicode.org) seem (to me) to be incorrect.

https://tools.ietf.org/html/rfc5893#section-2 says "*The following rule*,
consisting of six conditions, *applies to labels* in Bidi domain names."

That's what the ICU code does -- applying the rule to each label -- and I
assume that's the basis for the test data.

The latter part of this RFC section says that *if* certain conditions are
met *for all labels, then* the domain name as a whole displays well.

ICU does not currently check for multi-label bidi combinations.

FYI the ICU checkLabelBiDi() code is currently here (Java version).

markus
From doug at ewellic.org Wed Jan 4 18:33:06 2017
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 04 Jan 2017 17:33:06 -0700
Subject: Superscript and Subscript Characters in General Use
Message-ID: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com>

Marcel Schneider wrote:

>> I don't understand the relevance to vulgar fractions.
>
> Vulgar fractions represented using super- and subscript digits around
> the FRACTION SLASH U+2044

Don't do that.

The fact that someone, even a Microsoft MVP, posted an article about
this glyph hack does not make it a good idea. It's kind of like making a
grinning frog or caterpillar out of Telugu letters.

> What I complain of as not mentioned in the Standard, is that U+2044
> can be used with superscript and subscript digits, rather than ASCII
> digits.

Almost any character(s) in Unicode "can be" used with almost any other.
You can surround U+2044 with emoji if you like. That doesn't mean you
should.

--
Doug Ewell | Thornton, CO, US | ewellic.org

From mark at kli.org Wed Jan 4 19:54:17 2017
From: mark at kli.org (Mark E. Shoulson)
Date: Wed, 4 Jan 2017 20:54:17 -0500
Subject: Soyombo empty letter frame
In-Reply-To:
References:
Message-ID:

On 01/04/2017 04:18 PM, eduardo marin wrote:
>
> The Soyombo proposal is beautiful, but it is missing a very important
> character in my opinion:
> http://www.unicode.org/L2/L2015/15004-soyombo.pdf
>
>
> Encoding an empty letter frame will allow for its proper description
> in plain text (as it is clear in the proposal itself), it could be
> used as an stylized cursor in text processors and also we could make
> zwj sequences such that combining with consonants makes it only render
> the nucleus.
>

According to the proposal:

In the proposed encoding a combination of frame and nucleus is
considered an atomic letter.... This approach enhances the
conceptualization and identification of letters in the script; for
instance, the letter “ka”
refers inherently to the fully-formed (X) and not to the nucleus (X).

In other words, they are explicitly rejecting the model considering the
"frame" as an item in its own right. I realize that you are not calling
for redefining all the letters in terms of frame+nucleus, but encoding
the frame seems to be something the proposers deliberately decided
against doing. In calling for encoding the frame (and why just one frame?
Wouldn't you want both the "closed" and "open" ones?), I think you really
are going against what seems to be a design principle of the proposers.
Which of course you are completely entitled to do: just that you probably
are better off talking it over with the proposers directly, to learn
their thinking and so they can learn yours.

~mark

From pandey at umich.edu Wed Jan 4 20:31:11 2017
From: pandey at umich.edu (Anshuman Pandey)
Date: Wed, 4 Jan 2017 21:31:11 -0500
Subject: Soyombo empty letter frame
In-Reply-To:
References:
Message-ID: <8D4E2FA2-3FCC-4B0B-AE38-5F20EC6A3DAE@umich.edu>

> On Jan 4, 2017, at 8:54 PM, Mark E. Shoulson wrote:
>
>> On 01/04/2017 04:18 PM, eduardo marin wrote:
>> The Soyombo proposal is beautiful, but it is missing a very important character in my opinion: http://www.unicode.org/L2/L2015/15004-soyombo.pdf
>>
>> Encoding an empty letter frame will allow for its proper description in plain text (as it is clear in the proposal itself), it could be used as an stylized cursor in text processors and also we could make zwj sequences such that combining with consonants makes it only render the nucleus.
>
> According to the proposal:
>
> In the proposed encoding a combination of frame and nucleus is considered an atomic letter.... This approach enhances the conceptualization and identification of letters in the script; for instance, the letter “ka” refers inherently to the fully-formed (X) and not to the nucleus (X).
> In other words, they are explicitly rejecting the model considering the "frame" as an item in its own right. I realize that you are not calling for redefining all the letters in terms of frame+nucleus, but encoding the frame seems to be something the proposers deliberately decided against doing. In calling for encoding the frame (and why just one frame? Wouldn't you want both the "closed" and "open" ones?), I think you really are going against what seems to be a design principle of the proposers. Which of course you are completely entitled to do: just that you probably are better off talking it over with the proposers directly, to learn their thinking and so they can learn yours.
>
> ~mark

As the author of the Soyombo proposal, I should like to say that I did
indeed consider proposing the two frames for encoding as "pedagogical"
characters. I did not mention the possibility of such in the proposal,
but the present discussion persuades me to reinvestigate the issue. I'd
be happy to hear the opinion of others.

All the best,
Anshuman

From charupdate at orange.fr Wed Jan 4 23:56:58 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Thu, 5 Jan 2017 06:56:58 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com>
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com>
Message-ID: <2039959835.345.1483595818937.JavaMail.www@wwinf1k39>

On Wed, 04 Jan 2017 12:20:14 -0700, Doug Ewell wrote:
>
> Marcel Schneider wrote:
>
> >> I don't understand the relevance to vulgar fractions.
> >
> > Vulgar fractions represented using super- and subscript digits around
> > the FRACTION SLASH U+2044
>
> Don't do that.
>
> The fact that someone, even a Microsoft MVP, posted an article about
> this glyph hack does not make it a good idea.
I found it a good idea long before I found and read the article.[1] It is
very coherent, and seemed to me the best way to make sense of the
fraction slash in a character encoding standard that does things
seriously. Since I’ve read the article, I’m glad that a Microsoft MVP
worked out solutions to help people who have incomplete keyboard layouts.
Several readers were so kind as to comment on the usefulness of the
article and the shared data.

> It's kind of like making a
> grinning frog or caterpillar out of Telugu letters.

I don’t think that Telugu art and ASCII art could be compared to writing
numbers with fractions made of superscripts and subscripts. Perhaps there
is a difference between Telugu art and ASCII art in that ASCII is more
common, but the availability of super-/subscript Western Arabic digits
should not be compared to the availability of a rather uncommon script.

>
> > What I complain of as not mentioned in the Standard, is that U+2044
> > can be used with superscript and subscript digits, rather than ASCII
> > digits.
>
> Almost any character(s) in Unicode "can be" used with almost any other.
> You can surround U+2044 with emoji if you like. That doesn't mean you
> should.

Not to represent vulgar fractions in a legible way.

Superscript and subscript digits are particular in that they have
compatibility mappings to ASCII digits, so that they are not only human
readable, but machine readable. See TUS §22.4 [2].

As for “readability for the human reader” (NamesList, header), vulgar
fractions represented using superscripts-FRACTION SLASH-subscripts also
have the advantage of being stable across environments, unless some
characters are not supported, in which case they can be parsed and
replaced with formatted ASCII-based fractions, e.g. before the text is
pasted into an ANSI-encoded form (that replaces with '?'). And they meet
user expectations.
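
The "parsed and replaced" step mentioned above could look roughly like the following Python sketch. It relies on the compatibility mappings of the superscript and subscript digits via `unicodedata.normalize("NFKC", ...)`; the function name `to_ascii_fallback` and the choice of a plain space before the fraction are assumptions made for illustration, not anything specified in TUS.

```python
import re
import unicodedata

FRACTION_SLASH = "\u2044"
SUPERSCRIPTS = "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079"
SUBSCRIPTS = "".join(chr(0x2080 + d) for d in range(10))

# Superscript digits, FRACTION SLASH, subscript digits (the thread's convention).
STYLED = re.compile(rf"([{SUPERSCRIPTS}]+){FRACTION_SLASH}([{SUBSCRIPTS}]+)")


def to_ascii_fallback(text):
    """Replace styled fractions with an ASCII 'n/d' fallback (illustrative only)."""

    def replace(match):
        # NFKC applies the <super>/<sub> compatibility mappings to ASCII digits.
        numerator = unicodedata.normalize("NFKC", match.group(1))
        denominator = unicodedata.normalize("NFKC", match.group(2))
        # Keep a preceding integer part apart, so "99" + "44/100" is not
        # read as "9944/100" once the superscripts become baseline digits.
        start = match.start()
        separator = " " if start > 0 and text[start - 1].isdecimal() else ""
        return separator + numerator + "/" + denominator

    return STYLED.sub(replace, text)


if __name__ == "__main__":
    print(to_ascii_fallback("99\u2074\u2074\u2044\u2081\u2080\u2080"))
    print(unicodedata.digit("\u2087"))
```

Normalizing the whole string with NFKC in one pass would merge the integer part and the numerator, which is why the sketch normalizes the two digit runs separately and inserts a separator itself.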
Preformatted fractions are so much in demand that the most frequent of
them were encoded in early standards and included in national keyboard
layouts. They entered Unicode for roundtrip compatibility [3]. That
means this is not the specific Unicode way of representing fractions,
obviously because of the limited number of those fractions.

Now, the common denominator of the Unicode scheme and the user
expectations is to represent vulgar fractions using preformatted
super-/subscripts along with the (accurately kerning) FRACTION SLASH.
Therefore (again) that has been implemented in fonts like Arial Unicode
MS. The stability of this representation scheme prevents content
corruption (see the counter-examples in TUS below, where the PDF tool
used arbitrary characters mapped to special fonts; though that is another
(already discussed) issue [3]).

I suggest that the specification of the fraction slash in TUS [4] be
updated. It has remained roughly unchanged since version 2.0 (the other
one that I’ve checked). First, U+2044 should be used where applicable
(actually there is still U+002F). There should be *two* “standard form[s]
of a fraction built using the fraction slash”. Further we read that “the
displaying software is […] mapping the fraction to a unit”. Does that
mean that the preformatted fraction is substituted if available? Or
should it read “_formatting_ the fraction _as_ a unit”? I note, too, that
typically the software waits for the digit-slash-digit sequence to be
selected and fraction formatting to be applied on request, so this could
eventually be mentioned, given that the fraction slash is even more
uncommon on keyboards than the complete range of super- and subscript
digits.

Regards,

Marcel

[1] Styled Fractions in Windows, Created by Jeeped, July 18, 2013, MVP, Wiki Author:
https://answers.microsoft.com/en-us/msoffice/wiki/msoffice_word-mso_other/styled-fractions-in-windows/4a07d5fa-2484-4e39-b1f3-70bb3eb0c332

[2] TUS 9.0, §22.4, p.
786:
|
| Parsing of Superscript and Subscript Digits. In the Unicode Character Database, superscript
| and subscript digits have not been given the General_Category property value
| Decimal_Number (gc=Nd), so as to prevent expressions like 2³ from being interpreted like
| 23 by simplistic parsers. This should not be construed as preventing more sophisticated
| numeric parsers, such as general mathematical expression parsers, from correctly identifying
| these compatibility superscript and subscript characters as digits and interpreting them
| appropriately. See also the discussion of digits in Section 22.3, Numerals.
|
http://www.unicode.org/versions/Unicode9.0.0/ch22.pdf#G46374

[3] TUS 9.0, §22.3, p. 784:
|
| Fractions
|
| The Number Forms block (U+2150..U+218F) contains a series of vulgar fraction characters,
| encoded for compatibility with legacy character encoding standards. These characters
| are intended to represent both of the common forms of vulgar fractions: forms with a
| right-slanted division slash, such as G, as shown in the code charts, and forms with a horizontal
| division line, such as H, which are considered to be alternative glyphs for the same
| fractions, as shown in Figure 22-8. A few other vulgar fraction characters are located in the
| Latin-1 block in the range U+00BC..U+00BE.
|
| Figure 22-8. Alternate Forms of Vulgar Fractions
|
| G H
|
| The unusual fraction character, U+2189 vulgar fraction zero thirds, […]
|
| The vulgar fraction characters are given compatibility decompositions using U+2044 “/”
| fraction slash. Use of the fraction slash is the more generic way to represent fractions in
| text; it can be used to construct fractional number forms that are not included in the collections
| of vulgar fraction characters. For more information on the fraction slash, see “Other
| Punctuation” in Section 6.2, General Punctuation.
|
http://www.unicode.org/versions/Unicode9.0.0/ch22.pdf#G46039

[4] TUS 9.0, §6.2, p. 277:
|
| Fraction Slash.
U+2044 fraction slash is used between digits to form numeric fractions,
| such as 2/3 and 3/9. The standard form of a fraction built using the fraction slash is defined
| as follows: any sequence of one or more decimal digits (General Category = Nd), followed
| by the fraction slash, followed by any sequence of one or more decimal digits. Such a fraction
| should be displayed as a unit, such as ¾ or !. The precise choice of display can depend
| on additional formatting information.
|
| If the displaying software is incapable of mapping the fraction to a unit, then it can also be
| displayed as a simple linear sequence as a fallback (for example, 3/4). If the fraction is to be
| separated from a previous number, then a space can be used, choosing the appropriate
| width (normal, thin, zero width, and so on). For example, 1 + thin space + 3 + fraction
| slash + 4 is displayed as 1¾.
|
http://www.unicode.org/versions/Unicode9.0.0/ch06.pdf#G2000

From moyogo at gmail.com Thu Jan 5 01:22:39 2017
From: moyogo at gmail.com (Denis Jacquerye)
Date: Thu, 05 Jan 2017 07:22:39 +0000
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <2039959835.345.1483595818937.JavaMail.www@wwinf1k39>
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com> <2039959835.345.1483595818937.JavaMail.www@wwinf1k39>
Message-ID:

On Thu, 5 Jan 2017 at 06:03 Marcel Schneider wrote:

> On Wed, 04 Jan 2017 12:20:14 -0700, Doug Ewell wrote:
> >
> > Marcel Schneider wrote:
> >
> > >> I don't understand the relevance to vulgar fractions.
> > >
> > > Vulgar fractions represented using super- and subscript digits around
> > > the FRACTION SLASH U+2044
> >
> > Don't do that.
> >
> > The fact that someone, even a Microsoft MVP, posted an article about
> > this glyph hack does not make it a good idea.
>
> I found it a good idea long before I found and read the article.
>
>
It is not such a good idea, if at all.
Superscript and subscript are not the same thing as denominator and
numerators. Many fonts make the difference and ½ or 1⁄2 or 1/2 will not
look like ¹/₂ or ¹⁄₂ in many cases.

From alastair at alastairs-place.net Thu Jan 5 03:46:53 2017
From: alastair at alastairs-place.net (Alastair Houghton)
Date: Thu, 5 Jan 2017 09:46:53 +0000
Subject: IdnaTest.txt and RFC 5893
In-Reply-To:
References: <44B684E5-3EC7-43DE-8BFE-19935FEC8946@alastairs-place.net>
Message-ID:

On 4 Jan 2017, at 23:40, Markus Scherer wrote:
>
> On Wed, Jan 4, 2017 at 2:28 AM, Alastair Houghton wrote:
> RFC 5893 seems pretty clear to me, and the problem really is that the test vectors (which come from unicode.org) seem (to me) to be incorrect.
>
> https://tools.ietf.org/html/rfc5893#section-2 says "The following rule, consisting of six conditions, applies to labels in Bidi domain names."
>
> That's what the ICU code does -- applying the rule to each label -- and I assume that's the basis for the test data.

Absolutely. But the crucial part is “in Bidi domain names”. That is, it
applies to *all* labels that are part of a Bidi domain name, not just RTL
labels. It did not say “applies to RTL labels in Bidi domain names” and
in fact even explicitly states that (in the first bullet point at the end
of section 2):

   ...Note that even LTR labels and pure ASCII labels have to be tested.

Not to mention the fact that parts 5 and 6 of the rule apply specifically
to LTR labels.

So it’s quite clear that given the domain name “0à.א”, both “א” *and*
“0à” need to be checked using the Bidi Rule. Unless someone can explain
why “0à” does not fail the test, surely we all agree that line 74 is
incorrect:

> B; 0à.\u05D0; ; xn--0-sfa.xn--4db # 0à.א

and similarly with line 93:

> B; àˇ.\u05D0; ; xn--0ca88g.xn--4db # àˇ.א

> ICU does not currently check for multi-label bidi combinations.
I was a bit puzzled by this, because the code clearly does (both in the
C++ and Java versions) and yet the online demo doesn’t appear to object
to the above test cases. So I wrote a quick test program against the C++
version of ICU 58.2 and fed it both test cases, and, sure enough, ICU
agrees that there is a BiDi error in both of the above cases.

Kind regards,

Alastair.

--
http://alastairs-place.net

From charupdate at orange.fr Thu Jan 5 05:33:49 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Thu, 5 Jan 2017 12:33:49 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To:
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com> <2039959835.345.1483595818937.JavaMail.www@wwinf1k39>
Message-ID: <104200337.5858.1483616029614.JavaMail.www@wwinf1k39>

On Thu, 05 Jan 2017 07:22:39 +0000, Denis Jacquerye wrote:
>
> On Thu, 5 Jan 2017 at 06:03 Marcel Schneider wrote:
>
> > On Wed, 04 Jan 2017 12:20:14 -0700, Doug Ewell wrote:
> > >
> > > Marcel Schneider wrote:
> > >
> > > >> I don't understand the relevance to vulgar fractions.
> > > >
> > > > Vulgar fractions represented using super- and subscript digits around
> > > > the FRACTION SLASH U+2044
> > >
> > > Don't do that.
> > >
> > > The fact that someone, even a Microsoft MVP, posted an article about
> > > this glyph hack does not make it a good idea.
> >
> > I found it a good idea long before I found and read the article.
> >
> >
> It is not such a good idea, if at all. Superscript and subscript are not
> the same thing as denominator and numerators. Many fonts make the
> difference and ½ or 1⁄2 or 1/2 will not look like ¹/₂ or ¹⁄₂ in many
> cases.

Indeed I remember that conclusion from the 2015 thread. If the fraction
formatting facility is available, it should be used. If it isn’t, I’d
suggest not to leave the ASCII fallbacks, but to use super- and
subscripts instead.
This still seems an overall second-best solution, that may turn into best
solution depending on the font used. If Arial Unicode MS is used (though
it is no longer a part of new Windows versions), it really looks exactly
like preformatted fractions in the same font. But I can understand that
denominators are meant to align on the baseline, while subscripts are
often set slightly below.

Though sometimes suboptimal, “styled” plain text custom vulgar fractions
still offer a far better readability than their plain ASCII fallbacks. To
be consistent, fractions could be represented throughout this way in a
given document, avoiding the mix-up of preformatted fractions with
precomposed fractions.

Marcel

From charupdate at orange.fr Thu Jan 5 05:38:58 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Thu, 5 Jan 2017 12:38:58 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <20170104221200.2a04ba12@JRWUBU2>
References: <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi> <20161227000522.7bb95f3e@JRWUBU2> <230214984.10599.1482854625210.JavaMail.www@wwinf1p10> <20161229014759.5a51c747@JRWUBU2> <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12> <20161230123727.52a00633@JRWUBU2> <5CF85CCB-4321-4620-BE57-C7914C84BCC7@crissov.de> <551789226.12010.1483218242564.JavaMail.www@wwinf1p10> <842947588.29513.1483489492387.JavaMail.www@wwinf1p19> <3ADB5847-528D-45B0-A963-F0CACC7A69E9@gmail.com> <20170104221200.2a04ba12@JRWUBU2>
Message-ID: <1043493870.5967.1483616338879.JavaMail.www@wwinf1k39>

On Wed, 4 Jan 2017 00:36:38 -0500, John W Kennedy wrote to Asmus Freytag:

> As long as this is being discussed, what about the historic practice of using
> M‘ (nowadays often seen as M’ instead) in Scottish names – e.g., M‘Donald – as a
> typographic substitute for M(superscript c)?

My first idea at reading was, that this adds to the examples of character
re-use from lack of appropriate characters on the keyboard (or in the
typecase, as you explained later).
On Tue, 3 Jan 2017 22:48:09 -0800, Asmus Freytag (c) replied:

> What about it? There are dozens, perhaps hundreds of fallbacks that have
> been used over time, both in hot metal typography as well as with
> typewriters or digital systems. Some practices may have started in ways
> similar to a fallback, but have now evolved into standard practice.
> Other ones remain fallbacks or went out of fashion.
>
> It's an interesting example, but what kind of discussion did you have in
> mind?

This outlines the principle of user choices that may supersede standard
preferences. So I’m picking up that the accurate concept is FALLBACK. It
probably expands to the rule that every character may be re-used as a
fallback for any other character (unformatted or formatted) if this meets
user expectations and preferences.

Consequently, the entire ranges of modifier letters, punctuation, symbols
and other characters can be used as fallbacks to write superscript
abbreviations. Some of them are obviously more appropriate than others:
MODIFIER LETTER SMALL C would better fit this use case but was
unavailable somewhere (typecase, charset, keyboard) and thus was not
retained, while the second-best (single open-quote, or MODIFIER LETTER
TURNED COMMA, as Denis Jacquerye suggests) was used.

So we have the confirmation that it’s up to the users and their keyboard
layout providers and font designers to choose the best fitting fallbacks
among the existing Unicode characters. For lack of anything better,
MODIFIER LETTER SMALL E is the designated fallback candidate for the
hypothetical/on-coming/up-coming French ordinal indicator
_kind-of-'ᵉ'_, and the other three ordinal indicator fallbacks '?', '?'
and '?' are also readily available.
Citing again another example: To represent the French abbreviation of
“number”, MODIFIER LETTER SMALL O would better fit than the widely used
DEGREE SIGN, that is the only one available on the current keyboard among
these two, while the RING ABOVE would be too small, MASCULINE ORDINAL
INDICATOR (sometimes used as another fallback by people who have it on
the keyboard) has often an underline that is unpreferred in French, and
SUPERSCRIPT ZERO is somewhat too big:

        nᵒ - n° - n˚ - nº - n⁰

The fallback scheme applies to custom vulgar fractions as well: their
representation with super-/subscript digits has seemingly the status of a
fallback, during the time when it’s not yet recognized as alternate
standard representation. I mean that officially, it is likely to be
considered a fallback, while in practice, it has already become a working
solution.

Further, on Wed, 4 Jan 2017 22:12:00 +0000, Richard Wordingham wrote:
>
> > 2017-01-04 12:44 GMT+01:00 John W Kennedy :
>
> > > No it isn’t. It isn’t an apostrophe; it’s a left single quote,
> > > although some modern printers mistakenly suppose it to be an
> > > apostrophe, and substitute one. And it isn’t an elision; it’s meant
> > > as a substitute glyph for a superscript c.
>
> For which I would suggest U+02BF MODIFIER LETTER LEFT HALF RING would
> be the best modern representative of the substitute character!

While I’d thought a while about the left half ring (having it on the
keyboard, in group 3), when trying it I found it too tiny:
M?Donald - M?Donald.

Why a representative of a substitute? Probably because MODIFIER LETTER
SMALL C is already used as a substitute of superscript small C, despite
the Standard specifying that the modifier letters are not [intended as] a
substitute for this. So the left half ring might be considered the best
representative, while the best modern *solution* for a substitute would
really be the modifier letter:

        M?Donald - M?Donald ( - M?Donald).
>
> Of course, that would further increase confusion of those who initially
> read U+02BF as a superscript 'c', and only later, if ever, realise that
> it's actually a rough breathing carefully distinguished from the
> similar punctuation marks.

Indeed it would be a pity to stick with alternatives and worst case
fallbacks if a better solution is readily available. Among MODIFIER
LETTERs, TURNED COMMA is already a “typographical alternative for”
REVERSED COMMA and LEFT HALF RING, so that it could seem consistent that
SMALL C be a typographical alternative for superscript small c, knowing
that the (probably) only thing that matters of a fallback, is whether it
evolves into standard practice, remains a fallback, or goes out of
fashion.

E.g., the DEGREE SIGN had evolved into standard practice as a
(representative of the) substitute for superscript small o, but could
perhaps go out of fashion when a comprehensive set of MODIFIER SMALL
letters can be easily accessed on standard keyboards, in the best case
completed with automatic sequences for 'nᵒ' and 'Nᵒ', that have the
advantage over the degree sign that they can easily be complemented with
a plural s: 'nᵒˢ' and 'Nᵒˢ'.

Marcel

From asmusf at ix.netcom.com Thu Jan 5 05:55:06 2017
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Thu, 5 Jan 2017 03:55:06 -0800
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com>
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com>
Message-ID: <52a425a1-47b7-f63d-8526-c2799897bccf@ix.netcom.com>

An HTML attachment was scrubbed...
URL:

From asmusf at ix.netcom.com Thu Jan 5 05:56:15 2017
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Thu, 5 Jan 2017 03:56:15 -0800
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <104200337.5858.1483616029614.JavaMail.www@wwinf1k39>
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com> <2039959835.345.1483595818937.JavaMail.www@wwinf1k39> <104200337.5858.1483616029614.JavaMail.www@wwinf1k39>
Message-ID: <95570a60-20f5-6f7e-ccce-788f7b7741c5@ix.netcom.com>

An HTML attachment was scrubbed...
URL:

From mark at macchiato.com Thu Jan 5 09:55:47 2017
From: mark at macchiato.com (Mark Davis ☕️)
Date: Thu, 5 Jan 2017 16:55:47 +0100
Subject: IdnaTest.txt and RFC 5893
In-Reply-To:
References: <44B684E5-3EC7-43DE-8BFE-19935FEC8946@alastairs-place.net>
Message-ID:

Alastair, thanks for finding it and bringing it up. I think you're right
that the problem is in that the test generation code doesn't properly
apply the bidi criteria to *all* the labels if *any* of the labels are
RTL, but instead is probably just going on a label-by-label basis.
Thankfully, it looks like ICU does handle it right, by your note. (The
test file generation doesn't use the ICU code.)

Could you please report this via http://www.unicode.org/reporting.html so
that we make sure that it is tracked and brought up to the UTC?

Mark

On Thu, Jan 5, 2017 at 10:46 AM, Alastair Houghton <alastair at alastairs-place.net> wrote:

> On 4 Jan 2017, at 23:40, Markus Scherer wrote:
> >
> > On Wed, Jan 4, 2017 at 2:28 AM, Alastair Houghton <alastair at alastairs-place.net> wrote:
> > RFC 5893 seems pretty clear to me, and the problem really is that the
> > test vectors (which come from unicode.org) seem (to me) to be incorrect.
> >
> > https://tools.ietf.org/html/rfc5893#section-2 says "The following rule,
> > consisting of six conditions, applies to labels in Bidi domain names."
> >
> > That's what the ICU code does -- applying the rule to each label -- and
> > I assume that's the basis for the test data.
>
> Absolutely. But the crucial part is “in Bidi domain names”. That is, it
> applies to *all* labels that are part of a Bidi domain name, not just RTL
> labels. It did not say “applies to RTL labels in Bidi domain names” and in
> fact even explicitly states that (in the first bullet point at the end of
> section 2):
>
>    ...Note that even LTR labels and pure ASCII labels have to be tested.
>
> Not to mention the fact that parts 5 and 6 of the rule apply specifically
> to LTR labels.
>
> So it’s quite clear that given the domain name “0à.א”, both “א” *and* “0à”
> need to be checked using the Bidi Rule. Unless someone can explain why
> “0à” does not fail the test, surely we all agree that line 74 is incorrect:
>
> > B; 0à.\u05D0; ; xn--0-sfa.xn--4db # 0à.א
>
> and similarly with line 93:
>
> > B; àˇ.\u05D0; ; xn--0ca88g.xn--4db # àˇ.א
>
> > ICU does not currently check for multi-label bidi combinations.
>
> I was a bit puzzled by this, because the code clearly does (both in the
> C++ and Java versions) and yet the online demo doesn’t appear to object to
> the above test cases. So I wrote a quick test program against the C++
> version of ICU 58.2 and fed it both test cases, and, sure enough, ICU
> agrees that there is a BiDi error in both of the above cases.
>
> Kind regards,
>
> Alastair.
>
> --
> http://alastairs-place.net
>
>
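
The reading of RFC 5893 agreed on in this exchange (the six-condition Bidi Rule applies to every label as soon as any label is RTL) can be sketched in Python as follows. This is only an illustration built on `unicodedata.bidirectional`; it is not the ICU implementation and not the code that generates IdnaTest.txt, and the function names are hypothetical. The sample label "0à" is the one encoded as `xn--0-sfa` in the quoted test line.

```python
import unicodedata

# Bidi classes permitted by conditions 2 and 5 of the RFC 5893 Bidi Rule.
RTL_ALLOWED = {"R", "AL", "AN", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}
LTR_ALLOWED = {"L", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}


def bidi_classes(label):
    """Bidi class of each character in a (Unicode-form) label."""
    return [unicodedata.bidirectional(ch) for ch in label]


def is_bidi_domain(labels):
    """A Bidi domain name has at least one label containing R, AL or AN."""
    return any(c in ("R", "AL", "AN") for lb in labels for c in bidi_classes(lb))


def label_passes_bidi_rule(label):
    """Apply the six conditions of the Bidi Rule to a single label."""
    classes = bidi_classes(label)
    if not classes or classes[0] not in ("L", "R", "AL"):
        return False                                   # condition 1
    trimmed = list(classes)
    while trimmed[-1] == "NSM":                        # ignore trailing NSM
        trimmed.pop()
    last = trimmed[-1]
    if classes[0] in ("R", "AL"):                      # RTL label
        if not set(classes) <= RTL_ALLOWED:
            return False                               # condition 2
        if last not in ("R", "AL", "EN", "AN"):
            return False                               # condition 3
        return not ("EN" in classes and "AN" in classes)  # condition 4
    if not set(classes) <= LTR_ALLOWED:                # LTR label
        return False                                   # condition 5
    return last in ("L", "EN")                         # condition 6


def domain_passes_bidi_rule(domain):
    """Check every label, but only when the name is a Bidi domain name."""
    labels = domain.split(".")
    if not is_bidi_domain(labels):
        return True
    return all(label_passes_bidi_rule(lb) for lb in labels)


if __name__ == "__main__":
    print(domain_passes_bidi_rule("0\u00e0.\u05d0"))   # label "0à" starts with EN
```

Under this reading the first test line fails because its LTR label begins with a European digit, which violates condition 1, even though the RTL label on its own is fine.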
URL:

From charupdate at orange.fr Thu Jan 5 12:43:32 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Thu, 5 Jan 2017 19:43:32 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <95570a60-20f5-6f7e-ccce-788f7b7741c5@ix.netcom.com>
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com> <2039959835.345.1483595818937.JavaMail.www@wwinf1k39> <104200337.5858.1483616029614.JavaMail.www@wwinf1k39> <95570a60-20f5-6f7e-ccce-788f7b7741c5@ix.netcom.com>
Message-ID: <1818698400.22051.1483641812457.JavaMail.www@wwinf1p21>

On Thu, 5 Jan 2017 03:56:15 -0800, Asmus Freytag wrote:
>
> On 1/5/2017 3:33 AM, Marcel Schneider wrote:
> >
> > If Arial Unicode MS is used (though it is no longer
> > a part of new Windows versions), it really looks exactly like preformatted
> > fractions in the same font. But I can understand that denominators are meant
> > to align on the baseline, while subscripts are often set slightly below.
>
> That's just the kind of issue that you will run into with undisciplined hacks.
>
> Just... don't.

So that cannot be recommended for general use, even outside publishing software. The remaining question is about the readability of drafts and the like. From now on, when I have to choose between writing a fraction as '2/7' or as '²⁄₇', should I always use ASCII only? I’m thinking of an e-mail like this one. I still don’t understand why the unformatted fraction should be better than the preformatted presentation (even when the latter is suboptimal).

I still believe that keyboard layout developers owe it to users to provide every character of a given script, along with the related sets of numerals, generic punctuation and symbols, so that the end user can choose whatever effect he intends to produce. Since keyboards shape the practice, people are probably best served when the layout allows everybody to adapt to all use cases.
Earlier on Thu, 5 Jan 2017 03:55:06 -0800, Asmus Freytag wrote:
>
> On 1/4/2017 4:33 PM, Doug Ewell wrote:
> >
> > > What I complain of as not mentioned in the Standard, is that U+2044
> > > can be used with superscript and subscript digits, rather than ASCII
> > > digits.
> >
> > Almost any character(s) in Unicode "can be" used with almost any other.
> > You can surround U+2044 with emoji if you like. That doesn't mean you
> > should.
>
> This is a key point.
>
> You can use many code points to get some "effect", but that doesn't mean
> it represents good practice or should be recommended.

This is particularly true for the French use of DEGREE SIGN as a superscript o, which 99 % of users are said to type to get the abbreviation 'n°', or 'r°', 'v°', 'f°'. It doesn’t look really bad, it is stable, and it is easy to input. The downside shows at the latest when a plural s has to be appended. And even before that, it’s poor typography, because depending on the font, the degree sign may look very different from a real superscript o. In this respect, the modifier letter o is far better.

> There are no "traffic cops" out there that will flag you down for having made
> a poor decision, but that's not a reason enough to endorse random suggestions.
>
> This goes particularly for practices that need support in systems and/or fonts to work
> correctly. If some implementer supports the recommended normal size digits for 2044
> why should they do the additional work of making sure it works for super/sub script.

If implementers really do support the fraction slash U+2044 as triggering authentic fraction formatting, then they may spare themselves the extra work. But this feature is uncommon enough that the fallback options deserve serious thought.
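To make one such fallback concrete, here is a minimal Python sketch (not taken from any implementation mentioned in this thread). It folds a fraction typed with superscript and subscript digits around U+2044 into a plain linear ASCII sequence. The name `linear_fallback` and the decision to touch only digits adjacent to a fraction slash are assumptions made for the example.

```python
import re

# The ten superscript and ten subscript digits, in order 0..9.
SUPERSCRIPTS = "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079"
SUBSCRIPTS = "\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089"

# Translation table: each of the twenty characters maps to its ASCII digit.
FOLD = str.maketrans(SUPERSCRIPTS + SUBSCRIPTS, "0123456789" * 2)

# numerator (ASCII or superscript digits), FRACTION SLASH,
# denominator (ASCII or subscript digits)
FRACTION = re.compile(
    "([0-9" + SUPERSCRIPTS + "]+)\u2044([0-9" + SUBSCRIPTS + "]+)"
)


def linear_fallback(text):
    """Rewrite fraction-slash sequences as ASCII 'numerator/denominator'."""
    def repl(match):
        numerator = match.group(1).translate(FOLD)
        denominator = match.group(2).translate(FOLD)
        return numerator + "/" + denominator
    return FRACTION.sub(repl, text)


if __name__ == "__main__":
    print(linear_fallback("\u00b2\u2044\u2087"))  # superscript 2, U+2044, subscript 7
```

Superscript or subscript digits that are not next to a fraction slash (as in a unit like m²) are left alone on purpose, so the folding stays limited to fractions.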
And if, despite being discouraged by all recommendations (including mine), the use of super/sub scripts does thrive, it would be a good idea to support them along with normal-size digits, all the more as this does not require much additional code (just twenty equivalence classes, I guess).

In the meantime, what options are available as a fallback? The recommendation [1] is unrealistic: a system (OS + program + font) that is unable to map digits to numerators/denominators cannot be expected to map U+2044 to U+002F either, as specified. Therefore, the fraction slash is left between the digits. Since in most proportional fonts it is kerned so that it overlaps baseline digits when displayed in between, this can hardly be recommended as a fallback. It looks good in some fonts only. Obviously, this use case is not intended.

So perhaps all users might be given the unrecommended possibility of choosing an unrecommended second-best solution. This would require making sure that everybody understands the risk of running into issues.

In any case, U+2044 ought to be on the keyboard, according to the Standard (in order to input the specified sequences). As for super/sub scripts, I think it would be a pity to keep them away. The rest could probably be considered as being up to the user. In any case, fashion is unforeseeable.

Marcel

[1] TUS 9.0, §6.2, p. 277:
|
| If the displaying software is incapable of mapping the fraction to a unit, then it can also be
| displayed as a simple linear sequence as a fallback (for example, 3/4). […]
| http://www.unicode.org/versions/Unicode9.0.0/ch06.pdf#G2000

From petercon at microsoft.com Thu Jan 5 16:35:29 2017
From: petercon at microsoft.com (Peter Constable)
Date: Thu, 5 Jan 2017 22:35:29 +0000
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <104200337.5858.1483616029614.JavaMail.www@wwinf1k39>
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com> <2039959835.345.1483595818937.JavaMail.www@wwinf1k39> <104200337.5858.1483616029614.JavaMail.www@wwinf1k39>
Message-ID:

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Marcel Schneider
Sent: Thursday, January 5, 2017 3:34 AM

> If Arial Unicode MS is used (though it is no longer a part of new Windows versions)

The Arial Unicode MS font was never included in any version of Windows. It was only ever included in Microsoft Office.

Peter

From charupdate at orange.fr Thu Jan 5 23:42:14 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Fri, 6 Jan 2017 06:42:14 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To:
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com> <2039959835.345.1483595818937.JavaMail.www@wwinf1k39> <104200337.5858.1483616029614.JavaMail.www@wwinf1k39>
Message-ID: <538089927.246.1483681334517.JavaMail.www@wwinf1p15>

On Thu, 5 Jan 2017 22:35:29 +0000, Peter Constable wrote:
>
> From: Unicode [mailto:unicode-bounces_at_unicode.org] On Behalf Of Marcel Schneider
> Sent: Thursday, January 5, 2017 3:34 AM
>
> > If Arial Unicode MS is used (though it is no longer a part of new Windows versions)
>
> The Arial Unicode MS font was never included in any version of Windows. It was only
> ever included in Microsoft Office.

I’m very sorry, and thank you for the correction. I mixed up the OS and the applications. Now it displays: “Arial Unicode MS is unavailable on this machine. Do you want to use it nevertheless?” (translated from French).
For a few days now I have known that Arial Unicode MS is one of the system fonts on macOS Snow Leopard. I’m now unable to use it on my netbook. Even existing documents no longer display well; they’re littered with .notdef boxes. When I try to get preformatted custom fractions the old way, it switches to MS Gothic for the subscripts. In 2015, I was about to buy Office 2010. Office 2013 requires too much RAM and I don’t like it.

I’m aware that many people are in roughly the same position, at least regarding the Arial Unicode MS font. So perhaps representing fractions with super/sub scripts ought to be removed from my recommendations, at least for anything more than drafts or informal papers. However, it seems to match the expectations of many people.

But that’s the least part of the topic. The main concern in this thread is the use of modifier letters as a fallback instead of ordinal indicators and for superscript in abbreviations. I agree that inside the document, formatting is much more powerful, as it doesn’t require complete fonts (and makes fine-tuning the style easy). Nevertheless, the user might prioritize the stability of the document when it comes to plain text, and he could be interested in a better-looking display of letters that elsewhere should be superscripted. Here, the modifier letters could be a ready-to-use fallback. Converting them to formatted baseline letters could be achieved with a macro in VBA. Couldn’t this be included in the next Office version as an out-of-the-box feature?
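The conversion itself is not tied to VBA. As a rough sketch only, here is the same idea in Python, under the assumption that the target markup is an HTML-like `<sup>` element. The function names are hypothetical. The sketch relies on the `<super>` compatibility decompositions in the Unicode Character Database, as exposed by Python's `unicodedata` module, and it does not distinguish abbreviations from phonetic notation, so a modifier letter such as U+02B0 would be converted as well.

```python
import unicodedata


def baseline_of(ch):
    """Return the baseline letter of a superscript modifier letter, else None."""
    if unicodedata.category(ch) != "Lm":
        return None
    parts = unicodedata.decomposition(ch).split()
    # A superscript modifier letter decomposes as e.g. "<super> 0074".
    if len(parts) == 2 and parts[0] == "<super>":
        return chr(int(parts[1], 16))
    return None


def modifiers_to_markup(text, open_tag="<sup>", close_tag="</sup>"):
    """Replace runs of superscript modifier letters with marked-up baseline letters."""
    out = []
    run = []  # baseline letters (and combining marks) of the current run
    for ch in text:
        base = baseline_of(ch)
        if base is not None:
            run.append(base)
        elif run and unicodedata.combining(ch):
            run.append(ch)  # keep a combining mark with the letter it follows
        else:
            if run:
                out.append(open_tag + "".join(run) + close_tag)
                run = []
            out.append(ch)
    if run:
        out.append(open_tag + "".join(run) + close_tag)
    return "".join(out)


if __name__ == "__main__":
    # 'n' followed by U+1D52 MODIFIER LETTER SMALL O
    print(modifiers_to_markup("n\u1d52 5"))
```

The reverse direction (markup to modifier letters) would need a list of the letters for which a modifier form exists, since not every baseline letter has one.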
Marcel

From asmusf at ix.netcom.com Fri Jan 6 02:21:29 2017
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Fri, 6 Jan 2017 00:21:29 -0800
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <538089927.246.1483681334517.JavaMail.www@wwinf1p15>
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com> <2039959835.345.1483595818937.JavaMail.www@wwinf1k39> <104200337.5858.1483616029614.JavaMail.www@wwinf1k39> <538089927.246.1483681334517.JavaMail.www@wwinf1p15>
Message-ID:

An HTML attachment was scrubbed...
URL:

From charupdate at orange.fr Fri Jan 6 08:30:19 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Fri, 6 Jan 2017 15:30:19 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To:
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com> <2039959835.345.1483595818937.JavaMail.www@wwinf1k39> <104200337.5858.1483616029614.JavaMail.www@wwinf1k39> <538089927.246.1483681334517.JavaMail.www@wwinf1p15>
Message-ID: <514716041.11559.1483713019939.JavaMail.www@wwinf1p15>

On Fri, 6 Jan 2017 00:21:29 -0800, Asmus Freytag wrote:
>
> On 1/5/2017 9:42 PM, Marcel Schneider wrote:
> >
> > Nevertheless,
> > the user might prioritize the stability of the document when it comes to plain text,
> > and he could be interested in a better-looking display of letters that elsewhere
> > should be superscripted. Here, the modifier letters could be a ready-to-use fallback
>
> The use of such hacks is destabilizing to any efforts to systematically format superscripts
> across a document.

That presupposes a rich text environment. Orthographic correctness in some languages, French among them, traditionally requires either a rich text environment or some in-line markup like TeX (at the expense of direct usability, i.e. without a LaTeX converter). That borders on non-conformance with the design principles of Unicode.
As I understand them, Unicode provides all characters that are needed to correctly spell any language. This goal remains unreached as long as the orthography of some languages cannot be entirely achieved without relying on formatting markup. (I’m aware that complex scripts require hinted fonts for glyph reordering and glyph substitution, but this still is plain text.)

The superscripting of abbreviation endings belongs to another level of correctness than arbitrary stress as expressed with italics, bold, underline (obsolete in this use), extra letter spacing (German, rather old-style), capitalization, or extra acute accents as in Dutch. This is why Karl Pentzlin [1] cited “Biblio^{que}” vs “Biblioque”, where the latter is “no valid French word.”

From this it now becomes clear that Alastair Houghton’s suggestion [2] of encoding a superscript variant selector would meet this requirement and is therefore not to be confused with a first step towards making Unicode support rich text. To say it out loud: the fact that French and a few other languages cannot be written in a correct orthography when the environment is plain text seems to me hard to accept.

> Text fonts may not support them, because for "ordinary" text, by Unicode's
> recommendation, one would use ordinary letters / digits with superscript markup.

A text font that does not support all modifier letters is less of a text font than a title font. Ornamental fonts are produced in such variety that completing them is/was economically unfeasible. I’m considering this statement rather in the past tense, because diacriticized letters are already (on request) automatically generated and added to the font at creation. If automatic superscripting isn’t implemented already, it will be soon, I suppose. So more and more (new and updated) fonts will support them.
But wherever they aren’t, a _Convert modifier letters to superscript_ feature (or an equivalent macro command) ought to be able to make the text conformant to legacy handling.

> So, by using these hacks, anytime a document is re-formatted with a different font style,
> you are in danger of either losing these to boxes, or to be faced with random font styles.

Yes, people should always be aware that the use of modifier letters has its downside, as has the use of superscripted baseline letters.

I currently write e-mails (like this one) in a text editor (Notepad++). Several features I use here are IMO missing in all e-mail clients, such as column editing, line reordering, and so on. So I appreciate being able to spell correctly in plain text, without sloppy fallbacks (i.e. baseline fallbacks for superscript). It’s a matter of making the most of the existing charset.

I believe that modifier letter fallbacks are very functional. When I paste them into an HTML mail form, the display is always correct and I don’t need to add superscript by hand throughout the mail. Furthermore, I can even use superscript in the subject.

> If you don't think that is a real problem: some (many) character pickers will insert font+code point into
> an application. These font bindings often survive and suddenly your text, when read on a different
> computer looks like a ransom note, just because the new machine has a new "default" font, and
> that is applied to all letters that don't have a specific font binding.

Basically this is a good scheme, because character pickers are typically used for symbols. There are also two kinds: local and online. I sometimes pick from the full-size PDF of the Code Charts. They’re the best character picker IMO.

> Some font pickers are "stupid" enough to do this for simple accented code points that would have
> been in the currently selected font anyway.

That’s really bad.
I know that some people write documents by picking accented letters in the special characters dialog. I can imagine that some other people use an online picker instead, partly because the word processor they’re using may be a web app. Anyhow, this is very inefficient. The reason may be that people often think either that a keyboard cannot be completed, or that completing a keyboard would make it unusable, or hard to use, or full of stickers. This is one main challenge of keyboard layout development.

> Your suggestions will just add to these problems.
> If editing in a rich text environment, work in rich text. And then lean on implementers to get
> export correct to other rich text formats....

I used to work nearly all the time in a rich text environment, and I added plenty of autocorrections to speed up writing. Today, I work most of the time in plain text. I don’t use LaTeX, but I know that it is easily exported to many other formats. PDF is a main target format. Most of the drawbacks start when the reader wishes to copy and paste some lines of a (basically searchable) PDF either to rich text or to plain text… but that is not the issue here.

I hope that my future recommendations will solve more problems than they’ll create!
Marcel

[1] Karl Pentzlin’s MODIFIER LETTER SMALL Q proposal:
http://www.unicode.org/L2/L2010/10230-modifier-q.pdf

[2] Alastair Houghton’s SUPERSCRIPT/SUBSCRIPT variant selectors suggestion:
http://www.unicode.org/mail-arch/unicode-ml/y2017-m01/0016.html

From charupdate at orange.fr Fri Jan 6 13:02:25 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Fri, 6 Jan 2017 20:02:25 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To:
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com> <2039959835.345.1483595818937.JavaMail.www@wwinf1k39> <104200337.5858.1483616029614.JavaMail.www@wwinf1k39> <538089927.246.1483681334517.JavaMail.www@wwinf1p15>
Message-ID: <828252581.19365.1483729345941.JavaMail.www@wwinf1p15>

Another important point for the modifier letter fallbacks to work (if supported) would be that fonts support diacritics combined with modifier small letters. In 2014 I requested the superscript small '?' (not noticing that the intended abbreviation is incorrect), but encoding new characters like this one would be useless because it is decomposable, and it is too late anyway since the deadline is long past. But the superscript 'é' that I’ve recently mentioned is still used (in 'S^{té}' for 'Société' [Corporation], different from 'Sᵗᵉ', which is the abbreviation of 'Sainte' [Saint, feminine]); and in Spanish, superscript '?' is used, as Denis Jacquerye noted while pointing out the need to work with, and enhance support of, higher level protocols. [4]

Higher level protocols will remain recommended as the standard high-end solution, while the use of modifier letters could get the status of an alternate fallback.
Once it has it, modifier letter small q could be encoded and the whole set updated at font level for support of combining diacritics, while software may add two commands for round-trip conversion between modifier letters and superscript baseline letters, and probably between preformatted fractions and formatted fractions; I?m quite sure that all this is possible right now in VBA. I?ve added some more references to my previous mail with respect to past year?s discussion of formatting variation selectors. As there was a typo and missing line breaks (symptomatic of not using any spell checker and of editing the layout by hand in a text editor), I feel the need of letting follow the corrected version below. Best regards, Marcel On Fri, 6 Jan 2017 00:21:29 -0800, Asmus Freytag wrote: > > On 1/5/2017 9:42 PM, Marcel Schneider wrote: > > > > Nevertheless, > > the user might prioritize the stability of the document when it comes to plain text, > > and he could be interested in a better-looking display of letters that elsewhere > > should be superscripted. Here, the modifier letters could be a ready-to-use fallback > > The use of such hacks is destabilizing to any efforts to systematically format superscripts > across a document. That supposes a rich text environment. The orthographical correctness of some languages, among which French, requires traditionally either a rich text environment or some in-line markup like TeX (at the expense of direct usability, i.e. without a LaTeX converter). That is limit non-conformant to the design principles of Unicode. As I understand them, Unicode provides all characters that are needed to correctly spell any language. This goal remains unreached as long as the orthography of some languages cannot be entirely achieved without relying on formatting markup. (I?m aware that complex scripts require hinted fonts for glyph reordering and glyph substitution, but this still is plain text.) 
The superscripting of abbreviation endings belongs to another level of correctness than the arbitrary stress as expressed with italics, bold, underline (obsolete in this use), extra letter spacing (German, rather old-style), capitalization, or extra acute accents as in Dutch. This is why Karl Pentzlin [1] cited ?Biblio^{que}? vs ?Biblioque?, where the latter is ?no valid French word.? >From this it becomes now clear that Alastair Houghton?s [2] suggestion of encoding a superscript variant selector, would meet this requirement and is therefore not to be confused with the first step towards making Unicode support rich text. This was indeed the traditional argument opposed to previous similar suggestions. [3] Following the actual scheme, French and a few other languages cannot be written in a correct orthography when the environment is plain text. That seems to me hard to accept. > Text fonts may not support them, because for "ordinary" text, by Unicode's > recommendation, one would use ordinary letters / digits with superscript markup. A text font that does not support all modifier letters has less of a text font than of a title font. Ornamental fonts are produced in such a variety that completing them is/was economically unfeasible. I?m considering this statement rather in the past tense, because diacriticized letters are already (on request) automatically generated and added to the font at creation. If automatic superscripting shouldn?t already be implemented, it will be soon, I suppose. So more and more (new and updated) fonts will support them. But wherever they aren?t, a _Convert modifier letters to superscript_ feature (or an equivalent macro command) ought to be able to make the text conformant to legacy handling. > So, by using these hacks, anytime a document is re-formatted with a different font style, > you are in danger of either losing these to boxes, or to be faced with random font styles. 
Yes, people should always be aware that the use of modifier letters has its downside, as has the use of superscripted baseline letters. I currently write e-mails (like this one) in a text editor (Notepad++). Several features I use here, are IMO missing in all e-mail clients, as column editing, line reordering, and so on. So I appreciate to be able to spell correctly in plain text, without sloppy fallbacks (i.e. baseline fallbacks for superscript). It?s a matter of making the most of the existing charset. I believe that modifier letter fallbacks are very functional. When I paste them into an HTML mail form, the display is always correct and doesn?t need to add superscript by hand in the whole mail. Furthermore, I can even use superscript in the subject. > If you don't think that is a real problem: some (many) character pickers will insert font+code point into > an application. These font bindings often survive and suddenly your text, when read on a different > computer looks like a ransom note, just because the new machine has a new "default" font, and > that is applied to all letters that don't have a specific font binding. Basically this is a good scheme, because character pickers typically are used for symbols. There are also two kinds: local, and online. I sometimes pick in the full-size PDF of the Code Charts. They?re the best character picker IMO. > Some font pickers are "stupid" enough to do this for simple accented code points that would have > been in the currently selected font anyway. That?s really bad. I know that some people are writing documents by picking accented letters in the special characters dialog. I can figure out that some other people may use an online picker instead, partly because the word processor they?re using may be a web-app. Anyhow, this is very unefficient. The reason may be that one often thinks either that a keyboard cannot be completed, or that completing a keyboard would make it unusable, or hard to use, or full of stickers. 
Here?s one main challenge of keyboard layout development. > Your suggestions will just add to these problems. > If editing in a rich text environment, work in rich text. And then lean on implementers to get > export correct to other rich text formats.... I really worked nearly all the time in a rich text environment, and I added plenty of autocorrections to speed up writing. Today, I work most of the time in plain text. I don?t use LaTeX, but I know that this is easily exported to many other formats. PDF is a main target format. Most of the drawbacks start when the reader wishes to copy-paste some lines of a (basically searchable) PDF either to rich text or to plain text? but that is not the issue here. I hope that my future recommendations will solve more problems than they?ll create! Marcel [1] Karl Pentzlin?s MODIFIER LETTER SMALL Q proposal: http://www.unicode.org/L2/L2010/10230-modifier-q.pdf [2] Alastair Houghton?s SUPERSCRIPT/SUBSCRIPT variant selectors suggestion: http://www.unicode.org/mail-arch/unicode-ml/y2017-m01/0016.html [3] Re: Why incomplete subscript/superscript alphabet ? a.lukyanov http://www.unicode.org/mail-arch/unicode-ml/y2016-m10/0001.html Re: Why incomplete subscript/superscript alphabet ? Leonardo Boiko http://www.unicode.org/mail-arch/unicode-ml/y2016-m10/0013.html Re: Why incomplete subscript/superscript alphabet ? Jukka K. Korpela http://www.unicode.org/mail-arch/unicode-ml/y2016-m10/0014.html Re: Why incomplete subscript/superscript alphabet ? Steve Swales http://www.unicode.org/mail-arch/unicode-ml/y2016-m10/0015.html Re: Why incomplete subscript/superscript alphabet ? Neil Harris http://www.unicode.org/mail-arch/unicode-ml/y2016-m10/0017.html [4] Re: Why incomplete subscript/superscript alphabet ? 
Denis Jacquerye http://www.unicode.org/mail-arch/unicode-ml/y2016-m10/0037.html

From christoph.paeper at crissov.de Fri Jan 6 17:21:37 2017
From: christoph.paeper at crissov.de (Christoph Päper)
Date: Sat, 7 Jan 2017 00:21:37 +0100
Subject: WAP Pictogram Specification as Emoji Source
Message-ID:

I just discovered the WAP Pictogram specification (WAP-213-WAPInterPic), last published in April 2001 and updated in November 2001.

- http://www.wapforum.org/what/technical.htm (requires OMA credentials)
- http://www.openmobilealliance.org/tech/affiliates/wap/wap-213-wapinterpic-20010406-a.pdf

It describes a way to reference locally stored graphics using the `pict` URL scheme in WML or XHTML:

?->? ?snowman?/

Reading through section 7, Pictogram Set, it’s obvious that WAP pictograms have been unified with Japanese (i-mode) emojis upon their encoding in Unicode 6+. However, the mapping is not obvious in all cases and I think there are some pictograms that have been omitted / forgotten or could have better annotation, e.g.:

- /emotion/{trapped,tutting,shine,smell,pullFace,shakenHeart}
- /human/body/foot
- /map/{policeStation,spa,zoo}
- /sport/{sport,scuba}
- /time/event{anniversary,holiday,newYearsDay}

I can imagine a crudely equivalent Unicode emoji for almost all of them, but definitely not for Scuba Diving. I haven’t seen, or at least not recognized, any scuba gear, flipper, snorkel or diver in documentation of Japanese vendor sets.

Is there a mapping file available at the Unicode website that I’ve missed?

I haven’t found any reference or vendor-specific images, by the way, and if it wasn’t just used as an example domain anyway, pict.com seems now defunct.
From duerst at it.aoyama.ac.jp Fri Jan 6 22:12:10 2017
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Sat, 7 Jan 2017 13:12:10 +0900
Subject: WAP Pictogram Specification as Emoji Source
In-Reply-To:
References:
Message-ID: <28146f12-21df-f7d0-2b15-aa39e1c78eaf@it.aoyama.ac.jp>

On 2017/01/07 08:21, Christoph Päper wrote:
> I just discovered the WAP Pictogram specification (WAP-213-WAPInterPic), last published in April 2001 and updated in November 2001.

> I haven’t found any reference or vendor-specific images, by the way, and if it wasn’t just used as an example domain anyway, pict.com seems now defunct.

Isn't WAP overall pretty much defunct these days?

(Well, many including me predicted as much pretty much when it first showed up.)

Regards, Martin.

From verdy_p at wanadoo.fr Sat Jan 7 05:43:03 2017
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sat, 7 Jan 2017 12:43:03 +0100
Subject: WAP Pictogram Specification as Emoji Source
In-Reply-To: <28146f12-21df-f7d0-2b15-aa39e1c78eaf@it.aoyama.ac.jp>
References: <28146f12-21df-f7d0-2b15-aa39e1c78eaf@it.aoyama.ac.jp>
Message-ID:

Technically it is operational within operators' networks. Old mobile phones still have an advantage that has been completely forgotten with smartphones: their very long battery life. There are still mobile phones sold today that are NOT smartphones, have NO Internet connectivity (only GSM/EDGE and SMS) and will stay charged for about 2 weeks, whereas my smartphone runs out of charge in less than 24 hours (or several times a day). So no complex layered networking protocol stacks, no advanced typography and a minimalist display. WAP is still supported on the EDGE/GPRS interface (also used with the Internet protocol under 2G networks, which work almost everywhere when 3G/4G/5G signals cannot be received).
However, don't expect to use this for feature-rich interaction, including sending cute "WAP pictograms" that these devices will not be able to decipher and render anyway.

I bet that WAP pictograms were an early test specification that was in fact never needed, because the goal for the target audience was better achieved with Internet protocols and encoding standards, and also because no one really wanted to administer a registry for the names (see the death of pict.com: no one paying for it, specification redundant with classic URIs on the web for referencing images) or to standardize the glyphs.

Standards with normalized glyphs and semantics do exist, however, notably for traffic signs (on streets/roads, railways, rivers/canals, seas...), or in various industry standards (including for food, chemical products, or cleaning instructions for textiles, or additional glyphs for recycling, hazards or pollution). We are far from being complete in Unicode there, even though the supporting standards are effective, sometimes even mandatory, and widely used. The problem is that these standards are not necessarily international and are incompatible with each other, yet still regulated and required, and you cannot unify the glyphs specified by one of these standards with those from a competing standard (or with those glyphs already implemented in the UCS). And for now Unicode has resisted the idea of standardizing sets of symbols for specific standards, notably when the glyphs are too strictly defined (not allowing variations/derivations without breaking the intended regulated semantics).

2017-01-07 5:12 GMT+01:00 Martin J. Dürst :

> On 2017/01/07 08:21, Christoph Päper wrote:
>
>> I just discovered the WAP Pictogram specification (WAP-213-WAPInterPic),
>> last published in April 2001 and updated in November 2001.
>>
>
> I haven’t found any reference or vendor-specific images, by the way, and
>> if it wasn’t just used as an example domain anyway, pict.com seems now
>> defunct.
>>
>
> Isn't WAP overall pretty much defunct these days?
>
> (Well, many including me predicted as much pretty much when it first
> showed up.)
>
> Regards, Martin.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From charupdate at orange.fr Sat Jan 7 21:48:04 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Sun, 8 Jan 2017 04:48:04 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <828252581.19365.1483729345941.JavaMail.www@wwinf1p15>
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com> <2039959835.345.1483595818937.JavaMail.www@wwinf1k39> <104200337.5858.1483616029614.JavaMail.www@wwinf1k39> <538089927.246.1483681334517.JavaMail.www@wwinf1p15> <828252581.19365.1483729345941.JavaMail.www@wwinf1p15>
Message-ID: <1112132817.50.1483847284449.JavaMail.www@wwinf1p17>

I’m quickly adding two updates to what has been said previously, while there’s no other ongoing discussion:

# Combining diacritics on modifier letters

I’m surprised to see that combining diacritics are already supported with modifier letters. When I believed in my past mails that they weren’t, I was remembering an example in a thread from last year that didn’t look good, nor does the rendering in my drafts. Now I’ve used the string U+0053 U+1D57 U+1D49 U+0301 ('Sᵗᵉ́') in the subject and in the body of an e-mail, sent and received it, and printed it out to a PDF file. It renders fairly well everywhere except in the subject while composing (too high) and in the subject in the inbox and in the displayed mail (too far). This is clearly a font issue (the same in Chrome and in Firefox, using webmail).
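For anyone who wants to reproduce the test string, here is a small Python sketch that builds the four code points and lists their character names. The helper name `describe` is made up for the example. It also checks that NFC leaves the sequence unchanged, since no precomposed character exists for a modifier letter e with acute; positioning the accent is therefore entirely up to the font.

```python
import unicodedata

# U+0053 U+1D57 U+1D49 U+0301, the string discussed above
SAMPLE = "\u0053\u1d57\u1d49\u0301"


def describe(text):
    """Return one 'U+XXXX NAME' line per code point of text."""
    return ["U+%04X %s" % (ord(ch), unicodedata.name(ch)) for ch in text]


def is_nfc_stable(text):
    """True if NFC normalization does not alter the text."""
    return unicodedata.normalize("NFC", text) == text


if __name__ == "__main__":
    for line in describe(SAMPLE):
        print(line)
    print("unchanged by NFC:", is_nfc_stable(SAMPLE))
```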
When used in business mail, this could be appealing on the one hand (at least
once the buggy fonts have been updated) and convey a connotation of
respectfulness, while on the other hand it could still raise the suspicion of
inefficiency and wasted time, as long as people aren't aware how easy it is
to input and have character pickers in mind. I actually hold two modifiers
down while typing 'te', then hit the acute dead key (in the base shift state)
and the space bar. To some degree it's the same situation as with the letter
'œ', which many people here still type as the ASCII fallback 'oe' (despite
having a shortcut in Word) and which is now coming to the standard keyboard.

# Expected modifier letter small q

Given that the use of modifier letters in place of formatted superscript in
abbreviations in English, French, Italian, Portuguese, Spanish and perhaps
some other languages is a fallback that, for high-end processing, is to be
replaced with formatted baseline letters, one could consider using a fallback
character while waiting for *MODIFIER LETTER SMALL Q to be encoded. The best
approximation seems to be U+1DA3 MODIFIER LETTER SMALL TURNED H. In a draft,
the abbreviation of 'Bibliothèque' [Library] as in [1] would then be spelled
'Biblio???'. That could eventually become a fixed convention supported by the
conversion macros (which target only abbreviations in natural languages, not
phonetics, nor random strings).

Among the fallbacks discussed so far, this last one could be considered a “hack”.
This is why it must not be put in the place of *MODIFIER LETTER SMALL Q on the
keyboard layouts. This allocation should still output a message string such as:
“ ^q_unavailable”, or “ ^q_n’existe_pas” (the maximum number of characters is 16,
conforming to the Windows limitation; on macOS it's practically 20, because with
a few more, TextEdit on Snow Leopard shuts down, whatever "maxout" is set to).

U+1DA3 MODIFIER LETTER SMALL TURNED H can be input by '[superscript]#h', where
'#' (after compose or another dead key) is the (newly defined) composition
character for "turned".

Once the macros are written and available (any help is welcome!), should I
still be flagged down for undisciplined hacks and for endorsing random
suggestions?

Kind regards,

Marcel

[1] Karl Pentzlin's *MODIFIER LETTER SMALL Q proposal:
http://www.unicode.org/L2/L2010/10230-modifier-q.pdf

From charupdate at orange.fr Sun Jan 8 05:43:02 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Sun, 8 Jan 2017 12:43:02 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <1112132817.50.1483847284449.JavaMail.www@wwinf1p17>
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com> <2039959835.345.1483595818937.JavaMail.www@wwinf1k39> <104200337.5858.1483616029614.JavaMail.www@wwinf1k39> <538089927.246.1483681334517.JavaMail.www@wwinf1p15> <828252581.19365.1483729345941.JavaMail.www@wwinf1p15> <1112132817.50.1483847284449.JavaMail.www@wwinf1p17>
Message-ID: <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17>

On Sun, 8 Jan 2017 04:48:04 +0100 (CET), I wrote:

> Among the fallbacks discussed so far, this last one could be considered a “hack”.
> This is why it must not be put in the place of *MODIFIER LETTER SMALL Q on the
> keyboard layouts. This allocation should still output a message string such as:
> “ ^q_unavailable”, or “ ^q_n’existe_pas” (the maximum number of characters is 16,

Please read 'code units' instead of “characters”.

The polysemy of the underscore is another collateral issue. In the strings
above, its current function as a space replacement gets mixed up with its
possible TeX meaning. I wonder whether the LaTeX converter is able to parse
that correctly.

Please note, too, that the leading space and the low lines are intended to
facilitate deleting the message by hitting Ctrl + Backspace.

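The difference between characters and code units can be checked with a small Python sketch. It assumes the limit is counted in UTF-16 code units (the 16 for Windows is the figure given in the message) and that the two message strings read as below, with a leading space and a typographic apostrophe. The function names are made up for illustration.

```python
def utf16_code_units(text: str) -> int:
    """Count the UTF-16 code units needed for text (no byte order mark)."""
    return len(text.encode("utf-16-le")) // 2

def fits(text: str, limit: int = 16) -> bool:
    """True if text stays within the given code-unit limit (16 is the
    Windows figure quoted in the message, not checked against any API)."""
    return utf16_code_units(text) <= limit

# The two message strings as read from the message (an assumption).
messages = [" ^q_unavailable", " ^q_n\u2019existe_pas"]
for msg in messages:
    print(repr(msg), utf16_code_units(msg), fits(msg))

# A character outside the BMP takes two code units, which is why
# 'characters' and 'code units' are not interchangeable here.
print(utf16_code_units("\U0001D49C"))
```

For strings made only of BMP characters, as above, the two counts coincide; they diverge as soon as a supplementary-plane character is part of the output.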
> conforming to the Windows limitation; on macOS it?s practically 20, because with > a few more, TextEdit on Snow Leopard shuts down, whatever "maxout" is set to). When that happened, it was set to a higher value. Then I?ve shortened the output string and set ?maxout="20"?. Anyhow, this is no longer an issue here, and this shortcoming is mentioned only for equality?s sake. Marcel From charupdate at orange.fr Sun Jan 8 09:38:33 2017 From: charupdate at orange.fr (Marcel Schneider) Date: Sun, 8 Jan 2017 16:38:33 +0100 (CET) Subject: WAP Pictogram Specification as Emoji Source In-Reply-To: References: <28146f12-21df-f7d0-2b15-aa39e1c78eaf@it.aoyama.ac.jp> Message-ID: <502664845.9352.1483889914046.JavaMail.www@wwinf1p19> On Sat, 7 Jan 2017 00:21:37 +0100, Christoph P?per wrote: > > I just discovered the WAP Pictogram specification (WAP-213-WAPInterPic), > last published in April 2001 and updated in November 2001. > [?] > [?] it?s obvious that WAP pictograms have been unified with Japanese (i-mode) emojis > upon their encoding in Unicode 6+. However, the mapping is not obvious in all cases > and I think there are some pictograms that have been omitted / forgotten [?] On Sat, 7 Jan 2017 13:12:10 +0900, Martin J. D?rst replied: > > Isn't WAP overall pretty much defunct these days? > > (Well, many including me predicted as much pretty much when it first > showed up.) On Sat, 7 Jan 2017 12:43:03 +0100, Philippe Verdy replied: > > Technically it is is operational within operators. Old mobile phones still > have an advantage that has completely been forgotten with smartphone, it is > their very long battery lifetime, and there are still mobile phones sold > today that are NOT smartphones, have NO Internet connectivity (only > GSM/EDGE and SMS and MMS and WAP ? this seems to be what I have. > ) and that will remain in charge for about 2 weeks, when my > smartphone gets out of charge in less than 24 hours (or several times a day). 
> So no complex layered networking protocol stacks, no advanced typography > and a minimalist display. WAP is still supported on the EDGE/GPRS interface > (used also with the Internet protocol under 2G networks which works almost > everywhere when 3G/4G/5G signals cannot be received). > However don't expect using this for feature rich interaction including for > sending cute "WAP pictograms" that these devices will anyway not be able to > decipher and render. My terminal is able to display colorful pictograms, but I remember that some time ago, the screens were mainly monochrome (and even smaller). > I bet that WAP pictograms was an early specification > for test that was in fact never needed, because the target audience goal > was better achieved with Internet protocols and encoding standards, but > also no one really wanted to administer a registry for the names (see the > death of pict.com : no one paying for it, specification redundant with > classic URIs on the web for referencing images), or standardizing the glyphs. > > The existing standard with normalized glyphs and semantics however exist, > notably for traffic signs (on streets/roads, railways, rivers/canals, > seas...), or in various industry standards (including for food, chemical > products, or cleaning instructions for textiles, or additional glyphs for > recycling, hazards or pollution). There must be several standards in various domains. > We are far from being complete in Unicode > there, even if the supporting standards are effective, sometimes even > mandatory, and very used. The problem for them is that these standards are > not necessarily international, and incompatible with each other but still > regulated and required and you cannot unify the glyphs specified by one of > these standards with those from a competing standard (or with those glyphs > already implemented in the UCS). Yes. 
This has been discussed in 2003: http://unicode.org/mail-arch/unicode-ml/y2003-m06/0274.html and in 2015: http://www.unicode.org/mail-arch/unicode-ml/y2015-m08/0004.html http://www.unicode.org/mail-arch/unicode-ml/y2015-m08/0013.html > And for now Unicode has resisted the idea > of standardizing sets of symbols for specific standards, and notably if the > glyphs are too strictly defined (not allowing variations/derivations > without breaking the intended regulated semantics). That is the point. Such constraints are opposed to the Unicode principles of encoding symbols, Asmus Freytag explained in another context 12 days ago: http://www.unicode.org/mail-arch/unicode-ml/y2016-m12/0115.html Marcel From charupdate at orange.fr Sun Jan 8 15:05:10 2017 From: charupdate at orange.fr (Marcel Schneider) Date: Sun, 8 Jan 2017 22:05:10 +0100 (CET) Subject: Superscript and Subscript Characters in General Use In-Reply-To: <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17> References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com> <2039959835.345.1483595818937.JavaMail.www@wwinf1k39> <104200337.5858.1483616029614.JavaMail.www@wwinf1k39> <538089927.246.1483681334517.JavaMail.www@wwinf1p15> <828252581.19365.1483729345941.JavaMail.www@wwinf1p15> <1112132817.50.1483847284449.JavaMail.www@wwinf1p17> <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17> Message-ID: <652345901.17960.1483909510323.JavaMail.www@wwinf1p19> Is there any other reason not to use superscript characters for abbreviations in plain text? (Except those reasons already discussed, that sum up in the compatibility issue when it comes to rich text.) (Same question for subscript digits for vulgar fractions, that may be considered a kind of abbreviations: - 'Page two of seven' is represented as 'Page 2/7'; - 'Two seventh' is (or should be) represented as '???'.) I?m asking this question now as long as this thread is not closed. 
I'm not in the habit of asking many questions, but perhaps this one I should
have asked.

Marcel

From charupdate at orange.fr Mon Jan 9 06:42:51 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Mon, 9 Jan 2017 13:42:51 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <652345901.17960.1483909510323.JavaMail.www@wwinf1p19>
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com> <2039959835.345.1483595818937.JavaMail.www@wwinf1k39> <104200337.5858.1483616029614.JavaMail.www@wwinf1k39> <538089927.246.1483681334517.JavaMail.www@wwinf1p15> <828252581.19365.1483729345941.JavaMail.www@wwinf1p15> <1112132817.50.1483847284449.JavaMail.www@wwinf1p17> <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17> <652345901.17960.1483909510323.JavaMail.www@wwinf1p19>
Message-ID: <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19>

On Fri, 6 Jan 2017 06:42:14 +0100 (CET), I wrote:

> […] Here, the modifier letters could be a ready-to-use fallback.
> Converting them to formatted baseline letters could be achieved with a macro in VBA.
>
> Couldn't this be included in the next Office version as an out-of-the-box feature?

http://www.unicode.org/mail-arch/unicode-ml/y2017-m01/0036.html

A conversion macro is now ready for Notepad++; it uses regexes and adds TeX
markup. To get around security issues this time, it is attached below. Thanks
to Unicode for forwarding it. The XML file has some explanations in the header
and can be added manually to the user storage file of the software.

Macros for LibreOffice and for Office in VBA are planned, but I cannot
currently write them.

Along with this, I feel compelled to submit three points of detail on the topic:

(1) Interpretation of sequences: The fraction slash is specified to be
interpreted as a sort of format control, and the software is supposed either
to format the fraction, or to generate a linear fallback that in TUS (9.0, §6.2, p.
277) shows up with a character substitution, eventually to emulate a glyph substitution for U+2044 by a glyph similar to U+002F. But the software isn?t meant to perform a glyph substitution as a fallback for another glyph substitution. Is that process conformant to this requirement: ?A process shall not assume that it is required to interpret any particular coded character sequence.? (TUS 9.0, ?3.2, p. 80) Probably I?m missing some clues here. (2) Font conformance: Most fonts seem unhinted so that they cannot substitute numerator and denominator glyphs, and the digits remain normal size. Nevertheless, U+2044 FRACTION SLASH kerns so much that it overlaps many of the adjacent digits, typically 3, 6, 8, 9, 0. Therefore and to get neat fractions, the user may work in rich text and use the generic super-/subscript formatting. The Unicode Core Specification gives hints for implementers to automate this process in another way, while leaving the door open to an unformatted fallback throwout. Some proportional fonts however have an unkerning fraction slash. These inconsistencies in support and display baffle me. Is there any place in the Unicode Standard where the kerning is specified? I believe that there isn?t. So which design decision should be preferred? I think it would be the kerning option. (3) Variation selectors: Today, many characters are given variation selector sequences, so that I believe that the idea could be maintained that letters and digits deserve some information about whether they are a part of abbreviations, or of a vulgar fraction (and then, whether they are numerators or denominators). While the latter can be catered for by the glyph substitution mechanism triggered by the presence of U+2044, the former would require an *ABBREVIATION INDICATOR as it has already been suggested, an invisible formatting control. This however should have been proposed twenty years ago by the mainly concerned communities. 
Adding and implementing it today would perhaps be inefficient, all the more
so as the sequences concerned are mainly found in Latin script, where, thanks
to phonetics, superscript forms are already available.

After having completed this, I can't help wondering about the dynamics that
show up in this and other related threads over the years, and particularly
these past days. While hardly anybody takes offence at the misuse of the
DEGREE SIGN as a kind of superscript 'o', many objections are raised whenever
people dare to grab the small modifier letters on the keyboard and type them
in their text editors, e-mail clients and webmail forms. What is the matter
with this practice? Proper handling of such text files turns out to be quite
easy and straightforward, and round-trip conversion is within reach. For once
here is a draft format that in some circumstances can even display with a
near-final look. For what reason should that be strongly discouraged and
prohibited? This is all the more surprising as it would restore the missing
equality of the world's languages with respect to plain text. Is this still
overstating that principle?

Regards,

Marcel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: shortcuts.xml Type: text/xml Size: 22885 bytes Desc: not available URL: From charupdate at orange.fr Mon Jan 9 15:39:40 2017 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 9 Jan 2017 22:39:40 +0100 (CET) Subject: Superscript and Subscript Characters in General Use In-Reply-To: <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19> References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com> <2039959835.345.1483595818937.JavaMail.www@wwinf1k39> <104200337.5858.1483616029614.JavaMail.www@wwinf1k39> <538089927.246.1483681334517.JavaMail.www@wwinf1p15> <828252581.19365.1483729345941.JavaMail.www@wwinf1p15> <1112132817.50.1483847284449.JavaMail.www@wwinf1p17> <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17> <652345901.17960.1483909510323.JavaMail.www@wwinf1p19> <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19> Message-ID: <572373065.27540.1483997980333.JavaMail.www@wwinf1p25> I?m saddened to have fallen into a monologue. Thus I?ll quickly debrief the arguments opposed so far, to check whether I?m missing some point: ? English ordinals with baseline endings are incorrect too, so that they need formatting, as do French ones: ? English is in the same case as French and a few other languages, that cannot be correctly spelt in plain text without superscript ordinal indicators. Here the modifier small letters can be a ready-to-use, often well-looking fallback. ? Those modifier letters have poor font support, so that the text is messed up: ? Most work fonts do support them. For incomplete (ornamental) fonts, conversion tools will replace the modifier letters with formatted current characters. ? The modifier letters don?t currently match superscripting styles, nor do superscript and subscript digits match fraction styling in most fonts: ? For high-end processing, the text is converted to legacy presentation. 
Fraction styling is anyway missing in current software, while the superscript and subscript formatting doesn?t match true vulgar fractions neither, though nobody seems to care among the implementers. ? Implementers do hard work to provide fraction styling, so that they mustn?t be bothered with alternate characters to support, as super/sub scripts: ? This additional support is very easy to implement, as it typically needs no more than a set of equivalence classes. ? Character pickers add other problems with font bindings, when people use them instead of looking for an appropriate keyboard layout: ? If the goal is to correctly spell all natural languages in plain text, the character availability is ideally completed with updated keyboard layouts for input. So far by memory. Going through the archives: ? Plain text is often unable to express stress or other aspects of the information that is a part of the content: ? The issue is only about correctly spelling all languages in plain text. Superscripting of abbreviation endings (and of numerators) belongs to another level of correctness than arbitrary stress and other postiche complements. ? Unicode considers superscripting for the representation of natural languages as out of scope: ? Whenever superscripting is required for the correct spelling and unambiguous representation of natural languages, this requirement should be relaxed, as it is for a set of technical notations. ? ?As long as French is ordinary text, the abbreviations require styled (rich) text.? ? No human language can be dismissed to rich text for its orthographically correct representation, without infringing the design principles of Unicode. ? Baseline fallbacks are unambiguous for all French abbreviations, at least in context: ? True. Some other scripts provide much less written information and leave more ambiguities. But this is intrinsic, not by lack of character encoding. 
Wherever baseline fallbacks are considered incorrect, or not ?pure?, superscripts must be provided in plain text, at least as an unambiguous fallback. ? Other means are available to unambiguously represent abbreviations, and they can be written out: ? Every traditional spelling must be supported in Unicode. ? Some French and Spanish abbreviation endings include accented letters, that aren?t a part of the limited set of modifier letters: ? Following the Unicode design principles, a complete base letter alphabet suffices since combining diacritics can be added. In practice, these diacritics appear to sometimes interact well even with superscript base letters. Where they don?t yet, it?s a matter of updating the fonts, or alternately of falling back to legacy processing (after using a macro, a plugin, or another tool to convert the text to legacy representation). ? Adding *MODIFIER LETTER SMALL Q for use as a superscript in natural languages would bring up the need to provide the same facilities to all other scripts, for equality?s sake: ? Latin script seems to be the only one that uses superscript in current text. If some languages using other scripts still cannot be orthographically spelt in plain text, it?s up to work out the corresponding proposals filling the gaps. ? Superscript abbreviations in natural languages must use generic formatting features, so as they are used for ?footnotes, mathematical and chemical expressions, and the like?: ? These three domains are of another level of complexity, and thus cannot be compared with ordinary text. On the other hand, the use of formatting for orthographic superscripts in ordinary text should be considered a legacy fallback, not a standard way of writing natural languages. ? Vulgar fractions made of super- and subscripts are not machine-readable and cannot be parsed correctly without any not yet available convention, somewhat like arbitrary emoji or ASCII art: ? 
They have a compatibility mapping to ASCII digits, and Unicode has taken care to prevent misparsing. ? Using super and sub scripts in abbreviations and fractions is bad practice, a sort of random suggestion: ? It can be tagged as bad (though not really ?random?) practice because TUS does not specify it (while not discouraging it neither). To make it good practice, referencing it in the standard as an alternate representation would suffice. (Cf. above, again: ? Implementers do hard work to provide fraction styling, so that they mustn?t be bothered with alternate characters to support, as super/sub scripts: ? This additional support is very easy to implement, as it typically needs no more than a set of equivalence classes.) ? Using those modifier letters and super/sub scripts in that contexts is an undisciplined hack: ? The idea that Unicode characters are only to be used with a specific, conventional meaning is considered a misperception of the Standard. Flagging character re-use as a hack builds a severe limitation to the principle of character polysemics. This is the start-point from where we need to investigate on who is enforcing this kind of discipline, who is interested in restricting the use of the discussed characters to keep (and even, to draw) people away from using them as long as they display well, and establishing new practice-proof usage protocols, including gateways to legacy protocols. Hopefully, Marcel From richard.wordingham at ntlworld.com Mon Jan 9 16:24:14 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 9 Jan 2017 22:24:14 +0000 Subject: Specification of Encoding of Plain Text Message-ID: <20170109222414.72f83204@JRWUBU2> Where, if anywhere, is the encoding of plain text specified? I am particularly concerned with the arrangement of the code sequences for non-spacing abstract characters once one has determined an encoding for the abstract characters. 
For example, a naive reading of TUS 9.0 Section 16.4 Subsection "Ordering of Syllable Components" would lead one to believe that the word _khnyom_ 'I' shall be encoded as . However, on further investigation, I cannot find any text that says that would not be compliant with the Unicode standard. Have I missed anything? One might hope that the subsection about 'logical order' in TUS 9.0 Section 2.2 Unicode Design Principles would help, but: 1) Section 3 'Conformance' says nothing about logical order; and 2) The subsection about 'logical order' seems to assume that there exists a common practice; it does not actually place any requirement on this common practice. Richard. From asmusf at ix.netcom.com Mon Jan 9 16:34:17 2017 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Mon, 9 Jan 2017 14:34:17 -0800 Subject: Superscript and Subscript Characters in General Use In-Reply-To: <572373065.27540.1483997980333.JavaMail.www@wwinf1p25> References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com> <2039959835.345.1483595818937.JavaMail.www@wwinf1k39> <104200337.5858.1483616029614.JavaMail.www@wwinf1k39> <538089927.246.1483681334517.JavaMail.www@wwinf1p15> <828252581.19365.1483729345941.JavaMail.www@wwinf1p15> <1112132817.50.1483847284449.JavaMail.www@wwinf1p17> <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17> <652345901.17960.1483909510323.JavaMail.www@wwinf1p19> <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19> <572373065.27540.1483997980333.JavaMail.www@wwinf1p25> Message-ID: An HTML attachment was scrubbed... URL: From asmusf at ix.netcom.com Tue Jan 10 02:06:05 2017 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Tue, 10 Jan 2017 00:06:05 -0800 Subject: Specification of Encoding of Plain Text In-Reply-To: <20170109222414.72f83204@JRWUBU2> References: <20170109222414.72f83204@JRWUBU2> Message-ID: An HTML attachment was scrubbed... 
URL:

From mark at macchiato.com Tue Jan 10 03:11:41 2017
From: mark at macchiato.com (Mark Davis ☕️)
Date: Tue, 10 Jan 2017 10:11:41 +0100
Subject: Specification of Encoding of Plain Text
In-Reply-To:
References: <20170109222414.72f83204@JRWUBU2>
Message-ID:

What I really wish we had would be a machine readable set of regexes for each
complex script (and for each language-script combination that is different
than the default for that script). Such a regex R could be used for
determining the well-formed ordering of code points within words.

The regex need not be for syllables, or grapheme clusters, or any other
formal construct. The *only* requirement it would need to fulfill is that you
could determine well-formed words with:

word := (R)+

That is, if R were (C V C? | V C?) then any of

CVC
CVCVC
VC
V
CV

would pass the test, but CCV would fail. Ideally R would be as simple as
possible (but no simpler).

Mark

On Tue, Jan 10, 2017 at 9:06 AM, Asmus Freytag wrote:

> On 1/9/2017 2:24 PM, Richard Wordingham wrote:
>
> Where, if anywhere, is the encoding of plain text specified? I am
> particularly concerned with the arrangement of the code sequences for
> non-spacing abstract characters once one has determined an encoding for
> the abstract characters.
>
> For example, a naive reading of TUS 9.0 Section 16.4 Subsection
> "Ordering of Syllable Components" would lead one to believe that the
> word _khnyom_ 'I' shall be encoded as U+17D2 KHMER SIGN COENG, U+1789 KHMER LETTER NYO, U+17BB KHMER VOWEL
> SIGN U, U+17C6 KHMER SIGN NIKAHIT>.
>
> Richard,
>
> the group of Khmer experts that developed the recent label generation
> rules for root zone domain names considers that ordering the only one
> supported, a specification you find here:
> https://www.icann.org/en/system/files/files/proposal-khmer-lgr-15aug16-en.pdf
>
> That document states:
>
> *7.4 Context of COENG Sign (U+17D2)*
> The sign ?
KHMER SIGN COENG (U+17D2) used for subscripting consonants must > occur between two consonants. If it occurs between any other categories, it > is not in a valid context so the label is not well formed. Further, the > consonant following it must not include ? KHMER LETTER LA (U+17A1), ... > > So, you are not alone in thinking that the COENG goes between consonants. > > Did they just make this up? No, they followed what is laid out in the > standard: > > Page 621 in Unicode 9.0.0, you find (http://www.unicode.org/ > versions/Unicode9.0.0/ch16.pdf) > > *Subscript Consonants.* Subscript consonant signs differ from independent > consonant > characters and are called coeng (literally, ?foot, leg?) after their > subscript position. While a > consonant character can constitute an orthographic syllable by itself, a > subscript consonant > sign cannot. Note that U+17A1 C khmer letter la does not have a > corresponding subscript > consonant sign in standard Khmer.... Subscript consonant signs are used to > represent any > consonant following the first consonant in an orthographic syllable. > > and on page 624: > > .... each of these [subscript consonant] signs is represented by the > sequence of two characters: a > special control character (U+17D2 khmer sign coeng) and a corresponding > consonant > character. > > That text fixes the order MAIN CONSONANT + COENG OPERATOR + SUBSCRIPT > CONSONANT > with suffficient clarity (as do all the examples and tables). > > > However, on further investigation, > I cannot find any text that says that U+17BB> would not be compliant with the Unicode standard. Have I > missed anything? > > > In this example, your coeng operator U+17D2 is out of order, while it is > followed by > a consonant, it does not in turn immediately follow the main consonant, > because a > sign NIKAHIT is inserted in your example. > > Again, from the Root Zone LGR document we find an explicit rule: > > *7.10 Context of NIKAHIT SIGN (U+17C6)* > The sign ?? 
KHMER SIGN NIKAHIT (U+17C6) can only be preceded by a > consonant or a shifter or one of the subset of dependent vowels tagged > ?dependent-vowel-1? in the repertoire table (? ??), i.e. vowel signs AA and > U. > > That would allow the NIKAHIT to be placed where you suggest, if it were > not for the > rule on the coeng operator (7.4). > > Now, it is a known fact that the label generation rules are slightly more > restrictive than the rules for general text. (See also section 5 in that > document). > > See the text on p. 622 in TUS 9.0.0 where the following *exception* is > noted: > > "The subscript consonant signs in the Khmer script can be used to denote a > final consonant, > although this practice is uncommon." > > The associated example shows MAIN CONSONANT + VOWEL + NIKHAHIT + COENG + > FINAL CONSONANT > > Another exception that is noted on p. 623 is the following: > > "While these subscript consonant signs are usually attached to a consonant > character, they > can also be attached to an independent vowel character. Although this > practice is relatively > rare, it is used in one very common word, meaning ?to give.?" > > Taken together, it would appear that, unless your example fits the first > of these two exceptions, > the NIKAHIT in it is out of order. > > (The label generation rules disallow both of these exceptions, > in an attempt to streamline the rules, sacrificing a number of potential > domain names. Equivelant > rule sets for validating text would have to be more complete). > > One might hope that the subsection about 'logical order' in TUS 9.0 > Section 2.2 Unicode Design Principles would help, but: > > 1) Section 3 'Conformance' says nothing about logical order; and > 2) The subsection about 'logical order' seems to assume that there > exists a common practice; it does not actually place any requirement > on this common practice. > > Richard. 
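Mark Davis's suggestion at the top of this message, a per-script regex R with word := (R)+, can be tried out on his own toy example. The following Python sketch treats the letters C and V as literal placeholders for character classes, exactly as in his message. It is not a Khmer rule set, and the names `R`, `WORD` and `is_well_formed` are made up for illustration.

```python
import re

# R = (C V C? | V C?) from the message; a word is one or more repetitions of R.
R = r"(?:CVC?|VC?)"
WORD = re.compile(rf"(?:{R})+")

def is_well_formed(word: str) -> bool:
    """True if the whole string is a sequence of one or more R units."""
    return WORD.fullmatch(word) is not None

for candidate in ["CVC", "CVCVC", "VC", "V", "CV", "CCV"]:
    print(candidate, is_well_formed(candidate))
```

A real rule set would first map each code point to a class (consonant, coeng, dependent vowel, sign and so on) and would need to cover the exceptions cited from TUS 9.0 above; that mapping is deliberately left out here.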
> > > > I don't think either of these general sections are intended to provide the > correct > or expected ordering of characters for complex scripts. Any preferred > ordering that > doesn't result by happenstance from normalization would presumably be > describe > in the text of the scrip section, such as Section 16.4 Khmer, in TUS 9.0.0. > > http://www.unicode.org/versions/Unicode9.0.0/ch16.pdf > > A./ > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From alastair at alastairs-place.net Tue Jan 10 05:03:24 2017 From: alastair at alastairs-place.net (Alastair Houghton) Date: Tue, 10 Jan 2017 11:03:24 +0000 Subject: Superscript and Subscript Characters in General Use In-Reply-To: References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com> <2039959835.345.1483595818937.JavaMail.www@wwinf1k39> <104200337.5858.1483616029614.JavaMail.www@wwinf1k39> <538089927.246.1483681334517.JavaMail.www@wwinf1p15> <828252581.19365.1483729345941.JavaMail.www@wwinf1p15> <1112132817.50.1483847284449.JavaMail.www@wwinf1p17> <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17> <652345901.17960.1483909510323.JavaMail.www@wwinf1p19> <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19> <572373065.27540.1483997980333.JavaMail.www@wwinf1p25> Message-ID: On 9 Jan 2017, at 22:34, Asmus Freytag wrote: > > On 1/9/2017 1:39 PM, Marcel Schneider wrote: >> I?m saddened to have fallen into a monologue. Thus I?ll quickly debrief >> the arguments opposed so far, to check whether I?m missing some point >> > There's a good reason for that. You are advocating something that everyone else > accepts as going against a settled principle of the standard, That?s part of it, but I think also that the thread is increasingly verbose and hard to follow. I still think that the idea of adding U+???? SUPERSCRIPT and U+???? 
SUBSCRIPT might be worth contemplating; it would seem to provide a good answer to both Marcel's and the French standards body's concerns (wrt their proposed new ordinal indicator) while only using up two code points, and it'd be much easier to explain to people that superscripts and subscripts were a presentational matter and that they needed to talk to their font supplier. Plus it would work with existing platform rendering engines provided a font with an appropriate OpenType GSUB table was available.

Does anyone besides Marcel have any input on that idea? Is it worth writing a proposal to add SUPERSCRIPT and SUBSCRIPT? To give some examples:

  S^{té}

    U+0053 LATIN CAPITAL LETTER S
    U+0074 LATIN SMALL LETTER T
    U+???? SUPERSCRIPT
    U+0065 LATIN SMALL LETTER E
    U+???? SUPERSCRIPT
    U+0301 COMBINING ACUTE ACCENT

  i_{j}

    U+0069 LATIN SMALL LETTER I
    U+006A LATIN SMALL LETTER J
    U+???? SUBSCRIPT

Perhaps the code points U+209E and U+209F could be used for SUBSCRIPT and SUPERSCRIPT respectively?

Are there other things that should be considered? These are not supposed to be a replacement for rich text, which after all would allow nesting and indeed non-character data in subscripts and superscripts, but more as a way to avoid requests to add further superscript and subscript characters to Unicode itself and for limited use in "plain text"-only contexts (Twitter, for instance).

Kind regards,

Alastair.
--
http://alastairs-place.net

From frederic.grosshans at gmail.com Tue Jan 10 08:40:39 2017
From: frederic.grosshans at gmail.com (Frédéric Grosshans)
Date: Tue, 10 Jan 2017 15:40:39 +0100
Subject: Superscript and Subscript Characters in General Use
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com>
 <2039959835.345.1483595818937.JavaMail.www@wwinf1k39>
 <104200337.5858.1483616029614.JavaMail.www@wwinf1k39>
 <538089927.246.1483681334517.JavaMail.www@wwinf1p15>
 <828252581.19365.1483729345941.JavaMail.www@wwinf1p15>
 <1112132817.50.1483847284449.JavaMail.www@wwinf1p17>
 <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17>
 <652345901.17960.1483909510323.JavaMail.www@wwinf1p19>
 <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19>
 <572373065.27540.1483997980333.JavaMail.www@wwinf1p25>
Message-ID: <7b699c09-2f89-decc-caa4-9a62e8c9876f@gmail.com>

On 10/01/2017 at 12:03, Alastair Houghton wrote:
> That's part of it, but I think also that the thread is increasingly verbose and hard to follow.
>
> I still think that the idea of adding U+???? SUPERSCRIPT and U+???? SUBSCRIPT might be worth contemplating; it would seem to provide a good answer to both Marcel's and the French standards body's concerns (wrt their proposed new ordinal indicator) while only using up two code points, and it'd be much easier to explain to people that superscripts and subscripts were a presentational matter and that they needed to talk to their font supplier. Plus it would work with existing platform rendering engines provided a font with an appropriate OpenType GSUB table was available.
>
> Does anyone besides Marcel have any input on that idea? Is it worth writing a proposal to add SUPERSCRIPT and SUBSCRIPT?

No! Long story short: encoding the {super,sub}script characters one by one in Unicode is a choice that was made more than two decades ago, and it is much too late to change this, even if it were a good idea.
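To make the "one by one" encoding concrete, here is a minimal sketch using Python's standard `unicodedata` module. It only inspects an already encoded character, U+00B3 SUPERSCRIPT THREE, and contrasts it with an accented letter; the variable name `sup3` is just an illustration, and the results depend on the Unicode data shipped with the Python build in use.

```python
import unicodedata

# U+00B3 SUPERSCRIPT THREE is one of the individually encoded superscripts.
sup3 = "\u00b3"

# Its compatibility decomposition carries a <super> tag plus the base digit.
print(unicodedata.name(sup3))           # SUPERSCRIPT THREE
print(unicodedata.decomposition(sup3))  # <super> 0033

# Canonical normalization (NFC/NFD) leaves the character untouched ...
assert unicodedata.normalize("NFC", sup3) == sup3
assert unicodedata.normalize("NFD", sup3) == sup3

# ... while compatibility normalization (NFKC) folds it to a plain "3".
assert unicodedata.normalize("NFKC", sup3) == "3"

# For comparison: an accented letter has a canonical decomposition,
# so NFC and NFD convert between its two spellings.
assert unicodedata.normalize("NFD", "\u00e9") == "e\u0301"
assert unicodedata.normalize("NFC", "e\u0301") == "\u00e9"
```

The point of the comparison is that the existing superscripts are tied to their base characters only by compatibility mappings, whereas precomposed accented letters are tied to their sequences by canonical equivalence.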
One of the major problems of such a proposition is that it would be incompatible (or ambiguous) with earlier versions of Unicode, since the same character, let's say "³", could be encoded in two different manners: SUPERSCRIPT + U+0033 DIGIT THREE vs the current U+00B3 SUPERSCRIPT THREE, and such things are a big no-no. It was problematic with accented characters and led to the definition of NFC / NFD normalization with strict stability policies enforced since the 1990s.

If you managed to convince the Unicode committee that such an encoding would fit the plain-text model (good luck with that), without removing all the previously encoded superscript/modifier letters (it's forbidden), you would need to define what happens in the various normalization forms NFC / NFD, and probably introduce a new one (NFE? E for exponent), which would be, to say the least, a huge architectural change of the Unicode model, for a modest gain if any.

From asmusf at ix.netcom.com Tue Jan 10 10:59:37 2017
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Tue, 10 Jan 2017 08:59:37 -0800
Subject: Specification of Encoding of Plain Text
References: <20170109222414.72f83204@JRWUBU2>

An HTML attachment was scrubbed...

From richard.wordingham at ntlworld.com Tue Jan 10 13:40:13 2017
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 10 Jan 2017 19:40:13 +0000
Subject: Specification of Encoding of Plain Text
References: <20170109222414.72f83204@JRWUBU2>
Message-ID: <20170110194013.0476f15f@JRWUBU2>

On Tue, 10 Jan 2017 10:11:41 +0100
Mark Davis wrote:

> What I really wish we had would be a machine readable set of regexes
> for each complex script (and for each language-script combination
> that is different than the default for that script).

What would the status of these regexes be?
For example, the Khmer script already has a regex for words sensu stricto, but there doesn't seem to be any formal requirement to conform to it or, more immediately usefully to users, attempt to support it if one claims to support Khmer.

I like the idea, but it seems to have a lot of nits, which I shall now pick.

The regexes should also be human-comprehensible.

I'm dubious of the concept of each language-script combination potentially having a regex, or indeed of the script having a *default* regex. Would this be used to do the equivalent of saying that English doesn't have the letter thorn, or, for example, prohibiting most complex onsets from Lao?

> Such a regex R could be used for determining the well-formed ordering
> of code points within words. The regex need not be for syllables, or
> grapheme clusters, or any other formal construct. The *only*
> requirement it would need to fulfill is that you could determine
> well-formed words with:

> word := (R)+

> That is, if R were (C V C? | V C?) then any of CVC CVCVC VC V CV
> would pass the test, but CCV would fail. Ideally R would be as simple
> as possible (but no simpler).

Several Indian languages only allow independent vowels word initially. You wouldn't be able to capture that with (R)+.

Would the regexes be on strings or on traces (strings modulo canonical equivalence)? The language recognised by the regex for the Universal Shaping Engine (USE) is notoriously not closed under canonical equivalence.

Most non-spacing marks should not occur double - though I think the most significant trouble with them is with fonts that won't then show them double. Barring them could make for a tricky regex. But, if we applied that to the Latin script, should we allow f̂̂ (the Fourier transform of the Fourier transform of f) as a word? Tibetan allows some non-spacing marks to occur triple.

Richard.
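The quoted `word := (R)+` test can be tried out directly. The following is only a sketch over the abstract alphabet {C, V} used in the quoted example, not over real code points; the names `R`, `WORD` and `is_well_formed` are made up for the illustration.

```python
import re

# R = (C V C? | V C?), written over the abstract alphabet {C, V}.
R = r"(?:CVC?|VC?)"

# word := (R)+ ; fullmatch() requires the whole string to be covered.
WORD = re.compile(rf"(?:{R})+")

def is_well_formed(pattern: str) -> bool:
    """Return True if the C/V pattern is one or more repetitions of R."""
    return WORD.fullmatch(pattern) is not None

for p in ["CVC", "CVCVC", "VC", "V", "CV", "CCV"]:
    print(p, is_well_formed(p))
```

The first five patterns are accepted and CCV is rejected, as stated in the quoted message. A constraint that depends on position within the word, such as the word-initial independent vowels mentioned above, would need a pattern over the whole word instead of a single repeated unit.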
From richard.wordingham at ntlworld.com Tue Jan 10 14:44:30 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 10 Jan 2017 20:44:30 +0000 Subject: Specification of Encoding of Plain Text In-Reply-To: References: <20170109222414.72f83204@JRWUBU2> Message-ID: <20170110204430.6e580f72@JRWUBU2> On Tue, 10 Jan 2017 00:06:05 -0800 Asmus Freytag wrote: > On 1/9/2017 2:24 PM, Richard Wordingham wrote: I'll take your last point first. >> One might hope that the subsection about 'logical order' in TUS 9.0 >> Section 2.2 Unicode Design Principles would help, but: >> 1) Section 3 'Conformance' says nothing about logical order; and >> 2) The subsection about 'logical order' seems to assume that there >> exists a common practice; it does not actually place any requirement >> on this common practice. > I don't think either of these general sections are intended to > provide the correct or expected ordering of characters for complex > scripts. Any preferred ordering that doesn't result by happenstance > from normalization would presumably be describe in the text of the > scrip section, such as Section 16.4 Khmer, in TUS 9.0.0. The key word here is 'preferred'. Your reply, while not completely clear, confirms my view that Unicode does not *specify* an overall character ordering for Khmer, despite the section's having a BNF regexp for Khmer syllables - B{R|C}{S{R}}*{{Z}V}{O}{S}. >> For example, a naive reading of TUS 9.0 Section 16.4 Subsection >> "Ordering of Syllable Components" would lead one to believe that the >> word _khnyom_ 'I' shall be encoded as > U+17D2 KHMER SIGN COENG, U+1789 KHMER LETTER NYO, U+17BB KHMER VOWEL >> SIGN U, U+17C6 KHMER SIGN NIKAHIT>. > Richard, > the group of Khmer experts that developed the recent label generation > rules for root zone domain names considers that ordering the only one > supported,? 
a specification you find here: > https://www.icann.org/en/system/files/files/proposal-khmer-lgr-15aug16-en.pdf But as you acknowledge, the specification only covers a strict subset of legitimate Khmer script text, even of text composed of encoded Khmer characters. It excludes some text given in TUS Section 16.4. Indeed, Section 7.4 of the proposal to ICANN even excludes the new spelling of the word ??? (ooy, give) - , for U+17B1 is not a consonant! I have ignored the logical gaps in your reply; nothing in the *Unicode* standard prohibits or deprecates the sequence , even though it does not satisfy the regexp I quoted above. >> So, you are not alone in thinking that the COENG goes between >> consonants.? I do not support the heresy that COENG may only occur between consonants. I do wonder if the Khmer Generation Panel opened their Pali grammars. How would they propose to write the accusative singular of nouns in -i? The accusative singular of non-neuter nouns ends in -i?, which I would expect to be written , which is what I perceive at the end of a line in the middle of the second left-hand page at http://watkhemararatanaram.org/tipitaka/viney_beidok_05b.php . Do they expect one to use U+17B9 KHMER VOWEL SIGN Y? (Thai scholars once had to resort to such an expedient.) Richard. 
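As a side check on the remark that a preferred ordering does not fall out of normalization, the sketch below looks at the four code points named in this thread (U+17D2, U+1789, U+17BB, U+17C6) with Python's `unicodedata`. The two orderings compared are hypothetical test strings chosen for the illustration, not the sequences under dispute, and the helper names are invented here.

```python
import unicodedata

# Code points named in the discussion above.
COENG, NYO, VOWEL_U, NIKAHIT = "\u17d2", "\u1789", "\u17bb", "\u17c6"

def describe(text):
    """List (code point, name, canonical combining class) per character."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c), unicodedata.combining(c))
            for c in text]

def survives_normalization(text):
    """True if neither NFC nor NFD changes the sequence."""
    return all(unicodedata.normalize(form, text) == text
               for form in ("NFC", "NFD"))

# Two hypothetical orderings of the vowel sign and NIKAHIT after COENG + NYO.
order_a = COENG + NYO + VOWEL_U + NIKAHIT
order_b = COENG + NYO + NIKAHIT + VOWEL_U

for row in describe(order_a):
    print(row)
print(survives_normalization(order_a), survives_normalization(order_b))
```

Both orderings pass through NFC and NFD unchanged, because the vowel sign and NIKAHIT both have canonical combining class 0. Normalization therefore cannot choose between them, and any preference has to come from the script description or from an external rule set such as the label generation rules.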
From richard.wordingham at ntlworld.com Tue Jan 10 14:51:21 2017
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 10 Jan 2017 20:51:21 +0000
Subject: Superscript and Subscript Characters in General Use
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com>
 <2039959835.345.1483595818937.JavaMail.www@wwinf1k39>
 <104200337.5858.1483616029614.JavaMail.www@wwinf1k39>
 <538089927.246.1483681334517.JavaMail.www@wwinf1p15>
 <828252581.19365.1483729345941.JavaMail.www@wwinf1p15>
 <1112132817.50.1483847284449.JavaMail.www@wwinf1p17>
 <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17>
 <652345901.17960.1483909510323.JavaMail.www@wwinf1p19>
 <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19>
 <572373065.27540.1483997980333.JavaMail.www@wwinf1p25>
Message-ID: <20170110205121.57634d57@JRWUBU2>

On Tue, 10 Jan 2017 11:03:24 +0000
Alastair Houghton wrote:

> Does anyone besides Marcel have any input on that idea? Is it worth
> writing a proposal to add SUPERSCRIPT and SUBSCRIPT? To give some
> examples:
>
> S^{té}
>
> U+0053 LATIN CAPITAL LETTER S
> U+0074 LATIN SMALL LETTER T
> U+???? SUPERSCRIPT
> U+0065 LATIN SMALL LETTER E
> U+???? SUPERSCRIPT
> U+0301 COMBINING ACUTE ACCENT
>
> i_{j}
>
> U+0069 LATIN SMALL LETTER I
> U+006A LATIN SMALL LETTER J
> U+???? SUBSCRIPT
>
> Perhaps the code points U+209E and U+209F could be used for SUBSCRIPT
> and SUPERSCRIPT respectively?

I would suggest using a pair of variation selectors instead. It's no messier than ideographic compatibility characters, and I think it is actually less messy. However, I would further suggest creating the variation sequences only when the corresponding superscript or subscript form does not exist.

Richard.
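Whether "the corresponding superscript form" already exists can be approximated from the character database. The sketch below is an assumption about how one might check this, not part of any proposal: it collects every character whose compatibility decomposition is `<super>` followed by exactly one code point. The helper names are invented, and the result depends on the Unicode version bundled with the Python build.

```python
import sys
import unicodedata
from collections import defaultdict

def superscript_index():
    """Map each base character to the encoded characters whose compatibility
    decomposition is '<super>' followed by exactly that one character."""
    index = defaultdict(list)
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        parts = unicodedata.decomposition(ch).split()
        if len(parts) == 2 and parts[0] == "<super>":
            index[chr(int(parts[1], 16))].append(ch)
    return index

SUPERS = superscript_index()

def has_encoded_superscript(base):
    """True if some existing character decomposes to <super> + base."""
    return base in SUPERS

for base in "3n":
    print(base, [f"U+{ord(c):04X}" for c in SUPERS[base]])
```

Under the suggestion above, a variation sequence would only be considered for base characters where `has_encoded_superscript` returns False. Swapping `<super>` for `<sub>` gives the corresponding check for subscripts.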
From asmusf at ix.netcom.com Tue Jan 10 15:12:47 2017 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Tue, 10 Jan 2017 13:12:47 -0800 Subject: Specification of Encoding of Plain Text In-Reply-To: <20170110204430.6e580f72@JRWUBU2> References: <20170109222414.72f83204@JRWUBU2> <20170110204430.6e580f72@JRWUBU2> Message-ID: <7c945443-1d67-e4df-c475-b5ba3b5bc342@ix.netcom.com> On 1/10/2017 12:44 PM, Richard Wordingham wrote: > On Tue, 10 Jan 2017 00:06:05 -0800 > Asmus Freytag wrote: >> On 1/9/2017 2:24 PM, Richard Wordingham wrote: > I'll take your last point first. > >>> One might hope that the subsection about 'logical order' in TUS 9.0 >>> Section 2.2 Unicode Design Principles would help, but: > >>> 1) Section 3 'Conformance' says nothing about logical order; and >>> 2) The subsection about 'logical order' seems to assume that there >>> exists a common practice; it does not actually place any requirement >>> on this common practice. > >> I don't think either of these general sections are intended to >> provide the correct or expected ordering of characters for complex >> scripts. Any preferred ordering that doesn't result by happenstance >> from normalization would presumably be describe in the text of the >> scrip section, such as Section 16.4 Khmer, in TUS 9.0.0. > The key word here is 'preferred'. Your reply, while not completely > clear, confirms my view that Unicode does not *specify* an overall > character ordering for Khmer, despite the section's having a BNF regexp > for Khmer syllables - B{R|C}{S{R}}*{{Z}V}{O}{S}. You are possibly misreading my use of the word "preferred". > >>> For example, a naive reading of TUS 9.0 Section 16.4 Subsection >>> "Ordering of Syllable Components" would lead one to believe that the >>> word _khnyom_ 'I' shall be encoded as >> U+17D2 KHMER SIGN COENG, U+1789 KHMER LETTER NYO, U+17BB KHMER VOWEL >>> SIGN U, U+17C6 KHMER SIGN NIKAHIT>. 
>> Richard, >> the group of Khmer experts that developed the recent label generation >> rules for root zone domain names considers that ordering the only one >> supported, a specification you find here: >> https://www.icann.org/en/system/files/files/proposal-khmer-lgr-15aug16-en.pdf > But as you acknowledge, the specification only covers a strict subset of > legitimate Khmer script text, even of text composed of encoded Khmer > characters. The advantage of the text I brought to your attention is the way it is formalized and that it was created with local expertise. The disadvantage from your perspective is that the scope does not match with your intended use case. > It excludes some text given in TUS Section 16.4. Indeed, > Section 7.4 of the proposal to ICANN even excludes the new spelling of > the word ??? (ooy, give) - U+17D2 KHMER SIGN COENG, U+1799 KHMER LETTER YO>, for U+17B1 is not a > consonant! > > I have ignored the logical gaps in your reply; nothing in the *Unicode* > standard prohibits or deprecates the sequence U+1789, U+17BB>, even though it does not satisfy the regexp I quoted > above. Unicode clearly doesn't forbid most sequences in complex scripts, even if they cannot be expected to render properly and otherwise would stump the native reader. However, the descriptions are reasonably detailed to let you find out whether you are using characters as intended, or not. > >>> So, you are not alone in thinking that the COENG goes between >>> consonants. > I do not support the heresy that COENG may only occur between > consonants. Remember, I gave you the scope for that. Your example is well taken, but from a different scope, where explicitly accounting for some other sequences is necessary. No disagreement. A./ > > I do wonder if the Khmer Generation Panel opened their Pali grammars. > How would they propose to write the accusative singular of nouns in > -i? 
The accusative singular of non-neuter nouns ends in -i?, which I > would expect to be written NIKAHIT>, which is what I perceive at the end of a line in the middle > of the second left-hand page at > http://watkhemararatanaram.org/tipitaka/viney_beidok_05b.php . Do they > expect one to use U+17B9 KHMER VOWEL SIGN Y? (Thai scholars once had > to resort to such an expedient.) > > Richard. > > From richard.wordingham at ntlworld.com Tue Jan 10 16:54:57 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 10 Jan 2017 22:54:57 +0000 Subject: Specification of Encoding of Plain Text In-Reply-To: <7c945443-1d67-e4df-c475-b5ba3b5bc342@ix.netcom.com> References: <20170109222414.72f83204@JRWUBU2> <20170110204430.6e580f72@JRWUBU2> <7c945443-1d67-e4df-c475-b5ba3b5bc342@ix.netcom.com> Message-ID: <20170110225457.56e581ca@JRWUBU2> On Tue, 10 Jan 2017 13:12:47 -0800 Asmus Freytag wrote: > Unicode clearly doesn't forbid most sequences in complex scripts, > even if they cannot be expected to render properly and otherwise > would stump the native reader. Is this expectation based on sequence enforcement in the renderer? The main problem with getting text to render reasonably (not necessarily as desired) is now anti-phishing. The Unicode standard does define what short sequences of characters mean. The problem is that then, outside the Apple world, it seems to be left to Microsoft to decide what longer sequences they will allow. > The advantage of the text I brought to your attention is the way it > is formalized and that it was created with local expertise. The > disadvantage from your perspective is that the scope does not match > with your intended use case. Perhaps ICANN will be the industry-wide definer. However, to stay with Indic rendering, one may have cases where CVC and CCV orthographic syllables have little to no visible difference. The Khmer writing system once made much greater use of CVC syllables. 
For reproducing older texts, one might be forced to encode phonetic CVC as though it were CCV. This is already the case, through error rather than design, with the Thai script in Tai Tham. This affects about 30% of the Northern Thai lexicon*, and I believe even a higher proportion when adjusted for word frequency. Now, to fight phishing, I have always believed that some brutal folding would be required for Tai Tham, which is why I suggested that the S.SA ligature be encoded (U+1A54 TAI THAM LETTER GREAT SA). *I've sampled the MFL dictionary. I suspect a bias to untruncated forms in loans from Pali, such as _kathina_ rather than _kathin_. If my suspicion is correct, the proportion would be even higher. However, I believe there is some advantage in distinguishing CVC and CCV at the code level, even where there is no visual difference. To display small visual differences, perhaps we will be forced to beg for mark-up to make the distinction visible. In Tai Tham, there are very few CCV-CVC visual homographs in native words because of the phonological structure of Northern Thai, and one can usually guess whether the word is CCV or CVC. Richard. From asmusf at ix.netcom.com Tue Jan 10 19:25:06 2017 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Tue, 10 Jan 2017 17:25:06 -0800 Subject: Specification of Encoding of Plain Text In-Reply-To: <20170110225457.56e581ca@JRWUBU2> References: <20170109222414.72f83204@JRWUBU2> <20170110204430.6e580f72@JRWUBU2> <7c945443-1d67-e4df-c475-b5ba3b5bc342@ix.netcom.com> <20170110225457.56e581ca@JRWUBU2> Message-ID: On 1/10/2017 2:54 PM, Richard Wordingham wrote: > On Tue, 10 Jan 2017 13:12:47 -0800 > Asmus Freytag wrote: > >> Unicode clearly doesn't forbid most sequences in complex scripts, >> even if they cannot be expected to render properly and otherwise >> would stump the native reader. > Is this expectation based on sequence enforcement in the renderer? 
The
> main problem with getting text to render reasonably (not necessarily as
> desired) is now anti-phishing.

You mean anti-spoofing. There are many types of phishing attempts that do not rely on spoofing identifiers.

There are many different tacks that can be taken to make spoofing more difficult. Among them, for critical identifiers:

1) allow only a restricted repertoire
2) disallow certain sequences
3) use a registry and
   3a) define sets of labels that overlap (variant sets)
   3b) restrict actual labels to be in disjoint sets (one label blocks all others in the same variant set)

The ICANN work on creating label generation rules attempts to implement these strategies (currently for 28 scripts in the Root Zone of the DNS). The work on the first half dozen scripts is basically completed.

> The Unicode standard does define what
> short sequences of characters mean. The problem is that then, outside
> the Apple world, it seems to be left to Microsoft to decide what longer
> sequences they will allow.

MS and Apple are not the only ones writing renderers.

>
>> The advantage of the text I brought to your attention is the way it
>> is formalized and that it was created with local expertise. The
>> disadvantage from your perspective is that the scope does not match
>> with your intended use case.
> Perhaps ICANN will be the industry-wide definer. However, to stay with
> Indic rendering, one may have cases where CVC and CCV orthographic
> syllables have little to no visible difference. The Khmer writing
> system once made much greater use of CVC syllables. For reproducing
> older texts, one might be forced to encode phonetic CVC as though it
> were CCV.

The restrictions on sequences that are appropriate as an anti-spoofing measure are not appropriate for general encoded text! For one, the Root Zone explicitly disallows anything that's not in "widespread everyday" use.
This covers most transcriptions of "historic" texts, as well as religious or technical (phonetic) notations and transcriptions. But restriction of repertoire and sequences goes only so far. You will always have a residual set of labels that overlap to a degree that users do not reliably distinguish them. (Actually many disjoint sets of overlapping labels). The hard core of these are labels that appear (practically) identical. There's a further aura of more or less confusables. Mathematically these two behave differently: a set of (practically) identical labels is symmetric and transitive, while a set of merely similar labels may be symmetric, but is not transitive. If A is equivalent to B and B to C then A is equivalent to C (transitivity). However, for merely similar labels there's a non-zero "similarity distance", if you will. If you try to chain similarity together via transitivity then you might exceed a similarity threshold and your end points (e.g. A and C above) may both be similar to B but not (sufficiently) to each other. The project I'm involved in tackles only transitive forms of equivalence (whether visual or semantic). Collisions based on these equivalences can be handled with label generation rulesets defined per RFC 7940, which allow registration policies that are automated. The further "halo" of "merely" similar labels needs to be handled with additional technology that can handle concepts like similarity distance. From a Unicode perspective, there's a virtue in not over specifying sequences, because you don't want to be caught having to re-encode entire scripts should the conventions for the use of the elements making up the script change in an orthography reform! That does not mean that Unicode (at all times) endorses all permutations of free-form sequences as equally valid. A./ > > This is already the case, through error rather than design, > with the Thai script in Tai Tham. 
This affects about 30% of the > Northern Thai lexicon*, and I believe even a higher proportion when > adjusted for word frequency. Now, to fight phishing, I have always > believed that some brutal folding would be required for Tai Tham, which > is why I suggested that the S.SA ligature be encoded (U+1A54 TAI THAM > LETTER GREAT SA). > > *I've sampled the MFL dictionary. I suspect a bias to untruncated forms > in loans from Pali, such as _kathina_ rather than _kathin_. If my > suspicion is correct, the proportion would be even higher. > > However, I believe there is some advantage in distinguishing CVC and > CCV at the code level, even where there is no visual difference. To > display small visual differences, perhaps we will be forced to beg for > mark-up to make the distinction visible. > > In Tai Tham, there are very few CCV-CVC visual homographs in native > words because of the phonological structure of Northern Thai, and one > can usually guess whether the word is CCV or CVC. > > Richard. > From charupdate at orange.fr Wed Jan 11 00:00:52 2017 From: charupdate at orange.fr (Marcel Schneider) Date: Wed, 11 Jan 2017 07:00:52 +0100 (CET) Subject: Superscript and Subscript Characters in General Use In-Reply-To: References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com> <2039959835.345.1483595818937.JavaMail.www@wwinf1k39> <104200337.5858.1483616029614.JavaMail.www@wwinf1k39> <538089927.246.1483681334517.JavaMail.www@wwinf1p15> <828252581.19365.1483729345941.JavaMail.www@wwinf1p15> <1112132817.50.1483847284449.JavaMail.www@wwinf1p17> <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17> <652345901.17960.1483909510323.JavaMail.www@wwinf1p19> <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19> <572373065.27540.1483997980333.JavaMail.www@wwinf1p25> Message-ID: <1469162895.239.1484114453124.JavaMail.www@wwinf1h11> On Mon, 9 Jan 2017 14:34:17 -0800, Asmus Freytag wrote: [?] > Just get over it [?] 
We have been facing strong user demand since the early standards. Actually I cannot. Sorry. Thank you however for all of your feedback.

On Tue, 10 Jan 2017 11:03:24 +0000, Alastair Houghton wrote:
[...]
> [...] I think also that the thread is increasingly verbose and hard to follow.

It's very hard for me too. But I'll try to be concise. Thank you for getting involved in the issue.

> [...] for limited use in "plain text"-only contexts (Twitter, for instance).

The phenomenon isn't actually limited to plain text environments. See:

http://stackoverflow.com/questions/13878772/how-to-display-classic-fractions-in-css-javascript
| You can also use the straight unicode approach to render ¹⁹⁄₄₅:
|
| ¹⁹⁄₄₅
|
| (See the wikipedia article.)
https://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts

On Tue, 10 Jan 2017 20:51:21 +0000, Richard Wordingham wrote:
[...]
> I would suggest using a pair of variation selectors instead. It's no
> messier than ideographic compatibility characters, and I think it is
> actually less messy. However, I would further suggest creating the
> variation sequences only when the corresponding superscript or subscript
> form does not exist.

This clearly advocates the current use of the superscript and subscript forms. Thank you for considering the issue.

Thanks to all who responded in these threads.

Converting preformatted to TeX formatted: My conversion macro was too simplistic, it was made up hastily, sorry. An improved version (perhaps overkill now) is attached below. I'd have liked to port it to Vim, too. The macros for productivity suites are sadly still missing.

Kind regards,

Marcel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: P-to-F_for_Notepad++.xml Type: text/xml Size: 91768 bytes Desc: not available URL: From richard.wordingham at ntlworld.com Wed Jan 11 02:32:12 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 11 Jan 2017 08:32:12 +0000 Subject: Superscript and Subscript Characters in General Use In-Reply-To: <1469162895.239.1484114453124.JavaMail.www@wwinf1h11> References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com> <2039959835.345.1483595818937.JavaMail.www@wwinf1k39> <104200337.5858.1483616029614.JavaMail.www@wwinf1k39> <538089927.246.1483681334517.JavaMail.www@wwinf1p15> <828252581.19365.1483729345941.JavaMail.www@wwinf1p15> <1112132817.50.1483847284449.JavaMail.www@wwinf1p17> <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17> <652345901.17960.1483909510323.JavaMail.www@wwinf1p19> <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19> <572373065.27540.1483997980333.JavaMail.www@wwinf1p25> <1469162895.239.1484114453124.JavaMail.www@wwinf1h11> Message-ID: <20170111083212.476f492e@JRWUBU2> On Wed, 11 Jan 2017 07:00:52 +0100 (CET) Marcel Schneider wrote: > The phenomenon isn?t actually limited to plain text environments. See: > > http://stackoverflow.com/questions/13878772/how-to-display-classic-fractions-in-css-javascript > | You can also use the straight unicode approach to render ?????: > | > | ¹⁹⁄₄₅ > | > | (See the wikipedia article.) > https://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts If you follow the link from that page to https://en.wikipedia.org/wiki/Subscript_and_superscript , you will notice an immediate issue with the position of the subscripts. This is why the use of explicitly coded subscript and superscript digits for vulgar fractions is not recommended. Rather, one needs to hope that the font one is using supports U+2044 FRACTION SLASH. 
As not all fonts support all superscript and subscript digits, text using them may render badly, whereas U+2044 itself will usually be rendered at least tolerably even if the glyph comes from a different font to the digits. The truly straight Unicode approach in HTML is to use 19⁄45. Just entering those 5 characters into a text entry box in Firefox gave me a properly formatted vulgar fraction. That is how vulgar fractions are supposed to work. Unfortunately, one may need to avoid 'exciting new fonts' in favour of those with a large, working repertoire. Richard. From charupdate at orange.fr Wed Jan 11 08:20:21 2017 From: charupdate at orange.fr (Marcel Schneider) Date: Wed, 11 Jan 2017 15:20:21 +0100 (CET) Subject: Superscript and Subscript Characters in General Use In-Reply-To: <20170111083212.476f492e@JRWUBU2> References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com> <2039959835.345.1483595818937.JavaMail.www@wwinf1k39> <104200337.5858.1483616029614.JavaMail.www@wwinf1k39> <538089927.246.1483681334517.JavaMail.www@wwinf1p15> <828252581.19365.1483729345941.JavaMail.www@wwinf1p15> <1112132817.50.1483847284449.JavaMail.www@wwinf1p17> <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17> <652345901.17960.1483909510323.JavaMail.www@wwinf1p19> <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19> <572373065.27540.1483997980333.JavaMail.www@wwinf1p25> <1469162895.239.1484114453124.JavaMail.www@wwinf1h11> <20170111083212.476f492e@JRWUBU2> Message-ID: <841195008.11765.1484144421611.JavaMail.www@wwinf1p08> On Wed, 11 Jan 2017 08:32:12 +0000, Richard Wordingham wrote: > > On Wed, 11 Jan 2017 07:00:52 +0100 (CET) > Marcel Schneider wrote: > > > The phenomenon isn?t actually limited to plain text environments. 
See:
> >
> > http://stackoverflow.com/questions/13878772/how-to-display-classic-fractions-in-css-javascript
> > | You can also use the straight unicode approach to render ¹⁹⁄₄₅:
> > |
> > | ¹⁹⁄₄₅
> > |
> > | (See the wikipedia article.)
> > https://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts
>
> If you follow the link from that page to
> https://en.wikipedia.org/wiki/Subscript_and_superscript , you will
> notice an immediate issue with the position of the subscripts. This is
> why the use of explicitly coded subscript and superscript digits for
> vulgar fractions is not recommended. Rather, one needs to hope that
> the font one is using supports U+2044 FRACTION SLASH. As not all fonts
> support all superscript and subscript digits, text using them may
> render badly, whereas U+2044 itself will usually be rendered at least
> tolerably even if the glyph comes from a different font to the digits.
>
> The truly straight Unicode approach in HTML is to use 19⁄45.
> Just entering those 5 characters into a text entry box in Firefox gave
> me a properly formatted vulgar fraction. That is how vulgar fractions
> are supposed to work. Unfortunately, one may need to avoid 'exciting
> new fonts' in favour of those with a large, working repertoire.

Thank you for these hints! Too bad I hadn't checked this. I'm glad to see that browsers and some fonts already support the standard way of writing custom fractions in plain text, with correct glyph substitution.

I've added this info on the fly to the following two articles:

https://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts#Uses
https://en.wikipedia.org/wiki/Slash_(punctuation)#Fractions

Hence, one part of the issue is solved. As for the main part,
the use of modifier letters as ordinal indicators, and possibly in (other) abbreviations, user demand is reflected in the article that you cited: for Galician, Italian, Portuguese and Spanish, the preformatted ordinal indicators are used, while for French, formatting is applied:

https://en.wikipedia.org/wiki/Subscript_and_superscript#Unicode

If this use of formatting were straightforward, it would be constant. In practice, by contrast, it turns out to be a fallback, due to the (supposed) lack of the appropriate preformatted letters.

I also note that the font used from the body font stack is outdated, since it doesn't render the fractions properly (whereas the monospaced font in the editing dialog does, as in the Unicode contact form that I tried first). A frequent idea is then to use preformatted digits to imitate proper rendering, all the more so as, according to Wikipedia, this is so popular that current fonts have even repurposed the super/sub scripts. Arial Unicode MS would thus fall into this category. As you point out, the straightforward action is to use a font with full support of U+2044, and thus to override the default font with e.g. Cambria.

It's good to know about these issues, and this will help me a lot when writing up the documentation.

Best regards,
Marcel

From kenwhistler at att.net Wed Jan 11 13:37:47 2017
From: kenwhistler at att.net (Ken Whistler)
Date: Wed, 11 Jan 2017 11:37:47 -0800
Subject: On the upcoming LATIN LETTER SMALL CAPITAL Q
In-Reply-To:
References:
Message-ID: <4ab7d0ef-435a-964b-b3d3-71b328320997@att.net>

This is a character under ballot for Amendment 1 to the 5th edition. It isn't part of the repertoire planned for publication as part of Unicode 10.0 in June.

So if you want to have any impact on the subhead used in the charts for A7AF, the correct mechanism now is to get a national body comment added in their vote on Amendment 1.
Either that, or just put in tickler in your calendar for February, 201*8*, when the beta review for Unicode *11* will be starting, so you can then make a suggestion as part of the Unicode beta review period. Otherwise, these suggestions are just going to end up lost under the pile of the subsequent 13 months worth of email on unrelated topics. ;-) --Ken On 12/27/2016 8:44 PM, Yif?n W?ng wrote: > Now I start to wonder if the description would be "Letter for > phonetics and Japanese phonology" or "Letter for scholarly > transcription" etc. > > 2016-12-27 18:54 GMT+09:00 Denis Jacquerye : >> For what it?s worth, the small capital q was used as an IPA symbol for a >> while. It was used for the Arabic ?ayn as a ?consonne roule?e gutturale? in >> the 1898 IPA chart (previously noted 3 in the 1894 IPA charts and ? in some >> 1895 IPA charts and later charts) then as a ?consonne fricative bronchiale >> sonore? in the 1905 and 1908 IPA charts, and in the notes after the IPA >> chart in 1912. It was eventually replaced with the reversed glottal stop ?, >> for example in the 1932 IPA chart or later charts. > From fabiang at radgametools.com Wed Jan 11 15:56:26 2017 From: fabiang at radgametools.com (Fabian Giesen) Date: Wed, 11 Jan 2017 13:56:26 -0800 Subject: UAX #9 (Bidirectional algorithm) reference implementations In-Reply-To: <2b433250-fba1-7d7f-1785-694ef25f96bf@att.net> References: <8f858db3-1f9e-e969-5dad-6aa26f0a577e@radgametools.com> <2b433250-fba1-7d7f-1785-694ef25f96bf@att.net> Message-ID: <8617b5a8-579a-3437-de29-b5fd47e0ca3c@radgametools.com> On 12/9/2016 7:04 AM, Ken Whistler wrote: > About the bug you note in BidiReferenceC, I'll investigate. Any news on this? 
Thanks, -Fabian From richard.wordingham at ntlworld.com Wed Jan 11 20:56:19 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 12 Jan 2017 02:56:19 +0000 Subject: Specification of Encoding of Plain Text In-Reply-To: References: <20170109222414.72f83204@JRWUBU2> <20170110204430.6e580f72@JRWUBU2> <7c945443-1d67-e4df-c475-b5ba3b5bc342@ix.netcom.com> <20170110225457.56e581ca@JRWUBU2> Message-ID: <20170112025619.6b0bc28d@JRWUBU2> On Tue, 10 Jan 2017 17:25:06 -0800 Asmus Freytag wrote: > On 1/10/2017 2:54 PM, Richard Wordingham wrote: > There are many different tacks that can be taken to make spoofing > more difficult. > > Among them, for critical identifiers: > 1) allow only a restricted repertoire > 2) disallow certain sequences > 3) use a registry and > 3a) define sets of labels that overlap (variant sets) > 3b) restrict actual labels to be in disjoint sets > (one label blocks all others in the same variant set) > > The ICANN work on creating label generation rules attempts to > implement these strategies (currently for 28 scripts in the Root Zone > of the DNS). The > work on the first half dozen scripts is basically completed. > > > The Unicode standard does define what > > short sequences of characters mean. The problem is that then, > > outside the Apple world, it seems to be left to Microsoft to decide > > what longer sequences they will allow. > > MS and Apple are not the only ones writing renderers. HarfBuzz OpenType rendering tries to follow MS. That includes dotted circles. However, it will challenge the MS lead when it is blatantly wrong. In particular, it has a policy of rendering canonically equivalent text the same, though that is a challenge when emulating USE. So far as I am aware, M17n is not in wide use. It's tolerant, but one's text won't go far if it relies on M17n. Text can travel with a graphite font, but that is limiting. Sooner or later, one will want most text to work with different fonts. 
I'm having trouble digging up hard facts about InDesign's rendering, so I don't know how willing it is to be different to Microsoft's. > > Perhaps ICANN will be the industry-wide definer. However, to stay > > with Indic rendering, one may have cases where CVC and CCV > > orthographic syllables have little to no visible difference. The > > Khmer writing system once made much greater use of CVC syllables. > > For reproducing older texts, one might be forced to encode phonetic > > CVC as though it were CCV. > The restriction on sequences appropriate as an anti-spoofing measure > are not appropriate on general encoded text! So ICANN will at best serve to indicate sequences that should be renderable. > The project I'm involved in tackles only transitive forms of > equivalence (whether visual or semantic). > Collisions based on these equivalences can be handled with label > generation rulesets defined per RFC 7940, which allow registration > policies that are automated. > The further "halo" of "merely" similar labels needs to be handled > with additional technology that can handle concepts like similarity > distance. 'Merely' similar CCV and CVC tend to differ when the vowel is above the consonant and the subscript consonant is spacing, e.g. because it rises to the hanging baseline. The difference, which is in vowel placement, is comparable to the variation within one person's handwriting. However, the difference in mean position seems to be statistically significant. The inequivalence issue starts to arise with spacing vowels, which is when one may find marks being applied to syllables rather than to individual glyphs. > From a Unicode perspective, there's a virtue in not over specifying > sequences, because you don't want to be caught having to re-encode > entire scripts should the conventions for the use of the elements > making up the script change in an orthography reform! This seems to run counter to Mark's idea of regexes defining scripts' words. 
> That does not mean that Unicode (at all times) endorses all > permutations of free-form sequences as equally valid. Just as well, as such freedom runs counter to the principle of avoiding inequivalent encodings of the same thing. Richard. From duerst at it.aoyama.ac.jp Wed Jan 11 21:24:29 2017 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Thu, 12 Jan 2017 12:24:29 +0900 Subject: Superscript and Subscript Characters in General Use In-Reply-To: <20170111083212.476f492e@JRWUBU2> References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com> <2039959835.345.1483595818937.JavaMail.www@wwinf1k39> <104200337.5858.1483616029614.JavaMail.www@wwinf1k39> <538089927.246.1483681334517.JavaMail.www@wwinf1p15> <828252581.19365.1483729345941.JavaMail.www@wwinf1p15> <1112132817.50.1483847284449.JavaMail.www@wwinf1p17> <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17> <652345901.17960.1483909510323.JavaMail.www@wwinf1p19> <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19> <572373065.27540.1483997980333.JavaMail.www@wwinf1p25> <1469162895.239.1484114453124.JavaMail.www@wwinf1h11> <20170111083212.476f492e@JRWUBU2> Message-ID: <93666313-a432-c3b0-44dc-6edd43c19eda@it.aoyama.ac.jp> On 2017/01/11 17:32, Richard Wordingham wrote: > The truly straight Unicode approach in HTML is to use 19⁄45. > Just entering those 5 characters into a text entry box in Firefox gave > me a properly formatted vulgar fraction. That is how vulgar fractions > are supposed to work. Unfortunately, one may need to avoid 'exciting > new fonts' in favour of those with a large, working repertoire. Just for the record: The vulgar fraction display also happened in Thunderbird (on Windows). Firefox and Thunderbird use the same display engine. I have switched HTML display off, because I prefer to read all my mail in plain text, but it still worked. Regards, Martin. 
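The two encodings of 19/45 discussed in this thread can be compared code point by code point. The following is only an illustrative sketch in Python, not something posted to the list; the function names (plain_fraction, preformatted_fraction, code_points) are made up for the example. It builds the 'digits U+2044 digits' form, which leaves the fraction layout to the font and shaping engine, and the superscript/subscript form, which hard-codes the raised and lowered digits.

```python
# Illustrative sketch only: helper names are hypothetical, not from any library.

FRACTION_SLASH = "\u2044"  # U+2044 FRACTION SLASH

# ASCII digits 0-9 mapped to the Unicode superscript and subscript digits.
_TO_SUPERSCRIPT = str.maketrans(
    "0123456789",
    "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079",
)
_TO_SUBSCRIPT = str.maketrans(
    "0123456789",
    "\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089",
)


def plain_fraction(numerator: int, denominator: int) -> str:
    """Ordinary digits around U+2044; rendering is left to font and shaper."""
    return f"{numerator}{FRACTION_SLASH}{denominator}"


def preformatted_fraction(numerator: int, denominator: int) -> str:
    """Superscript numerator, U+2044, subscript denominator."""
    return (
        str(numerator).translate(_TO_SUPERSCRIPT)
        + FRACTION_SLASH
        + str(denominator).translate(_TO_SUBSCRIPT)
    )


def code_points(text: str) -> str:
    """List the code points of a string in U+XXXX notation."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


if __name__ == "__main__":
    for label, s in (
        ("plain", plain_fraction(19, 45)),
        ("preformatted", preformatted_fraction(19, 45)),
    ):
        print(label, s, code_points(s))
```

Whether the first form actually displays as a vulgar fraction depends on the renderer and the font, as reported in this thread; the code only shows which characters are stored in each case.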
From 747.neutron at gmail.com Wed Jan 11 23:39:49 2017 From: 747.neutron at gmail.com (=?UTF-8?B?WWlmw6FuIFfDoW5n?=) Date: Thu, 12 Jan 2017 14:39:49 +0900 Subject: On the upcoming LATIN LETTER SMALL CAPITAL Q In-Reply-To: <4ab7d0ef-435a-964b-b3d3-71b328320997@att.net> References: <4ab7d0ef-435a-964b-b3d3-71b328320997@att.net> Message-ID: > This is a character under ballot for Amendment 1 to the 5th edition. It > isn't part of the repertoire planned for publication as part of Unicode 10.0 > in June. I see. Thank you for the information. I'll remember it until Unicode 11's term. 2017-01-12 4:37 GMT+09:00 Ken Whistler : > This is a character under ballot for Amendment 1 to the 5th edition. It > isn't part of the repertoire planned for publication as part of Unicode 10.0 > in June. > > So if you want to have any impact on the subhead used in the charts for > A7AF, the correct mechanism now is to get a national body comment added in > their vote on Amendment 1. > > Either that, or just put in tickler in your calendar for February, 201*8*, > when the beta review for Unicode *11* will be starting, so you can then make > a suggestion as part of the Unicode beta review period. > > Otherwise, these suggestions are just going to end up lost under the pile of > the subsequent 13 months worth of email on unrelated topics. ;-) > > --Ken > > > > On 12/27/2016 8:44 PM, Yif?n W?ng wrote: >> >> Now I start to wonder if the description would be "Letter for >> phonetics and Japanese phonology" or "Letter for scholarly >> transcription" etc. >> >> 2016-12-27 18:54 GMT+09:00 Denis Jacquerye : >>> >>> For what it?s worth, the small capital q was used as an IPA symbol for a >>> while. It was used for the Arabic ?ayn as a ?consonne roule?e gutturale? >>> in >>> the 1898 IPA chart (previously noted 3 in the 1894 IPA charts and ? in >>> some >>> 1895 IPA charts and later charts) then as a ?consonne fricative >>> bronchiale >>> sonore? 
in the 1905 and 1908 IPA charts, and in the notes after the IPA >>> chart in 1912. It was eventually replaced with the reversed glottal stop >>> ?, >>> for example in the 1932 IPA chart or later charts. >> >> > From khaledhosny at eglug.org Thu Jan 12 00:35:24 2017 From: khaledhosny at eglug.org (Khaled Hosny) Date: Thu, 12 Jan 2017 08:35:24 +0200 Subject: Superscript and Subscript Characters in General Use In-Reply-To: <93666313-a432-c3b0-44dc-6edd43c19eda@it.aoyama.ac.jp> References: <828252581.19365.1483729345941.JavaMail.www@wwinf1p15> <1112132817.50.1483847284449.JavaMail.www@wwinf1p17> <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17> <652345901.17960.1483909510323.JavaMail.www@wwinf1p19> <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19> <572373065.27540.1483997980333.JavaMail.www@wwinf1p25> <1469162895.239.1484114453124.JavaMail.www@wwinf1h11> <20170111083212.476f492e@JRWUBU2> <93666313-a432-c3b0-44dc-6edd43c19eda@it.aoyama.ac.jp> Message-ID: <20170112063524.GF14923@macbook> On Thu, Jan 12, 2017 at 12:24:29PM +0900, Martin J. D?rst wrote: > On 2017/01/11 17:32, Richard Wordingham wrote: > > > The truly straight Unicode approach in HTML is to use 19⁄45. > > Just entering those 5 characters into a text entry box in Firefox gave > > me a properly formatted vulgar fraction. That is how vulgar fractions > > are supposed to work. Unfortunately, one may need to avoid 'exciting > > new fonts' in favour of those with a large, working repertoire. > > Just for the record: The vulgar fraction display also happened in > Thunderbird (on Windows). Firefox and Thunderbird use the same display > engine. I have switched HTML display off, because I prefer to read all my > mail in plain text, but it still worked. This is done by HarfBuzz which automatically activates OpenType frac/dnom/numr features for sequences, so if the font has the features one gets vulgar fractions out of box. 
This works in Chrome as well since it uses HarfBuzz (older versions of Chrome didn't enable HarfBuzz by default for Latin, so the fractions might not show there).

Regards,
Khaled

From mark at macchiato.com Thu Jan 12 07:12:09 2017
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Thu, 12 Jan 2017 14:12:09 +0100
Subject: Specification of Encoding of Plain Text
In-Reply-To: <20170110194013.0476f15f@JRWUBU2>
References: <20170109222414.72f83204@JRWUBU2> <20170110194013.0476f15f@JRWUBU2>
Message-ID:

On Tue, Jan 10, 2017 at 8:40 PM, Richard Wordingham <richard.wordingham at ntlworld.com> wrote:

> On Tue, 10 Jan 2017 10:11:41 +0100 Mark Davis ☕️ wrote:
>
> > What I really wish we had would be a machine readable set of regexes for each complex script (and for each language-script combination that is different than the default for that script).
>
> What would the status of these regexes be? For example, the Khmer script already has a regex for words sensu stricto, but there doesn't seem to be any formal requirement to conform to it or, more immediately usefully to users, attempt to support it if one claims to support Khmer.

I think the goal would be to provide guidance on the preferred ordering/choice of code points for representing a particular visual order of glyphs. That is, help to guide the usage of characters in complex scripts. The target wouldn't even be all scripts, but rather complex ones, where it may not be simple to determine the ordering of code points. And as Asmus said, the goal would be sufficiently "detailed to let you find out whether you are using characters as intended, or not".

> I like the idea, but it seems to have a lot of nits, which I shall now pick.

I'm sure there are plenty; those are just an opener.

> The regexes should also be human-comprehensible.

I agree that comprehension is a goal. I'd imagine using a BNF regex, like the following.
This is simple, since I'm just doing Latin, but you can see what I mean.

word = base* ;
base = (latinLetter latinMn*) ;
latinLetter = [[:scx=Latn:]&[:L:]] ;
latinMn = [[:scx=Latn:][:scx=Common:]&[:Mn:]] ;

which turns into the single regex expression:

([[:scx=Latn:]&[:L:]][[:scx=Latn:][:scx=Common:]&[:Mn:]]*)*

See:
http://unicode.org/cldr/utility/bnf.jsp?a=word=base*;%0Dbase=(latinLetter+latinMn*);%0DlatinLetter=[[:scx=Latn:]%26[:L:]];%0DlatinMn=[[:scx=Latn:][:scx=Common:]%26[:Mn:]] ;

A more complex script might have:

word = prefix base* postfix ;
...

One could draw on the work done in HarfBuzz and the Universal Shaping Engine to push this along for different scripts.

> I'm dubious of the concept of each language-script combination potentially having a regex,

I think a language-script combination is only useful if it must vary from the default for the script.

> or indeed of the script having a *default* regex.
> Would this be used to do the equivalent of saying that English doesn't have the letter thorn, or, for example, prohibiting most complex onsets from Lao?

And for those scripts, the goal would be to represent the core functioning of the script. So it could be broader than what is needed for any particular language using that script.

> > Such a regex R could be used for determining the well-formed ordering of code points within words. The regex need not be for syllables, or grapheme clusters, or any other formal construct. The *only* requirement it would need to fulfill is that you could determine well-formed words with:
> >
> > word := (R)+
> >
> > That is, if R were (C V C? | V C?) then any of CVC CVCVC VC V CV would pass the test, but CCV would fail. Ideally R would be as simple as possible (but no simpler).
>
> Several Indian languages only allow independent vowels word initially. You wouldn't be able to capture that with (R)+.
>

That was a typo, should have been just R (which could have more complex internal structure with repetition, as above).

> Would the regexes be on strings or on traces (strings modulo canonical equivalence)? The language recognised by the regex for the Universal Shaping Engine (USE) is notoriously not closed under canonical equivalence.

Unclear as yet to me what would be the most useful.

> Most non-spacing marks should not occur double - though I think the most significant trouble with them is with fonts that won't then show them double. Barring them could make for a tricky regex. But, if we applied that to the Latin script, should we allow f̂̂ (the Fourier transform of the Fourier transform of f) as a word? Tibetan allows some non-spacing marks to occur triple.

There is always a choice as to how strict to make them. The goal shouldn't be so tight as to exclude legitimate words, and trying to be too fine-grained can make the expressions overly complicated. Moreover there isn't any question as to how "f̂̂ (the Fourier transform of the Fourier transform of f)" would be spelled, so no need to exclude it. But preventing spoofing wouldn't be the goal.

> Richard.
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From charupdate at orange.fr Thu Jan 12 08:22:18 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Thu, 12 Jan 2017 15:22:18 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <20170112063524.GF14923@macbook>
References: <828252581.19365.1483729345941.JavaMail.www@wwinf1p15> <1112132817.50.1483847284449.JavaMail.www@wwinf1p17> <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17> <652345901.17960.1483909510323.JavaMail.www@wwinf1p19> <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19> <572373065.27540.1483997980333.JavaMail.www@wwinf1p25> <1469162895.239.1484114453124.JavaMail.www@wwinf1h11> <20170111083212.476f492e@JRWUBU2> <93666313-a432-c3b0-44dc-6edd43c19eda@it.aoyama.ac.jp> <20170112063524.GF14923@macbook>
Message-ID: <1634058514.13007.1484230938837.JavaMail.www@wwinf1p25>

On 12 Jan 2017 08:35:24 +0200, Khaled Hosny wrote:
>
> On Thu, Jan 12, 2017 at 12:24:29PM +0900, Martin J. Dürst wrote:
> > On 2017/01/11 17:32, Richard Wordingham wrote:
> >
> > > The truly straight Unicode approach in HTML is to use 19⁄45. Just entering those 5 characters into a text entry box in Firefox gave me a properly formatted vulgar fraction. That is how vulgar fractions are supposed to work. Unfortunately, one may need to avoid 'exciting new fonts' in favour of those with a large, working repertoire.

Even Times New Roman turned out to be obsolete from this viewpoint, while Cambria (and Consolas) do work. I should make a comprehensive overview of all fonts, perhaps in a dedicated article "Fraction Slash" on Wikipedia (one seems to have existed).

> >
> > Just for the record: The vulgar fraction display also happened in Thunderbird (on Windows).

It doesn't work for me.

> > Firefox and Thunderbird use the same display engine. I have switched HTML display off, because I prefer to read all my mail in plain text,

This is one more reason to make plain text more capable and comprehensive.
But when an e-mail is written in HTML, turning it to plain text usually doesn't convert the HTML formatting to plain text markup. I say "usually" because this is exactly what HyperMail does, which builds the Unicode Mailing List Archives. At least, the start of superscript is converted to a ^. How do you deal with the loss of content information, such as stress, superscript, and so on?

> > but it still worked.

It seems to me that in this use case, it will be even more likely to work, given that the plain text font of Firefox and Chrome (assuming that this is used in the text boxes of all websites) is up-to-date, while most font families used for HTML aren't. Though I must find a way to update my system fonts.

>
> This is done by HarfBuzz which automatically activates OpenType frac/dnom/numr features for sequences, so if the font has the features one gets vulgar fractions out of box.

According to Wikipedia ( https://en.wikipedia.org/wiki/HarfBuzz#Major_users ), HarfBuzz is included in LibreOffice too, but being on Windows, despite having just installed the brand-new version 5.2.4.2, I still don't get it, since it comes with 5.3:
https://wiki.documentfoundation.org/ReleaseNotes/5.3#Text_Layout

Thanks anyway!

Until then, I shall probably keep inputting fractions as superscript-subscript sequences, given that many fonts do have the appropriate glyphs for fractions mapped to the Unicode super/sub scripts, while application formatting for fractions (which I thought TUS was referring to) is available in desktop publishing software only, and the default super/sub formatting doesn't match the requirements for vulgar fractions.

> This works in Chrome as well since it uses HarfBuzz (older version of Chrome didn't enable HarfBuzz by default for Latin so the fractions might not show there).

This raises a compatibility issue.
Having tested my page on Chrome, where I get the fractions right in some fonts like Cambria, I'm about to switch the default typeface from Tahoma to Cambria (or some other one, if I find another proportional font out there that does work as intended). But what will happen when somebody loads the page into another browser (Edge, Safari, Opera, IE)? I guess that the collateral damage (of being tagged as a careless and sloppy typographer) is minimized when I use a proven and stable feature like composing fractions following the 'U+00B9 U+2079 U+2044 U+2084 U+2085' pattern, compared with using the (really straightforward) '19 U+2044 45' pattern in its stead.

Therefore, I'm interested in learning for what reasons the widespread and thorough implementation of a feature like the Unicode behavior of U+2044 FRACTION SLASH takes more than fifteen years, if it will ever be thoroughly implemented!

Regards,
Marcel

From khaledhosny at eglug.org Thu Jan 12 09:01:41 2017
From: khaledhosny at eglug.org (Khaled Hosny)
Date: Thu, 12 Jan 2017 17:01:41 +0200
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <1634058514.13007.1484230938837.JavaMail.www@wwinf1p25>
References: <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17> <652345901.17960.1483909510323.JavaMail.www@wwinf1p19> <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19> <572373065.27540.1483997980333.JavaMail.www@wwinf1p25> <1469162895.239.1484114453124.JavaMail.www@wwinf1h11> <20170111083212.476f492e@JRWUBU2> <93666313-a432-c3b0-44dc-6edd43c19eda@it.aoyama.ac.jp> <20170112063524.GF14923@macbook> <1634058514.13007.1484230938837.JavaMail.www@wwinf1p25>
Message-ID: <20170112150141.GG14923@macbook>

On Thu, Jan 12, 2017 at 03:22:18PM +0100, Marcel Schneider wrote:
> > This is done by HarfBuzz which automatically activates OpenType frac/dnom/numr features for sequences, so if the font has the features one gets vulgar fractions out of box.
>
> According to Wikipedia ( https://en.wikipedia.org/wiki/HarfBuzz#Major_users ), HarfBuzz is included in LibreOffice too, but being on Windows, despite having just installed the brand-new version 5.2.4.2, I still don't get it, since it comes with 5.3:
> https://wiki.documentfoundation.org/ReleaseNotes/5.3#Text_Layout

LibreOffice indeed did not use HarfBuzz on Windows before 5.3, which is not released yet.

Regards,
Khaled

From charupdate at orange.fr Thu Jan 12 11:01:35 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Thu, 12 Jan 2017 18:01:35 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <20170112150141.GG14923@macbook>
References: <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17> <652345901.17960.1483909510323.JavaMail.www@wwinf1p19> <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19> <572373065.27540.1483997980333.JavaMail.www@wwinf1p25> <1469162895.239.1484114453124.JavaMail.www@wwinf1h11> <20170111083212.476f492e@JRWUBU2> <93666313-a432-c3b0-44dc-6edd43c19eda@it.aoyama.ac.jp> <20170112063524.GF14923@macbook> <1634058514.13007.1484230938837.JavaMail.www@wwinf1p25> <20170112150141.GG14923@macbook>
Message-ID: <1978028751.17843.1484240496199.JavaMail.www@wwinf1p25>

On Thu, 12 Jan 2017 17:01:41 +0200, Khaled Hosny wrote:
> >
> > According to Wikipedia ( https://en.wikipedia.org/wiki/HarfBuzz#Major_users ), HarfBuzz is included in LibreOffice too, but being on Windows, despite having just installed the brand-new version 5.2.4.2, I still don't get it, since it comes with 5.3:
> > https://wiki.documentfoundation.org/ReleaseNotes/5.3#Text_Layout
>
> LibreOffice indeed did not use HarfBuzz on Windows before 5.3, which is not released yet.
>

Thank you anyway! If I were on Linux, I'd have had it all along (my previous version, 4.2.4.2, is later than 4.1, when HarfBuzz was first included in LibreOffice).
On Windows 7, I have DirectWrite, and this is probably why Arabic glyphs are substituted before my eyes, but I can't get the fractions displayed the standard way in Internet Explorer 11, neither in a text box nor in a web page, even when using Gabriola, DirectWrite's demo font.

This is why, again, I cannot use the intended functioning of U+2044 FRACTION SLASH: when I make up a web page relying on this intended display feature, any visitors who load it in any version of Internet Explorer on Windows 7 may consider that I'm doing bad typography.

Hence again: Can any (good) reasons be identified for the following two shortcomings:

1) The implementation of U+2044, while thorough in places, still isn't widespread;

2) The use of non-Galician-Italian-Portuguese-Spanish ordinal indicators is prohibited while they are de facto available in Unicode. [1]

Regards,
Marcel

[1] According to Wikipedia:
https://en.wikipedia.org/wiki/Subscript_and_superscript#Alignment_examples
https://en.wikipedia.org/wiki/Subscript_and_superscript#Desktop_publishing
they must be even better than generic superscripting in word processors, which is considered too high and too light from a typographical point of view.

From richard.wordingham at ntlworld.com Thu Jan 12 12:42:42 2017
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 12 Jan 2017 18:42:42 +0000
Subject: Specification of Encoding of Plain Text
In-Reply-To:
References: <20170109222414.72f83204@JRWUBU2> <20170110194013.0476f15f@JRWUBU2>
Message-ID: <20170112184242.1507f3a8@JRWUBU2>

On Thu, 12 Jan 2017 14:12:09 +0100 Mark Davis ☕️ wrote:

> I agree that comprehension is a goal. I'd imagine using a BNF regex, like the following. This is simple, since I'm just doing Latin, but you can see what I mean.
> word = base* ; > base = (latinLetter latinMn*) ; > latinLetter = [[:scx=Latn:]&[:L:]] ; > latinMn = [[:scx=Latn:][:scx=Common:]&[:Mn:]] ; > > which turns into the single regex expression: > > ([[:scx=Latn:]&[:L:]][[:scx=Latn:][:scx=Common:]&[:Mn:]]*)* Ouch! That's alarmingly wrong. You've excluded the likes of English 'Ca?esar' with ZWJ, Welsh 'Llan?gollen' with CGJ (the word doesn't contain the letter 'ng') and the ISO-sanctioned transliteration of Thai SO SUEA as 's?'. Fixin? it isn't easy. At least, I assume Arabic harakat don't attach to Latin letters in your conception of Latin script text, so replacing 'scx=Common' by 'sc=Inherited' doesn't work well. The problem may be conflicting requirements on the Script_Extensions property. Richard. From mark at macchiato.com Thu Jan 12 14:03:29 2017 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Thu, 12 Jan 2017 21:03:29 +0100 Subject: Specification of Encoding of Plain Text In-Reply-To: <20170112184242.1507f3a8@JRWUBU2> References: <20170109222414.72f83204@JRWUBU2> <20170110194013.0476f15f@JRWUBU2> <20170112184242.1507f3a8@JRWUBU2> Message-ID: That was just an example off the top of my head of the format for using with regex; I don't pretend that it is vetted. Latin is not a complex script, so it was only an illustration. However, it was just brain freeze on my part to not also include Inherited or ZWJ. A more serious effort would look at some of the issues from http://unicode.org/reports/tr29/, for example. On the other hand, CGJ is not a problem: it is Mn . And (say) U+064B ARABIC FATHATAN has scx=Arabic,Syriac, so wouldn't be included. Mark On Thu, Jan 12, 2017 at 7:42 PM, Richard Wordingham < richard.wordingham at ntlworld.com> wrote: > On Thu, 12 Jan 2017 14:12:09 +0100 > Mark Davis ?? wrote: > > > I agree that comprehension is a goal. I'd imagine using a BNF regex, > > like the following. This is simple, since I'm just doing Latin, but > > you can see what I mean. 
> > > word = base* ; > > base = (latinLetter latinMn*) ; > > latinLetter = [[:scx=Latn:]&[:L:]] ; > > latinMn = [[:scx=Latn:][:scx=Common:]&[:Mn:]] ; > > > > which turns into the single regex expression: > > > > ([[:scx=Latn:]&[:L:]][[:scx=Latn:][:scx=Common:]&[:Mn:]]*)* > > Ouch! That's alarmingly wrong. You've excluded the likes of > English 'Ca?esar' with ZWJ, Welsh 'Llan?gollen' with CGJ (the word > doesn't contain the letter 'ng') and the ISO-sanctioned transliteration > of Thai SO SUEA as 's?'. Fixin? it isn't easy. At least, I assume > Arabic harakat don't attach to Latin letters in your conception of > Latin script text, so replacing 'scx=Common' by 'sc=Inherited' doesn't > work well. > > The problem may be conflicting requirements on the Script_Extensions > property. > > Richard. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Thu Jan 12 15:04:12 2017 From: charupdate at orange.fr (Marcel Schneider) Date: Thu, 12 Jan 2017 22:04:12 +0100 (CET) Subject: Superscript and Subscript Characters in General Use In-Reply-To: <1978028751.17843.1484240496199.JavaMail.www@wwinf1p25> References: <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17> <652345901.17960.1483909510323.JavaMail.www@wwinf1p19> <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19> <572373065.27540.1483997980333.JavaMail.www@wwinf1p25> <1469162895.239.1484114453124.JavaMail.www@wwinf1h11> <20170111083212.476f492e@JRWUBU2> <93666313-a432-c3b0-44dc-6edd43c19eda@it.aoyama.ac.jp> <20170112063524.GF14923@macbook> <1634058514.13007.1484230938837.JavaMail.www@wwinf1p25> <20170112150141.GG14923@macbook> <1978028751.17843.1484240496199.JavaMail.www@wwinf1p25> Message-ID: <1651851807.24444.1484255052680.JavaMail.www@wwinf1p25> What typically happens with the correct use of fraction slash on a collaborative website like Wikipedia, is that the superscripts and subscripts are restored, I?ve just found while trying to share the section: 
https://en.wikipedia.org/w/index.php?title=Slash_(punctuation)&diff=prev&oldid=759542943
|
| (→Fractions: Removed browser-specific information, restored hack that works on most browsers)
|
| […]
|
| […] (e.g., display of {{not a typo|11⁄12}} as 11⁄12),{{citation |title=The Unicode Standard, […]

Restored by somebody to:
|
| […] (e.g., display of {{not a typo|11⁄12}} as {{not a typo|¹¹⁄₁₂}}),{{citation |title=The Unicode Standard, […]
|

Thus, OK for the "hack." Whether that hack is undisciplined or not now becomes a better question. In my opinion, the lack of discipline is rather found in vendors of persistently non-conformant software. I wouldn't bother them, though, if only Unicode could accept that users who need to work with the software need to work around it. "Couldn't Unicode follow Microsoft?" And follow their users, please.

Consequently, one ought to remember what a keyboard layout really is: a facility to help people input the characters they need and use. Therefore, complete ones should support the input of fractions composed with super/sub scripts and U+2044, and as for Unicode, the Consortium should allow people to write fractions this way if they cannot afford to write them in the standard way. Mentioning this in the relevant section of the Standard would avoid tagging these keyboard layout developers as hackers. (I'm not a hacker, nor am I a programmer.)

Extrapolating from this to ordinal indicators, one could consider that all the objections raised so far are based only on the lack of updated fonts and on the will of the UTC. This is why I cannot consider them as good reasons without some additional arguments.

- Fonts: The *true* FRACTION SLASH U+2044 turns out to be even less common than the superscript small letters, and we can hope that when faced with actual use, font vendors will agree to update the typefaces.

-
Formatting: This has turned out to be inappropriate whenever no fine-tuning (CSS) can be performed, so that the superscript small letters are ultimately the lesser evil, and even more appropriate in many circumstances.

• Unicode design principles: They are biased. Cf. the naming policy of the superscript small letters, declared as 'MODIFIER LETTER SMALL .', while all other instances show more straightforward identifiers and headings:

@ Latin superscript modifier letters
x (superscript latin small letter i - 2071) // (These conform to early standards)
x (superscript latin small letter n - 207F)
02B0 MODIFIER LETTER SMALL H // (Should be: LATIN SUPERSCRIPT SMALL LETTER H)
    * aspiration
    # <super> 0068
[…]
@ Latin subscript modifier letters
1D62 LATIN SUBSCRIPT SMALL LETTER I
    # <sub> 0069
[…]
@ Subscripts
[…]
2090 LATIN SUBSCRIPT SMALL LETTER A
    # <sub> 0061
2091 LATIN SUBSCRIPT SMALL LETTER E
    # <sub> 0065
[…]

Regards,
Marcel

From richard.wordingham at ntlworld.com Thu Jan 12 15:26:02 2017
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 12 Jan 2017 21:26:02 +0000
Subject: Specification of Encoding of Plain Text
In-Reply-To:
References: <20170109222414.72f83204@JRWUBU2> <20170110194013.0476f15f@JRWUBU2> <20170112184242.1507f3a8@JRWUBU2>
Message-ID: <20170112212602.00511354@JRWUBU2>

On Thu, 12 Jan 2017 21:03:29 +0100
Mark Davis ☕️ wrote:

> That was just an example off the top of my head of the format for
> using with regex; I don't pretend that it is vetted. Latin is not a
> complex script, so it was only an illustration. However, it was just
> brain freeze on my part to not also include Inherited or ZWJ. A more
> serious effort would look at some of the issues from
> http://unicode.org/reports/tr29/, for example. On the other hand, CGJ
> is not a problem: it is Mn. And (say)
> U+064B ARABIC FATHATAN has scx=Arabic,Syriac, so wouldn't be included.

Ah, I had not appreciated that sc=Inherited does not imply scx=Inherited.
Using Script_Extensions to document the international combining characters that are used, for example, with Thai bases could have all sorts of undesirable knock-on effects. Richard. From richard.wordingham at ntlworld.com Fri Jan 13 03:02:32 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 13 Jan 2017 09:02:32 +0000 Subject: Specification of Encoding of Plain Text In-Reply-To: References: <20170109222414.72f83204@JRWUBU2> <20170110194013.0476f15f@JRWUBU2> <20170112184242.1507f3a8@JRWUBU2> Message-ID: <20170113090232.536a0d12@JRWUBU2> On Thu, 12 Jan 2017 21:03:29 +0100 Mark Davis ?? wrote: > Latin is not a complex script,... Unlike the common script, which notably has U+2044 FRACTION SLASH. That statement is actually dubious from a typographical point of view. > ...so it was only an illustration. But it's good for looking for the non-obvious issues. > A more serious effort would look at some of the issues from > http://unicode.org/reports/tr29/, for example. I don't think we want to have to repeat them all for each script. Putting common-script punctuation and numbers in the regex will add obscurity, and possibly be a maintainability issue. Richard. From asmusf at ix.netcom.com Fri Jan 13 03:34:48 2017 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Fri, 13 Jan 2017 01:34:48 -0800 Subject: Specification of Encoding of Plain Text In-Reply-To: <20170113090232.536a0d12@JRWUBU2> References: <20170109222414.72f83204@JRWUBU2> <20170110194013.0476f15f@JRWUBU2> <20170112184242.1507f3a8@JRWUBU2> <20170113090232.536a0d12@JRWUBU2> Message-ID: I believe that any attempt to define a "regex" that describes *all legal text* in a given script is a-priori doomed to failure. Part of the problem is that writing systems work not unlike human grammars in a curious mixture of pretty firm rules coupled to lists of exceptions. 
(Many texts by competent authors will contain "ungrammatical" sentences that somehow work despite or because of not following the standard rules). The Khmer issue that started the discussion showed that there can a be a single word that needs to be handled exceptionally. If you try to capture all the exceptions in the general rules, the set of rules gets complicated, but is also likely to be way too permissive to be useful. The Khmer LGR for the Root Zone, for example, deliberately disallows the exception (in the word for "give") so that it can be stated (a) more compactly and (b) does not allow the exceptional sequencing of certain characters to become applicable outside the single exception. An LGR is concerned with *single* instances of each word. Even the most common word in a language can only be registered once in each zone. Therefore, such a drastic treatment is a perfectly good solution. For a rendering engine, you'd want to be much more permissive, perhaps even attempt to display patently "wrong" sequences. For a validation tool (spell checker) you would strike for some other sweet spot. Finally, to determine "first word" or "first syllable" for formatting purposes (such as "drop caps") there may yet be a different selection. As a result, I believe it would be most useful if a regex or BNF could be created for the "typical" / "idealized" description of a "word" in the various scripts. Then, depending on the facts in question, the BNF could be augmented with more or less formalized descriptions of variations, exceptions, etc. The idea would be to provide "building blocks" that can be used to assemble rules tailored to various scenarios by the reader of the standard. (Because of that, they should be part of the description section, not a data file...) Even if the BNFs did nothing more than capture succinctly the information presented in text and tables, they would be useful. 
For scripts where things like ZWJ and CGJ are optional, it doesn't make sense to run them into the standard BNF - that just messes things up. It is much more useful to provide generic context information of how to add them to existing text. For example, the CGJ is really intended to go between letters. So, describe that context. Overall, describing the local contexts for a given character or class of characters has proven to be more useful in the LGR project than attempting to write global rules. A./ On 1/13/2017 1:02 AM, Richard Wordingham wrote: > On Thu, 12 Jan 2017 21:03:29 +0100 > Mark Davis ?? wrote: > >> Latin is not a complex script,... > Unlike the common script, which notably has U+2044 FRACTION SLASH. > > That statement is actually dubious from a typographical point of view. > >> ...so it was only an illustration. > But it's good for looking for the non-obvious issues. > >> A more serious effort would look at some of the issues from >> http://unicode.org/reports/tr29/, for example. > I don't think we want to have to repeat them all for each script. > Putting common-script punctuation and numbers in the regex will add > obscurity, and possibly be a maintainability issue. > > Richard. > > From mark at macchiato.com Fri Jan 13 03:38:30 2017 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Fri, 13 Jan 2017 10:38:30 +0100 Subject: Specification of Encoding of Plain Text In-Reply-To: <20170112212602.00511354@JRWUBU2> References: <20170109222414.72f83204@JRWUBU2> <20170110194013.0476f15f@JRWUBU2> <20170112184242.1507f3a8@JRWUBU2> <20170112212602.00511354@JRWUBU2> Message-ID: If you know of combining marks whose scx values should include Thai, please let us know. Also, by "Latin is not a complex script" I mean it in the narrow sense I stated, where the goal is the ordering of characters. 
That is, nobody would normally wonder whether 0.5 when expressed by a sequence with U+2044 FRACTION SLASH should be written as the sequence <2, U+2044 FRACTION SLASH, 1>! There will always be some edge cases, but the target is Tibetan or Myanmar, not Latin or Cyrillic. Mark On Thu, Jan 12, 2017 at 10:26 PM, Richard Wordingham < richard.wordingham at ntlworld.com> wrote: > On Thu, 12 Jan 2017 21:03:29 +0100 > Mark Davis ?? wrote: > > > That was just an example off the top of my head of the format for > > using with regex; I don't pretend that it is vetted. Latin is not a > > complex script, so it was only an illustration. However, it was just > > brain freeze on my part to not also include Inherited or ZWJ. A more > > serious effort would look at some of the issues from > > http://unicode.org/reports/tr29/, for example. On the other hand, CGJ > > is not a problem: it is Mn > > . And (say) > > U+064B ARABIC FATHATAN has scx=Arabic,Syriac, so wouldn't be included. > > Ah, I had not appreciated that sc=Inherited does not imply > scx=Inherited. Using Script_Extensions to document the international > combining characters that are used, for example, with Thai bases could > have all sorts of undesirable knock-on effects. > > Richard. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Fri Jan 13 11:47:24 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 13 Jan 2017 17:47:24 +0000 Subject: Specification of Encoding of Plain Text In-Reply-To: References: <20170109222414.72f83204@JRWUBU2> <20170110194013.0476f15f@JRWUBU2> <20170112184242.1507f3a8@JRWUBU2> <20170113090232.536a0d12@JRWUBU2> Message-ID: <20170113174724.4839b668@JRWUBU2> On Fri, 13 Jan 2017 01:34:48 -0800 Asmus Freytag wrote: > I believe that any attempt to define a "regex" that describes *all > legal text* in a given script is a-priori doomed to failure. 
> > Part of the problem is that writing systems work not unlike human > grammars in a curious mixture of pretty firm rules coupled to lists > of exceptions. (Many texts by competent authors will contain > "ungrammatical" sentences that somehow work despite or because of not > following the standard rules). The Khmer issue that started the > discussion showed that there can a be a single word that needs to be > handled exceptionally. It's a single word in the *current* orthography for the Khmer language in Cambodia. According to Michel Antelme, on pp20-1 of "Inventaire provisoire des caract?res et divers signes des ?critures khm?res pr?-modernes et modernes employ?s pour la notation du khmer, du siamois, des dialectes tha?s m?ridionaux, du sanskrit et du p?li" (http://aefek.free.fr/iso_album/antelme_bis.pdf), this manner of writing was much commoner until it was largely eliminated by a spelling reform in the first half of the 20th century. The Thai Wikipedia page on the use of the script for Thai (https://th.wikipedia.org/wiki/???????????) gives examples for final consonants with COENG VO (???? = ????), COENG NO (???? = ????) and COENG NGO (????? = ????). > If you try to capture all the exceptions in the general rules, the > set of rules gets complicated, but is also likely to be way too > permissive to be useful. If it is checking for proper use of code points, overgeneration is far preferable to undergeneration. > The Khmer LGR for the Root Zone, for example, deliberately disallows > the exception (in the word for "give") so that it can be stated (a) > more compactly and (b) does not allow the exceptional sequencing of > certain characters to become applicable outside the single exception. > > An LGR is concerned with *single* instances of each word. Even the > most common word in a language can only be registered once in each > zone. A label does not have to be a single word. 
For example, there are several, if not many, domain names matching give*.com, where the first element is clearly the word 'give'. > Even if the BNFs did nothing more than capture succinctly the > information presented in text and tables, they would be useful. > For scripts where things like ZWJ and CGJ are optional, it doesn't > make sense to run them into the standard BNF - that just messes > things up. It is much more useful to provide generic context > information of how to add them to existing text. > For example, the CGJ is really intended to go between letters. So, > describe that context. It can be quite useful next to combining marks. For example, it may be used to distinguish a diaeresis from an umlaut mark in Fraktur. Richard. From richard.wordingham at ntlworld.com Fri Jan 13 12:19:21 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 13 Jan 2017 18:19:21 +0000 Subject: Specification of Encoding of Plain Text In-Reply-To: References: <20170109222414.72f83204@JRWUBU2> <20170110194013.0476f15f@JRWUBU2> <20170112184242.1507f3a8@JRWUBU2> <20170112212602.00511354@JRWUBU2> Message-ID: <20170113181921.16967374@JRWUBU2> On Fri, 13 Jan 2017 10:38:30 +0100 Mark Davis ?? wrote: > On Thu, Jan 12, 2017 at 10:26 PM, Richard Wordingham < > richard.wordingham at ntlworld.com> wrote: > > Using Script_Extensions to document the international > > combining characters that are used, for example, with Thai bases > > could have all sorts of undesirable knock-on effects. > If you know of combining marks whose scx values should include Thai, > please let us know. If you refer to the end of TUS 9.0 Section 16.1 you will find mention of U+0331 COMBINING MACRON BELOW and U+0303 COMBINING TILDE, which are thus candidates for scx ? Latn. One might also consider U+0359 COMBINING ASTERISK BELOW; I have seen the combination ?? used in a phonetic symbol for English, representing [?]. 
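As a side note to the marks named above and the joiners discussed earlier in the thread, their names and general categories can be checked with Python's standard `unicodedata` module. This is only a minimal sketch: the helper name `describe` is made up for illustration, and `unicodedata` does not expose the Script or Script_Extensions properties, so the scx values debated here would still have to be looked up in the UCD file ScriptExtensions.txt.

```python
import unicodedata

# Code points mentioned in the thread: three combining marks proposed as
# candidates for a wider scx value, plus CGJ, ZWJ and ARABIC FATHATAN.
CODE_POINTS = [0x0303, 0x0331, 0x0359, 0x034F, 0x200D, 0x064B]


def describe(cp: int) -> tuple[str, str, str]:
    """Return (U+XXXX label, character name, general category) for a code point."""
    ch = chr(cp)
    return (f"U+{cp:04X}", unicodedata.name(ch), unicodedata.category(ch))


for cp in CODE_POINTS:
    print(*describe(cp), sep="  ")
```

CGJ comes out as Mn, which is why a class such as `[:Mn:]` already admits it, whereas ZWJ is a format character (Cf) and has to be allowed for separately.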
As their scx values are 'Inherited', should their values not be treated as though they already included Thai? I suppose, though, that they do not in fact match "p(scx=Thai)". There does seem to be a view that scx=inherited is shorthand for some list of European scripts. Richard. From asmusf at ix.netcom.com Fri Jan 13 12:27:35 2017 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Fri, 13 Jan 2017 10:27:35 -0800 Subject: Specification of Encoding of Plain Text In-Reply-To: <20170113174724.4839b668@JRWUBU2> References: <20170109222414.72f83204@JRWUBU2> <20170110194013.0476f15f@JRWUBU2> <20170112184242.1507f3a8@JRWUBU2> <20170113090232.536a0d12@JRWUBU2> <20170113174724.4839b668@JRWUBU2> Message-ID: On 1/13/2017 9:47 AM, Richard Wordingham wrote: > On Fri, 13 Jan 2017 01:34:48 -0800 > Asmus Freytag wrote: > >> I believe that any attempt to define a "regex" that describes *all >> legal text* in a given script is a-priori doomed to failure. >> >> Part of the problem is that writing systems work not unlike human >> grammars in a curious mixture of pretty firm rules coupled to lists >> of exceptions. (Many texts by competent authors will contain >> "ungrammatical" sentences that somehow work despite or because of not >> following the standard rules). The Khmer issue that started the >> discussion showed that there can a be a single word that needs to be >> handled exceptionally. > It's a single word in the *current* orthography for the Khmer language > in Cambodia. According to Michel Antelme, on pp20-1 of "Inventaire > provisoire des caract?res et divers signes des ?critures khm?res > pr?-modernes et modernes employ?s pour la notation du khmer, du > siamois, des dialectes tha?s m?ridionaux, du sanskrit et du p?li" > (http://aefek.free.fr/iso_album/antelme_bis.pdf), this manner > of writing was much commoner until it was largely eliminated by a > spelling reform in the first half of the 20th century. This points to another interesting issue. 
A number of languages have seen orthographic reforms that affect the use of complex scripts. Now then, a decision: do you support both the old and the new style in the same rule-set? If vestiges remain in general use, you may not have a choice, but, what if the rules for old and new (or for different languages in the same script) actually conflict? > The Thai > Wikipedia page on the use of the script for Thai > (https://th.wikipedia.org/wiki/???????????) gives examples for final > consonants with COENG VO (???? = ????), COENG NO (???? = ????) and > COENG NGO (????? = ????). In the case that I cited, that combination of language/script was taken as out of scope for other reasons; now, for general text, are there situations where you'd want separate sets of rules for each language? > >> If you try to capture all the exceptions in the general rules, the >> set of rules gets complicated, but is also likely to be way too >> permissive to be useful. > If it is checking for proper use of code points, overgeneration is far > preferable to undergeneration. Agreed. For modeling general text you don't want to actually exclude anything that can occur. However, what can you exclude? If you think of spell-checking as a scenario, overgeneration is not acceptable. Instead, you have a standard dictionary that deals with "general vocabulary" and there's a well defined mechanism to allow the user to add "exceptions". My point is that you cannot design a ruleset without having a very well-defined use-case. If you divide the rule sets into "building blocks" then it may be easier to address different use cases than if you simply provide a "maximally permissive" set of rules. I'm skeptical that a one size fits all sets of rules can be devised and be useful. For rules that strongly err on the side of overgeneration, it might make more sense to simply define the few contexts that are deemed impermissible and set the rest to "anything goes". 
> >> The Khmer LGR for the Root Zone, for example, deliberately disallows >> the exception (in the word for "give") so that it can be stated (a) >> more compactly and (b) does not allow the exceptional sequencing of >> certain characters to become applicable outside the single exception. >> >> An LGR is concerned with *single* instances of each word. Even the >> most common word in a language can only be registered once in each >> zone. > A label does not have to be a single word. For example, there are > several, if not many, domain names matching give*.com, where the first > element is clearly the word 'give'. Correct, but each compound can still occur only once. I cite this example only because the local body that drafted the rules decided that there was a reasonable tradeoff (complexity vs. generality) for the purpose of top level domain names (i.e. ".give*" not "give*.com"). For that application, complexity has a relatively high negative weight associated with it, and complete coverage, while desirable, is not given the same high positive weight that it would have in describing ordinary text. >> Even if the BNFs did nothing more than capture succinctly the >> information presented in text and tables, they would be useful. >> For scripts where things like ZWJ and CGJ are optional, it doesn't >> make sense to run them into the standard BNF - that just messes >> things up. It is much more useful to provide generic context >> information of how to add them to existing text. >> For example, the CGJ is really intended to go between letters. So, >> describe that context. (Forgot to make clear that this was a bit of a hypothetical) > It can be quite useful next to combining marks. For example, it may be > used to distinguish a diaeresis from an umlaut mark in Fraktur. 
Even if it is intended to go anywhere, even between digits, symbols and punctuation, it's much easier to describe that behavior separately rather than trying to insert it in every location in every regex. What I'm thinking is a description that gives a "skeleton word" and then you state, that this skeleton can be decorated (or whatever your preferred term) by inserting a CGJ anywhere. The same goes for ZWJ /ZWNJ for any script where they don't have a recognized specific effect in particular sequences. From charupdate at orange.fr Fri Jan 13 18:19:30 2017 From: charupdate at orange.fr (Marcel Schneider) Date: Sat, 14 Jan 2017 01:19:30 +0100 (CET) Subject: Superscript and Subscript Characters in General Use In-Reply-To: <20170111083212.476f492e@JRWUBU2> References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com> <2039959835.345.1483595818937.JavaMail.www@wwinf1k39> <104200337.5858.1483616029614.JavaMail.www@wwinf1k39> <538089927.246.1483681334517.JavaMail.www@wwinf1p15> <828252581.19365.1483729345941.JavaMail.www@wwinf1p15> <1112132817.50.1483847284449.JavaMail.www@wwinf1p17> <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17> <652345901.17960.1483909510323.JavaMail.www@wwinf1p19> <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19> <572373065.27540.1483997980333.JavaMail.www@wwinf1p25> <1469162895.239.1484114453124.JavaMail.www@wwinf1h11> <20170111083212.476f492e@JRWUBU2> Message-ID: <2133262187.20678.1484353170497.JavaMail.www@wwinf1p26> On Wed, 11 Jan 2017 08:32:12 +0000, Richard Wordingham wrote: [?] > The truly straight Unicode approach in HTML is to use 19⁄45. > Just entering those 5 characters into a text entry box in Firefox gave > me a properly formatted vulgar fraction. That is how vulgar fractions > are supposed to work. Unfortunately, one may need to avoid 'exciting > new fonts' in favour of those with a large, working repertoire. A new ?Fraction Slash and Fonts? 
thread in the BÉPO community has brought up that this works mainly with new and ambitious fonts:

• Carlito
• Fira Sans
• Linux Biolinum
• Linux Libertine
• Roboto
• Source Sans Pro
• Source Serif Pro
• Ubuntu

By contrast, the typefaces not supporting U+2044 correctly include:

- FreeSans
- FreeSerif
- Open Sans
- DejaVu
- Droid
- Liberation
- TeX Gyre

BTW, the Times New Roman font that the Mailing List Archives specify belongs to this latter category, so that fractions with U+2044 and normal-size digits display in fallback mode.

Software support is mainly found in open projects, as we have seen:

• HarfBuzz, and its users:
• LibreOffice
• Firefox
• Chrome

In the meantime, Microsoft products not supporting U+2044 correctly include:

- DirectWrite
- Internet Explorer, including its latest version, 11

• Does anybody know why Microsoft is reluctant to support U+2044?

• And why, on the other hand, is the widespread and popular way of writing fractions ???as U+2044 sequences discouraged and even ridiculed?
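For readers comparing the two notations discussed in this thread, here is a minimal Python sketch of both: ordinary digits around U+2044 (the form that relies on font and shaping support) and the preformatted workaround with superscript and subscript digits. The function names and translation tables are made up for illustration and are not taken from any library.

```python
FRACTION_SLASH = "\u2044"  # U+2044 FRACTION SLASH

# Illustrative digit maps: 0-9 to superscript digits (U+2070, U+00B9, U+00B2,
# U+00B3, U+2074..U+2079) and to subscript digits (U+2080..U+2089).
SUPERSCRIPTS = str.maketrans(
    "0123456789",
    "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079",
)
SUBSCRIPTS = str.maketrans(
    "0123456789",
    "\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089",
)


def plain_fraction(numerator: int, denominator: int) -> str:
    """Ordinary digits around U+2044; shaping is left to the font and renderer."""
    return f"{numerator}{FRACTION_SLASH}{denominator}"


def preformatted_fraction(numerator: int, denominator: int) -> str:
    """Workaround: superscript numerator, U+2044, subscript denominator."""
    return (
        str(numerator).translate(SUPERSCRIPTS)
        + FRACTION_SLASH
        + str(denominator).translate(SUBSCRIPTS)
    )


print(plain_fraction(19, 45))         # five characters: 1 9 U+2044 4 5
print(preformatted_fraction(11, 12))  # superscript 11, U+2044, subscript 12
```

The first form depends on the font and shaping engine, as the lists above show; the second looks the same everywhere the glyphs exist, but no longer contains plain digits.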
Regards,
Marcel

From charupdate at orange.fr Fri Jan 13 19:18:01 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Sat, 14 Jan 2017 02:18:01 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <20170112150141.GG14923@macbook>
References: <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17> <652345901.17960.1483909510323.JavaMail.www@wwinf1p19> <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19> <572373065.27540.1483997980333.JavaMail.www@wwinf1p25> <1469162895.239.1484114453124.JavaMail.www@wwinf1h11> <20170111083212.476f492e@JRWUBU2> <93666313-a432-c3b0-44dc-6edd43c19eda@it.aoyama.ac.jp> <20170112063524.GF14923@macbook> <1634058514.13007.1484230938837.JavaMail.www@wwinf1p25> <20170112150141.GG14923@macbook>
Message-ID: <850127614.20750.1484356681023.JavaMail.www@wwinf1p26>

On Thu, 12 Jan 2017 17:01:41 +0200, Khaled Hosny wrote:
>
> LibreOffice indeed did not use HarfBuzz on Windows before 5.3, which is
> not released yet.

Is the integration of HarfBuzz limited to free software? And what might be the reason for the delayed integration of HarfBuzz into the Windows version of LibreOffice?

While the last-cited “Fraction slash U+2044 and fonts” thread on the BÉPO community (Ergodis association) mailing list reviewed mainly free fonts, I find on my netbook on Windows 7 the following fonts that work correctly:

• Calibri
• Calibri Light
• Cambria
• Cambria Math
• Candara
• Consolas
• Constantia
• Corbel
• Gabriola
• Palatino Linotype

besides:

• Source Code Pro
• Source Sans Pro

• Is there any evidence of ongoing efforts to update Times New Roman?

I believe that an outdated typeface, as Times New Roman currently is, is inappropriate for the Unicode Mail Archives.
Regards, Marcel From richard.wordingham at ntlworld.com Fri Jan 13 19:18:09 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 14 Jan 2017 01:18:09 +0000 Subject: Specification of Encoding of Plain Text In-Reply-To: References: <20170109222414.72f83204@JRWUBU2> <20170110194013.0476f15f@JRWUBU2> <20170112184242.1507f3a8@JRWUBU2> <20170113090232.536a0d12@JRWUBU2> <20170113174724.4839b668@JRWUBU2> Message-ID: <20170114011809.78042ba4@JRWUBU2> On Fri, 13 Jan 2017 10:27:35 -0800 Asmus Freytag wrote: > This points to another interesting issue. A number of languages have > seen orthographic reforms that affect the use of complex scripts. > Now then, a decision: do you support both the old and the new style > in the same rule-set? If vestiges remain in general use, you may not > have a choice, but, what if the rules for old and new (or for > different languages in the same script) actually conflict? What we have seen in Khmer is a change that almost prohibits CVC orthographic clusters. (I don't count nikahits, visargas or fragments of vowels as C.) However, that is a rule of the language; it does not need to be a rule of the script. I am not sure that the old and new rules should conflict. We are presumably taking about a change made before the script was soundly encoded; it seems unreasonable that renderers should suddenly refuse to handle text that was previously valid. Now, I can think of a potential problem with Northern Thai ??????? 'all'. It is a single, chained orthographic syllable. This appears to be contrary to Tai Kh?n grammar, and is not clear to me how a modern Tai Kh?n font should render it. (It's also contrary to USE, but so is most of the language.) The problem is that U+1A58 is a final, spacing mark in Tai Kh?n, while further east it is a repha-like mark - it corresponds to kinzi in Burmese. 
The solution I anticipate is that it must be rendered as a non-spacing mark even in Tai Kh?n when it cannot be interpreted as a spacing mark. Has anyone handled this issue? My intended solution will allow a common sequence of code points for both the old style (U+1A58 as kinzi), the intermediate Northern Thai styles, and the new style (U+1A58 as a final consonant). > In the case that I cited, that combination of language/script was > taken as out of scope for other reasons; now, for general text, are > there situations where you'd want separate sets of rules for each > language? For determining which language a text might belong to, different rules would be appropriate. However, for deciding whether to render text, that seems ridiculous. Converting renderable multilingual text to plain text would make it unrenderable, which is surely undesirable. Having said that, there do appear to be potential problems in the Lanna script arising from interactions of spelling and layout style. In some styles, the consonant (and vowel) stack turns right at a certain depth, and therefore can reasonably contain more items that a strictly vertical stack. As both styles appear in material published in Chiang Mai, I'd be loath to declare different validity rules. I'd rather treat any problems as the surfacing of a renderer limitation. Richard. From mark at macchiato.com Sat Jan 14 04:56:10 2017 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Sat, 14 Jan 2017 11:56:10 +0100 Subject: Specification of Encoding of Plain Text In-Reply-To: <20170113181921.16967374@JRWUBU2> References: <20170109222414.72f83204@JRWUBU2> <20170110194013.0476f15f@JRWUBU2> <20170112184242.1507f3a8@JRWUBU2> <20170112212602.00511354@JRWUBU2> <20170113181921.16967374@JRWUBU2> Message-ID: Mark On Fri, Jan 13, 2017 at 7:19 PM, Richard Wordingham < richard.wordingham at ntlworld.com> wrote: > On Fri, 13 Jan 2017 10:38:30 +0100 > Mark Davis ?? 
wrote: > > > On Thu, Jan 12, 2017 at 10:26 PM, Richard Wordingham < > > richard.wordingham at ntlworld.com> wrote: > > > > Using Script_Extensions to document the international > > > combining characters that are used, for example, with Thai bases > > > could have all sorts of undesirable knock-on effects. > > > If you know of combining marks whose scx values should include Thai, > > please let us know. > > If you refer to the end of TUS 9.0 Section 16.1 you will find mention > of U+0331 COMBINING MACRON BELOW and U+0303 COMBINING TILDE, which are > thus candidates for scx ? Latn. One might also consider U+0359 > COMBINING ASTERISK BELOW; I have seen the combination ?? CHARACTER CHO CHANG, U+0359> used in a phonetic symbol for English, > representing [?]. > > As their scx values are 'Inherited', should their values not be treated > as though they already included Thai? I suppose, though, that they > do not in fact match "p(scx=Thai)". There does seem to be a view that > scx=inherited is shorthand for some list of European scripts. > ?The distinction between sc=inherited and sc=common is an unfortunate one, a remnant from when we first added the script data. The distinction for a character C is purely derivable from whether gc(C) ? [[:mn:][:me:]] or not, so it is of little value ? and with the advantage of hindsight, mostly just gets in the way. scx=inherited is *not* a shorthand for some list of European scripts. Rather, C ? [ [: scx=inherited:] ?[: scx=inherited:] ?]? means that either 1. we don't have enough information about usage to be able to list the scripts that C is used with, or 2. C can be used with so many scripts that it is not particularly productive to list them all. > Richard. > > -------------- next part -------------- An HTML attachment was scrubbed... 
From charupdate at orange.fr Sun Jan 15 10:46:13 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Sun, 15 Jan 2017 17:46:13 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <850127614.20750.1484356681023.JavaMail.www@wwinf1p26>
References: <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17> <652345901.17960.1483909510323.JavaMail.www@wwinf1p19> <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19> <572373065.27540.1483997980333.JavaMail.www@wwinf1p25> <1469162895.239.1484114453124.JavaMail.www@wwinf1h11> <20170111083212.476f492e@JRWUBU2> <93666313-a432-c3b0-44dc-6edd43c19eda@it.aoyama.ac.jp> <20170112063524.GF14923@macbook> <1634058514.13007.1484230938837.JavaMail.www@wwinf1p25> <20170112150141.GG14923@macbook> <850127614.20750.1484356681023.JavaMail.www@wwinf1p26>
Message-ID: <295090107.9846.1484498773837.JavaMail.www@wwinf1p26>

On Sat, 14 Jan 2017 02:18:01 +0100 (CET), I wrote:
>
> I believe that an outdated typeface, as is actually Times New Roman, is
> inappropriate for the Unicode Mail Archives.

I’ve been kindly informed off-list that the Archives don’t specify any font, and are viewed in the default font customizable in the browser preferences.

I apologize for this statement of mine.
Regards,
Marcel

From asmusf at ix.netcom.com Sun Jan 15 12:15:33 2017
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Sun, 15 Jan 2017 10:15:33 -0800
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <295090107.9846.1484498773837.JavaMail.www@wwinf1p26>
References: <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17> <652345901.17960.1483909510323.JavaMail.www@wwinf1p19> <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19> <572373065.27540.1483997980333.JavaMail.www@wwinf1p25> <1469162895.239.1484114453124.JavaMail.www@wwinf1h11> <20170111083212.476f492e@JRWUBU2> <93666313-a432-c3b0-44dc-6edd43c19eda@it.aoyama.ac.jp> <20170112063524.GF14923@macbook> <1634058514.13007.1484230938837.JavaMail.www@wwinf1p25> <20170112150141.GG14923@macbook> <850127614.20750.1484356681023.JavaMail.www@wwinf1p26> <295090107.9846.1484498773837.JavaMail.www@wwinf1p26>
Message-ID: <6535a19d-f880-6e0a-0b18-167d1ea6890f@ix.netcom.com>

An HTML attachment was scrubbed...

From christoph.paeper at crissov.de Mon Jan 16 19:02:02 2017
From: christoph.paeper at crissov.de (Christoph Päper)
Date: Tue, 17 Jan 2017 02:02:02 +0100
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <1651851807.24444.1484255052680.JavaMail.www@wwinf1p25>
References: <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17> <652345901.17960.1483909510323.JavaMail.www@wwinf1p19> <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19> <572373065.27540.1483997980333.JavaMail.www@wwinf1p25> <1469162895.239.1484114453124.JavaMail.www@wwinf1h11> <20170111083212.476f492e@JRWUBU2> <93666313-a432-c3b0-44dc-6edd43c19eda@it.aoyama.ac.jp> <20170112063524.GF14923@macbook> <1634058514.13007.1484230938837.JavaMail.www@wwinf1p25> <20170112150141.GG14923@macbook> <1978028751.17843.1484240496199.JavaMail.www@wwinf1p25> <1651851807.24444.1484255052680.JavaMail.www@wwinf1p25>
Message-ID:

Marcel Schneider :
>
> What typically happens with the correct use of fraction slash
on a collaborative
> website like Wikipedia, is that the superscripts and subscripts are restored,

JFTR, has been using the fraction slash for many years, but (still) pairs it with HTML/CSS super- and subscripts.

From charupdate at orange.fr  Tue Jan 17 02:25:46 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Tue, 17 Jan 2017 09:25:46 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <6535a19d-f880-6e0a-0b18-167d1ea6890f@ix.netcom.com>
References: <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17> <652345901.17960.1483909510323.JavaMail.www@wwinf1p19> <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19> <572373065.27540.1483997980333.JavaMail.www@wwinf1p25> <1469162895.239.1484114453124.JavaMail.www@wwinf1h11> <20170111083212.476f492e@JRWUBU2> <93666313-a432-c3b0-44dc-6edd43c19eda@it.aoyama.ac.jp> <20170112063524.GF14923@macbook> <1634058514.13007.1484230938837.JavaMail.www@wwinf1p25> <20170112150141.GG14923@macbook> <850127614.20750.1484356681023.JavaMail.www@wwinf1p26> <295090107.9846.1484498773837.JavaMail.www@wwinf1p26> <6535a19d-f880-6e0a-0b18-167d1ea6890f@ix.netcom.com>
Message-ID: <2048160763.2450.1484641546911.JavaMail.www@wwinf1p10>

I'm aware that this thread is getting lengthy and (supposedly) tiresome. Therefore, I wouldn't have sent this to the List today. I really wanted to take a break and come back later. However, given the consequences that the outcome of this issue has for millions of end-users, and the imminence of the French keyboard standardization in the coming months, I appreciate being given the opportunity to keep discussing on-list.

***Disclaimer*** I'm not part of the French keyboard standard WG, and I'm speaking on my own behalf, out of civic responsibility.
On Sun, 15 Jan 2017 10:15:33 -0800, Asmus Freytag wrote:
> [Quoted mail]
>
> Contrary to your assertion about fonts elsewhere, the poor rendering of
> subscripts/superscripts that I reported to you is based not on the fact that
> the characters are missing, but on the fact that the glyphs are not laid out as
> running text.

To date, as far as I know, the only domains where superscripts and subscripts are mandatory in general text are abbreviations of numerals, titles, entities, measurement units, chemical compounds and so on, using Western Arabic digits and Latin superscript lowercase. I'm quite sure that no other script has this typographical convention, which is part of an old discipline called "orthotypography." While I was wrong to conflate it with orthography, the outstanding importance of these rules for the unambiguous representation of text calls for special treatment in practice and in the Unicode Standard.

In these ranges, one character is still missing because the UTC has refused to encode *LATIN SUPERSCRIPT SMALL LETTER Q, aka *MODIFIER LETTER SMALL Q. This has little impact on general practice.

The main challenge outside Unicode is the availability of the related glyphs in current fonts, as well as their consistency. To date, almost all webmail services offer only fonts where these glyphs are designed in an intentionally inconsistent way, supposedly to make them unusable for accurate display: The '?' is always far too high, and the '?' is too bold and has random vertical alignment. In my opinion, the legacy status of these two is used as a fake explanation; compare with the inconsistent design of '?' and '?' in some fonts, along with that of '?', where there is no excuse of "legacy," unlike for '?', '?' and '?', where "legacy" is equally abused to mess up the typefaces.

This applies as well to most other fonts. The only correct font family I've found so far is Calibri. Fittingly, it is the body font in the default template of Microsoft Word.
>
> When viewing with monospaced fonts, the separation between glyphs
> corresponds to the spacing of the full-size characters. When using
> formatting (styling) the superscripted text is in a smaller font size,
> reducing the spacing between characters, so that strings of them look
> like ordinary text again and not s p a c e d o u t.

I'm facing this issue when writing drafts in my text editor, where, however, I'm able to set the font to any value, including Calibri. Displaying this in Calibri makes it possible to appreciate the consistent, running-text-like display of the superscripts:

// This is ????????????????????????????????????????????????????????????????????????????????????
// This is the range: ???????????????? ^q_unavailable?????????????????????????????

This is how a complete and Unicode-conformant typeface is supposed to work. In practice, this turns out to be implemented far more widely than U+2044.

>
> I'm not going to spend much more time on this discussion.

When I launched this discussion on December 28, 2016, I naively believed that this time the matter would be quickly settled, and that I could go on being more productive in developing the keyboard layouts and documenting them. Now that this thread still hasn't come to an even halfway useful result, I need to make one more attempt.

The goal is to get Unicode to accept the fact that people use superscript letters in French, and super/sub scripts in vulgar fractions, and have them on their keyboards, and that these people are not to be considered hackers, but as making a reasonable, thoughtful and responsible use of the Standard. That is not a matter of "value inversion," but of correcting a particular design principle that was misguided and biased under a (hypothetically) strong influence of *extrinsic* factors from the beginning (see point 3 below).

It's good to know about the counter-arguments that can be raised, so I'm grateful to all who were so kind as to respond.
What bothers me is that there is still so much persistent opposition; and what makes me fear the worst is that the arguments raised against the general use of preformatted characters are so biased and fallacious, unlike the reasoning one would expect in ordinary circumstances:

1) Missing font support as an argument against the use of a character has never,
   never been the way Unicode worked, as far as I've been given the opportunity
   to understand anything of Unicode until now.

2) This missing font support is mostly a consequence of the Unicode strategy on
   these characters: By discouraging their use and even intentionally misnaming
   them in a manner that is inconsistent overall, Unicode drove
   a significant part of the font designers away from adding them completely and
   with a consistent design, and from implementing combining mark support for
   these characters.

3) This strategy has been biased from the beginning, as it goes against the user
   preferences of countries using the Latin script, while AFAIK all countries
   using other scripts are unaffected because they actually don't *use*
   superscripting in such an *extensive* way. Please correct me if I'm wrong.
   Consequently, there would be *nobody* asking for more (except the already
   discussed completion of some ranges of Latin script). This strategy of shooing
   users (and their developers) away from using preformatted letters and digits
   seems to aim at nothing more serious than supporting software vendors' marketing
   strategies, even though the software doesn't need marketing based on poor
   character support (and poor keyboard layouts).

> Using code points
> "against the grain" that is, in contradiction to the way their use was
> intended when they were encoded means that you are going to run into many
> issues based on font vendors and implementers expectations on how users
> would follow the conventions suggested in the Unicode Standard.
I've learned that Edge still doesn't make OpenType fonts work for U+2044 either. One might wonder, however, whether users should conform to the Standard literally when even Microsoft doesn't.

I'm not here to post feature requests to the attention of Microsoft any longer. My actual suggestions are perhaps a bit more complex than that. I just wish that the Unicode policy with respect to superscripts would become more user-centered, more user-friendly. The core issue is the use of these letters in running text in some languages that need them to apply a typographic convention that is close to orthography. Superscripting is a far stronger requirement than all other formatting conventions, as it can affect the spelling of the grammatical entity.

We're facing strong user demand, relayed by standards bodies since the early days, when ordinal indicators were first encoded as a part of Latin-1. Today most users still type a degree sign to emulate a superscript o, and the French NB (which I'm not part of, nor am I a member of the keyboard standard WG) wishes to have an ordinal indicator on the keyboard to represent the most common ordinal indicator in French: "?".

>
> Your discussion of the support of the fraction slash (with regular digits)
> across fonts is potentially more useful -- bringing attention to this issue
> could bring font vendors to perhaps update earlier fonts to support the
> correct conventions for 2044 (which incidentally post dated the design of
> many popular fonts).

This is relatively important, but it is far outweighed by the ordinal indicator issue, and along with it, the need to stabilize superscript abbreviations.

>
> In other words, there's no need to "fix" the character encoding, but much
> need to make sure that what's in the character encoding (and its associated
> conventions) is actually supported as intended.

Additionally, I now suggest adding an informative alias to each one of the (intentionally) misnamed characters. This "MODIFIER LETTER"
disguise of the true *LATIN SUPERSCRIPT LETTERs seems to me a twisted trick to make unsuspecting people believe that this is something for insiders that is completely useless to everyone else.

The truth happens to show up wherever the editorial committee (as well as anybody else) can afford to feel free to write in their own, unbiased language:

[I'm highlighting with uppercase]

@		Latin superscript modifier letters
@+		See also SUPERSCRIPT LATIN LETTERS in the Spacing Modifier Letters block starting at 02B0.
1D2C	MODIFIER LETTER CAPITAL A
...

I think that the "MODIFIER LETTER" labeling of these characters is not straightforward enough for a standard that claims that the character names are mere identifiers. This is an example of how the identifiers were (ab)used as descriptors, to carry prescriptions and corporate preferences on how to use or not to use the repertoire.

When I'm back writing up some keyboard documentation, I really would like to be able to deliver a better image of Unicode (and of Microsoft) than that one. Please help me improve my communication, and make Unicode a user-centered standard.

Below are the proposed additions, which I'd like to submit for your kind review before posting them via the Contact Form.

Regards,

Marcel

NamesList snippets with additional informative aliases providing straightforward character identifiers, and some comment lines:
(Original file: http://www.unicode.org/Public/UCD/latest/ucd/NamesList.txt )

@@	02B0	Spacing Modifier Letters	02FF
@+		Superscript and subscript letters were not intended to replace markup, but they are for use where super/sub scripting is important in plain text, or formatting is inappropriate.
@		Latin superscript modifier letters
@+		"modifier letter small" stands for "latin superscript small letter", and "modifier letter small capital" for "latin letter small capital".
	x (superscript latin small letter i - 2071)
	x (superscript latin small letter n - 207F)
02B0	MODIFIER LETTER SMALL H
	= latin superscript small letter h
	* aspiration
	# <super> 0068
02B1	MODIFIER LETTER SMALL H WITH HOOK
	= latin superscript small letter h with hook
	* breathy voiced, murmured
	x (latin small letter h with hook - 0266)
	x (combining diaeresis below - 0324)
	# <super> 0266
02B2	MODIFIER LETTER SMALL J
	= latin superscript small letter j
	* palatalization
	x (combining palatalized hook below - 0321)
	# <super> 006A
02B3	MODIFIER LETTER SMALL R
	= latin superscript small letter r
	# <super> 0072
02B4	MODIFIER LETTER SMALL TURNED R
	= latin superscript small letter turned r
	x (latin small letter turned r - 0279)
	# <super> 0279
02B5	MODIFIER LETTER SMALL TURNED R WITH HOOK
	= latin superscript small letter turned r with hook
	x (latin small letter turned r with hook - 027B)
	# <super> 027B
02B6	MODIFIER LETTER SMALL CAPITAL INVERTED R
	= latin letter small capital inverted r
	* preceding four used for r-coloring or r-offglides
	x (latin letter small capital inverted r - 0281)
	# <super> 0281
02B7	MODIFIER LETTER SMALL W
	= latin superscript small letter w
	* labialization
	x (combining inverted double arch below - 032B)
	# <super> 0077
02B8	MODIFIER LETTER SMALL Y
	= latin superscript small letter y
	* palatalization
	* common Americanist usage for 02B2
	# <super> 0079
[...]
@		Additions based on 1989 IPA
02DE	MODIFIER LETTER RHOTIC HOOK
	* rhotacization in vowel
	* often ligated: 025A = 0259 + 02DE; 025D = 025C + 02DE
02DF	MODIFIER LETTER CROSS ACCENT
	* Swedish grave accent
02E0	MODIFIER LETTER SMALL GAMMA
	= latin superscript small letter gamma
	* these modifier letters are occasionally used in transcription of affricates
	# <super> 0263
02E1	MODIFIER LETTER SMALL L
	= latin superscript small letter l
	# <super> 006C
02E2	MODIFIER LETTER SMALL S
	= latin superscript small letter s
	# <super> 0073
02E3	MODIFIER LETTER SMALL X
	= latin superscript small letter x
	# <super> 0078
02E4	MODIFIER LETTER SMALL REVERSED GLOTTAL STOP
	= latin superscript letter reversed glottal stop
	# <super> 0295
[...]
@		Latin superscript modifier letters
@+		See also superscript Latin letters in the Spacing Modifier Letters block starting at 02B0.
1D2C	MODIFIER LETTER CAPITAL A
	= latin superscript capital letter a
	# <super> 0041
1D2D	MODIFIER LETTER CAPITAL AE
	= latin superscript capital letter ae
	# <super> 00C6
1D2E	MODIFIER LETTER CAPITAL B
	= latin superscript capital letter b
	# <super> 0042
1D2F	MODIFIER LETTER CAPITAL BARRED B
	= latin superscript capital letter barred b
1D30	MODIFIER LETTER CAPITAL D
	= latin superscript capital letter d
	# <super> 0044
1D31	MODIFIER LETTER CAPITAL E
	= latin superscript capital letter e
	# <super> 0045
1D32	MODIFIER LETTER CAPITAL REVERSED E
	= latin superscript capital letter reversed e
	# <super> 018E
1D33	MODIFIER LETTER CAPITAL G
	= latin superscript capital letter g
	# <super> 0047
1D34	MODIFIER LETTER CAPITAL H
	= latin superscript capital letter h
	# <super> 0048
1D35	MODIFIER LETTER CAPITAL I
	= latin superscript capital letter i
	# <super> 0049
1D36	MODIFIER LETTER CAPITAL J
	= latin superscript capital letter j
	# <super> 004A
1D37	MODIFIER LETTER CAPITAL K
	= latin superscript capital letter k
	# <super> 004B
1D38	MODIFIER LETTER CAPITAL L
	= latin superscript capital letter l
	# <super> 004C
1D39	MODIFIER LETTER CAPITAL M
	= latin superscript capital letter m
	# <super> 004D
1D3A	MODIFIER LETTER CAPITAL N
	= latin superscript capital letter n
	# <super> 004E
1D3B	MODIFIER LETTER CAPITAL REVERSED N
	= latin superscript capital letter reversed n
1D3C	MODIFIER LETTER CAPITAL O
	= latin superscript capital letter o
	# <super> 004F
1D3D	MODIFIER LETTER CAPITAL OU
	= latin superscript capital letter ou
	# <super> 0222
1D3E	MODIFIER LETTER CAPITAL P
	= latin superscript capital letter p
	# <super> 0050
1D3F	MODIFIER LETTER CAPITAL R
	= latin superscript capital letter r
	# <super> 0052
1D40	MODIFIER LETTER CAPITAL T
	= latin superscript capital letter t
	# <super> 0054
1D41	MODIFIER LETTER CAPITAL U
	= latin superscript capital letter u
	# <super> 0055
1D42	MODIFIER LETTER CAPITAL W
	= latin superscript capital letter w
	# <super> 0057
1D43	MODIFIER LETTER SMALL A
	= latin superscript small letter a
	# <super> 0061
1D44	MODIFIER LETTER SMALL TURNED A
	= latin superscript small letter turned a
	# <super> 0250
1D45	MODIFIER LETTER SMALL ALPHA
	= latin superscript small letter alpha
	# <super> 0251
1D46	MODIFIER LETTER SMALL TURNED AE
	= latin superscript small letter turned ae
	# <super> 1D02
1D47	MODIFIER LETTER SMALL B
	= latin superscript small letter b
	# <super> 0062
1D48	MODIFIER LETTER SMALL D
	= latin superscript small letter d
	# <super> 0064
1D49	MODIFIER LETTER SMALL E
	= latin superscript small letter e
	# <super> 0065
1D4A	MODIFIER LETTER SMALL SCHWA
	= latin superscript small letter schwa
	# <super> 0259
1D4B	MODIFIER LETTER SMALL OPEN E
	= latin superscript small letter open e
	# <super> 025B
1D4C	MODIFIER LETTER SMALL TURNED OPEN E
	= latin superscript small letter turned open e
	* more appropriate equivalence would be to 1D08
	# <super> 025C
1D4D	MODIFIER LETTER SMALL G
	= latin superscript small letter g
	# <super> 0067
1D4E	MODIFIER LETTER SMALL TURNED I
	= latin superscript small letter turned i
1D4F	MODIFIER LETTER SMALL K
	= latin superscript small letter k
	# <super> 006B
1D50	MODIFIER LETTER SMALL M
	= latin superscript small letter m
	# <super> 006D
1D51	MODIFIER LETTER SMALL ENG
	= latin superscript small letter eng
	# <super> 014B
1D52	MODIFIER LETTER SMALL O
	= latin superscript small letter o
	# <super> 006F
1D53	MODIFIER LETTER SMALL OPEN O
	= latin superscript small letter open o
	# <super> 0254
1D54	MODIFIER LETTER SMALL TOP HALF O
	= latin superscript small letter top half o
	# <super> 1D16
1D55	MODIFIER LETTER SMALL BOTTOM HALF O
	= latin superscript small letter bottom half o
	# <super> 1D17
1D56	MODIFIER LETTER SMALL P
	= latin superscript small letter p
	# <super> 0070
1D57	MODIFIER LETTER SMALL T
	= latin superscript small letter t
	# <super> 0074
1D58	MODIFIER LETTER SMALL U
	= latin superscript small letter u
	# <super> 0075
1D59	MODIFIER LETTER SMALL SIDEWAYS U
	= latin superscript small letter sideways u
	# <super> 1D1D
1D5A	MODIFIER LETTER SMALL TURNED M
	= latin superscript small letter turned m
	# <super> 026F
1D5B	MODIFIER LETTER SMALL V
	= latin superscript small letter v
	# <super> 0076
1D5C	MODIFIER LETTER SMALL AIN
// (a misnomer also as it should be MODIFIER LETTER AIN; cf. 1D25 LATIN LETTER AIN, A724 LATIN CAPITAL LETTER EGYPTOLOGICAL AIN, A725 LATIN SMALL LETTER EGYPTOLOGICAL AIN)
	= latin superscript letter ain
	# <super> 1D25
@		Greek superscript modifier letters
1D5D	MODIFIER LETTER SMALL BETA
	= greek superscript small letter beta
	# <super> 03B2
1D5E	MODIFIER LETTER SMALL GREEK GAMMA
	= greek superscript small letter gamma
	# <super> 03B3
1D5F	MODIFIER LETTER SMALL DELTA
// (a misnomer also as it should be MODIFIER LETTER SMALL GREEK DELTA, cf. 1E9F LATIN SMALL LETTER DELTA)
	= greek superscript small letter delta
	# <super> 03B4
1D60	MODIFIER LETTER SMALL GREEK PHI
	= greek superscript small letter phi
	# <super> 03C6
1D61	MODIFIER LETTER SMALL CHI
	= greek superscript small letter chi
	# <super> 03C7
@		Latin subscript modifier letters
1D62	LATIN SUBSCRIPT SMALL LETTER I
	# <sub> 0069
1D63	LATIN SUBSCRIPT SMALL LETTER R
	# <sub> 0072
1D64	LATIN SUBSCRIPT SMALL LETTER U
	# <sub> 0075
1D65	LATIN SUBSCRIPT SMALL LETTER V
	# <sub> 0076
@		Greek subscript modifier letters
1D66	GREEK SUBSCRIPT SMALL LETTER BETA
	# <sub> 03B2
1D67	GREEK SUBSCRIPT SMALL LETTER GAMMA
	# <sub> 03B3
1D68	GREEK SUBSCRIPT SMALL LETTER RHO
	# <sub> 03C1
1D69	GREEK SUBSCRIPT SMALL LETTER PHI
	# <sub> 03C6
1D6A	GREEK SUBSCRIPT SMALL LETTER CHI
	# <sub> 03C7
[...]
@		Modifier letters
@+		Other modifier letters can be found in the Spacing Modifier Letters, Phonetic Extensions, as well as Superscripts and Subscripts blocks.
1D9B	MODIFIER LETTER SMALL TURNED ALPHA
	= latin superscript small letter turned alpha
	# <super> 0252
1D9C	MODIFIER LETTER SMALL C
	= latin superscript small letter c
	# <super> 0063
1D9D	MODIFIER LETTER SMALL C WITH CURL
	= latin superscript small letter c with curl
	# <super> 0255
1D9E	MODIFIER LETTER SMALL ETH
	= latin superscript small letter eth
	# <super> 00F0
1D9F	MODIFIER LETTER SMALL REVERSED OPEN E
	= latin superscript small letter reversed open e
	# <super> 025C
1DA0	MODIFIER LETTER SMALL F
	= latin superscript small letter f
	# <super> 0066
1DA1	MODIFIER LETTER SMALL DOTLESS J WITH STROKE
	= latin superscript small letter dotless j with stroke
	# <super> 025F
1DA2	MODIFIER LETTER SMALL SCRIPT G
	= latin superscript small letter script g
	# <super> 0261
1DA3	MODIFIER LETTER SMALL TURNED H
	= latin superscript small letter turned h
	# <super> 0265
1DA4	MODIFIER LETTER SMALL I WITH STROKE
	= latin superscript small letter i with stroke
	# <super> 0268
1DA5	MODIFIER LETTER SMALL IOTA
	= latin superscript small letter iota
	# <super> 0269
1DA6	MODIFIER LETTER SMALL CAPITAL I
	= latin letter small capital i
	* not for use in UPA
	x (modifier letter capital i - 1D35)
	# <super> 026A
1DA7	MODIFIER LETTER SMALL CAPITAL I WITH STROKE
	= latin letter small capital i with stroke
	# <super> 1D7B
1DA8	MODIFIER LETTER SMALL J WITH CROSSED-TAIL
	= latin superscript small letter j with crossed-tail
	# <super> 029D
1DA9	MODIFIER LETTER SMALL L WITH RETROFLEX HOOK
	= latin superscript small letter l with retroflex hook
	# <super> 026D
1DAA	MODIFIER LETTER SMALL L WITH PALATAL HOOK
	= latin superscript small letter l with palatal hook
	# <super> 1D85
1DAB	MODIFIER LETTER SMALL CAPITAL L
	= latin letter small capital l
	* not for use in UPA
	x (modifier letter capital l - 1D38)
	# <super> 029F
1DAC	MODIFIER LETTER SMALL M WITH HOOK
	= latin superscript small letter m with hook
	# <super> 0271
1DAD	MODIFIER LETTER SMALL TURNED M WITH LONG LEG
	= latin superscript small letter turned m with long leg
	# <super> 0270
1DAE	MODIFIER LETTER SMALL N WITH LEFT HOOK
	= latin superscript small letter n with left hook
	# <super> 0272
1DAF	MODIFIER LETTER SMALL N WITH RETROFLEX HOOK
	= latin superscript small letter n with retroflex hook
	# <super> 0273
1DB0	MODIFIER LETTER SMALL CAPITAL N
	= latin letter small capital n
	* not for use in UPA
	x (modifier letter capital n - 1D3A)
	# <super> 0274
1DB1	MODIFIER LETTER SMALL BARRED O
	= latin superscript small letter barred o
	# <super> 0275
1DB2	MODIFIER LETTER SMALL PHI
	= latin superscript small letter phi
	# <super> 0278
1DB3	MODIFIER LETTER SMALL S WITH HOOK
	= latin superscript small letter s with hook
	# <super> 0282
1DB4	MODIFIER LETTER SMALL ESH
	= latin superscript small letter esh
	# <super> 0283
1DB5	MODIFIER LETTER SMALL T WITH PALATAL HOOK
	= latin superscript small letter t with palatal hook
	# <super> 01AB
1DB6	MODIFIER LETTER SMALL U BAR
	= latin superscript small letter u bar
	# <super> 0289
1DB7	MODIFIER LETTER SMALL UPSILON
	= latin superscript small letter upsilon
	# <super> 028A
1DB8	MODIFIER LETTER SMALL CAPITAL U
	= latin letter small capital u
	* not for use in UPA
	x (modifier letter capital u - 1D41)
	# <super> 1D1C
1DB9	MODIFIER LETTER SMALL V WITH HOOK
	= latin superscript small letter v with hook
	# <super> 028B
1DBA	MODIFIER LETTER SMALL TURNED V
	= latin superscript small letter turned v
	# <super> 028C
1DBB	MODIFIER LETTER SMALL Z
	= latin superscript small letter z
	# <super> 007A
1DBC	MODIFIER LETTER SMALL Z WITH RETROFLEX HOOK
	= latin superscript small letter z with retroflex hook
	# <super> 0290
1DBD	MODIFIER LETTER SMALL Z WITH CURL
	= latin superscript small letter z with curl
	# <super> 0291
1DBE	MODIFIER LETTER SMALL EZH
	= latin superscript small letter ezh
	# <super> 0292
1DBF	MODIFIER LETTER SMALL THETA
	= latin superscript small letter theta
	# <super> 03B8
[...]
@		Additions for Extended IPA
A7F8	MODIFIER LETTER CAPITAL H WITH STROKE
	= latin superscript capital letter h with stroke
	* faucalized
	# <super> 0126
A7F9	MODIFIER LETTER SMALL LIGATURE OE
	= latin superscript small ligature oe
	* labialized: open-rounded
	# <super> 0153
[...]
@		Modifier letters for German dialectology
AB5B	MODIFIER BREVE WITH INVERTED BREVE
	x (breve - 02D8)
	x (close up - 2050)
	x (metrical breve - 23D1)
AB5C	MODIFIER LETTER SMALL HENG
	= latin superscript small letter heng
	# <super> A727
AB5D	MODIFIER LETTER SMALL L WITH INVERTED LAZY S
	= latin superscript small letter l with inverted lazy s
	# <super> AB37
AB5E	MODIFIER LETTER SMALL L WITH MIDDLE TILDE
	= latin superscript small letter l with middle tilde
	# <super> 026B
AB5F	MODIFIER LETTER SMALL U WITH LEFT HOOK
	= latin superscript small letter u with left hook
	# <super> AB52

From richard.wordingham at ntlworld.com  Tue Jan 17 17:04:05 2017
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 17 Jan 2017 23:04:05 +0000
Subject: Misspelling or Miscoding?
Message-ID: <20170117230405.44f7601e@JRWUBU2>

When someone enters text with the code points in the wrong order but the text renders to give the appearance that should have been intended, has the typist misspelt, miscoded or what? I am talking about sequences that are *not* even compatibility equivalent to what should have been entered.

Richard.
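
[Editorial illustration, not part of the original message.] The distinction Richard draws between canonical equivalence, compatibility equivalence, and no equivalence at all can be checked mechanically. The following is a minimal sketch, assuming Python's standard `unicodedata` module; the helper name `equivalence` and the sample strings are hypothetical and chosen only for illustration.

```python
import unicodedata


def equivalence(a: str, b: str) -> str:
    """Classify how two strings relate under Unicode normalization."""
    if a == b:
        return "identical"
    if unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b):
        return "canonically equivalent"
    if unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b):
        return "compatibility equivalent"
    return "not equivalent"


# Hypothetical samples (assumptions, not taken from the thread):
results = {
    # e + acute (ccc 230) + dot below (ccc 220), typed in either order.
    "latin marks swapped": equivalence("e\u0301\u0323", "e\u0323\u0301"),
    # MODIFIER LETTER SMALL H versus plain h (a <super> decomposition).
    "modifier h vs h": equivalence("\u02b0", "h"),
    # Thai KO KAI with SARA I (ccc 0) and MAI EK (ccc 107) swapped.
    "thai marks swapped": equivalence("\u0e01\u0e34\u0e48", "\u0e01\u0e48\u0e34"),
}

for label, verdict in results.items():
    print(f"{label}: {verdict}")
```

The third sample is the kind of case the question is about: the two sequences differ only in mark order, yet neither normalization form maps one to the other.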
From charupdate at orange.fr  Wed Jan 18 05:31:44 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Wed, 18 Jan 2017 12:31:44 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To:
References: <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17> <652345901.17960.1483909510323.JavaMail.www@wwinf1p19> <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19> <572373065.27540.1483997980333.JavaMail.www@wwinf1p25> <1469162895.239.1484114453124.JavaMail.www@wwinf1h11> <20170111083212.476f492e@JRWUBU2> <93666313-a432-c3b0-44dc-6edd43c19eda@it.aoyama.ac.jp> <20170112063524.GF14923@macbook> <1634058514.13007.1484230938837.JavaMail.www@wwinf1p25> <20170112150141.GG14923@macbook> <1978028751.17843.1484240496199.JavaMail.www@wwinf1p25> <1651851807.24444.1484255052680.JavaMail.www@wwinf1p25>
Message-ID: <988413375.8331.1484739105259.JavaMail.www@wwinf1p07>

On Tue, 17 Jan 2017 02:02:02 +0100, Christoph Päper wrote:
>
> Marcel Schneider :
> >
> > What typically happens with the correct use of fraction slash on a collaborative
> > website like Wikipedia, is that the superscripts and subscripts are restored,
>
> JFTR, has been using the fraction
> slash for many years, but (still) pairs it with HTML/CSS super- and subscripts.

Thank you for drawing our attention to this. That has the potential to help in Unicode education, and to spread the word about the full nature of U+2044 as it is intended in the Standard and implemented in HarfBuzz.
Though what this template actually does is apply generic HTML super/sub scripting:

{{{1}}}{{{2}}}

Given that this displays worse than fractions hard-coded with Unicode super/sub scripts, I've added some CSS, but only in the French template, which is not locked:
https://fr.wikipedia.org/wiki/Mod%C3%A8le:Fraction
(BTW, when I came across it, it didn't use U+2044, so I imported the template from en-wiki, and then from de-wiki:
https://de.wikipedia.org/wiki/Vorlage:Bruch ),
and added some notes to the template documentation. I'll have to track the issue on the talk page of en-wiki:
https://en.wikipedia.org/wiki/Template_talk:Frac#Template-protected_edit_request_on_17_January_2017

You are welcome to add to it. Please feel free. Thanks.

Regards,

Marcel

From doug at ewellic.org  Wed Jan 18 10:44:34 2017
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 18 Jan 2017 09:44:34 -0700
Subject: Misspelling or Miscoding?
Message-ID: <20170118094434.665a7a7059d7ee80bb4d670165c8327d.3f543ef6a7.wbe@email03.godaddy.com>

Richard Wordingham wrote:

> When someone enters text with the code points in the wrong order but
> the text renders to give the appearance that should have been intended,
> has the typist misspelt, miscoded or what? I am talking about
> sequences that are *not* even compatibility equivalent to what should
> have been entered.

I'd say the person misspelled the word, or made a typographical error. The fact that the rendering software displayed it as if it were spelled correctly is immaterial.

--
Doug Ewell | Thornton, CO, US | ewellic.org

From richard.wordingham at ntlworld.com  Wed Jan 18 12:49:55 2017
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 18 Jan 2017 18:49:55 +0000
Subject: Misspelling or Miscoding?
In-Reply-To: <20170118094434.665a7a7059d7ee80bb4d670165c8327d.3f543ef6a7.wbe@email03.godaddy.com>
References: <20170118094434.665a7a7059d7ee80bb4d670165c8327d.3f543ef6a7.wbe@email03.godaddy.com>
Message-ID: <20170118184955.74cffc8f@JRWUBU2>

On Wed, 18 Jan 2017 09:44:34 -0700
"Doug Ewell" wrote:

> I'd say the person misspelled the word, or made a typographical error.
> The fact that the rendering software displayed it as if it were
> spelled correctly is immaterial.

If someone made such a mistake with typical English, I could accuse them of not reading what they typed. That line of defence is not available. One of the purposes of the dotted circles introduced in complex script processing is to warn the writer that such an error has been made. However, there are cases where homographic anagrams indicate different pronunciations.

I think it is not a 'typographical error' if it renders as it should!

Richard.

From doug at ewellic.org  Wed Jan 18 14:35:55 2017
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 18 Jan 2017 13:35:55 -0700
Subject: Misspelling or Miscoding?
Message-ID: <20170118133555.665a7a7059d7ee80bb4d670165c8327d.23bcccc767.wbe@email03.godaddy.com>

Richard Wordingham wrote:

> I think it is not a 'typographical error' if it renders as it should!

What if it renders correctly on some systems but not on others?

I do see your point, though. Writing systems that permit different spellings of the same glyph (cluster), only one of which is 'correct' even after normalization, can be tricky like this. I think this would still be a matter of 'misspelling' rather than 'miscoding' because a typist should not have to be concerned with character codes per se.

--
Doug Ewell | Thornton, CO, US | ewellic.org

From richard.wordingham at ntlworld.com  Wed Jan 18 19:12:50 2017
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 19 Jan 2017 01:12:50 +0000
Subject: Misspelling or Miscoding?
In-Reply-To: <20170118133555.665a7a7059d7ee80bb4d670165c8327d.23bcccc767.wbe@email03.godaddy.com>
References: <20170118133555.665a7a7059d7ee80bb4d670165c8327d.23bcccc767.wbe@email03.godaddy.com>
Message-ID: <20170119011250.01bd6a96@JRWUBU2>

On Wed, 18 Jan 2017 13:35:55 -0700
"Doug Ewell" wrote:

> Richard Wordingham wrote:
>
> > I think it is not a 'typographical error' if it renders as it
> > should!
>
> What if it renders correctly on some systems but not on others?

> I do see your point, though. Writing systems that permit different
> spellings of the same glyph (cluster), only one of which is 'correct'
> even after normalization, can be tricky like this. I think this would
> still be a matter of 'misspelling' rather than 'miscoding' because a
> typist should not have to be concerned with character codes per se.

As you've put it, it sounds like the way things were with a simple Thai typewriter. A vowel below, a vowel above and a tone mark could be typed in any order, as though they had three different non-zero combining classes. Thais were trained to type into computers by input routines only accepting the marks in the correct order - this was before the days of canonical combining classes.

In the case of greatest concern to me, there can be two different orders, but only one is appropriate for a given word. In most cases, only one word of that appearance exists, and one can usually guess which one does exist. (That is why the system works despite the occasional ambiguity.) It's not unlike how Thai would work had phonetic order been successfully insisted upon, except that there is no evidence that sorting should be by appearance, whereas in Thai as it was encoded before Unicode (and is now, after normalisation), encoding and sorting are based purely on appearance. (Well, officially - in practice, Thais appear to sort by doing syllable-by-syllable comparisons.)
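
[Editorial illustration, not part of the original message.] The canonical combining classes mentioned above can be inspected directly. This is a minimal sketch, assuming Python's standard `unicodedata` module; the three Thai marks and the base consonant KO KAI are sample choices made for illustration, not examples given in the thread. It shows that normalization unifies the two typing orders only when both marks have distinct non-zero classes.

```python
import unicodedata

# Sample Thai marks (illustrative choice) and their canonical combining classes.
MARKS = {
    "SARA I (vowel above)": "\u0e34",
    "SARA U (vowel below)": "\u0e38",
    "MAI EK (tone mark)": "\u0e48",
}
ccc = {label: unicodedata.combining(ch) for label, ch in MARKS.items()}

KO_KAI = "\u0e01"  # base consonant used for the samples


def nfc(text: str) -> str:
    """Return the NFC form, which applies canonical reordering of marks."""
    return unicodedata.normalize("NFC", text)


# Vowel below and tone mark both have non-zero classes, so either typing
# order normalizes to the same sequence.
below_pair_unified = nfc(KO_KAI + "\u0e48\u0e38") == nfc(KO_KAI + "\u0e38\u0e48")

# The vowel above has class 0, so it blocks reordering and the two typing
# orders remain distinct strings even after normalization.
above_pair_unified = nfc(KO_KAI + "\u0e48\u0e34") == nfc(KO_KAI + "\u0e34\u0e48")

for label, value in ccc.items():
    print(f"{label}: ccc={value}")
print("below/tone orders unified by NFC:", below_pair_unified)
print("above/tone orders unified by NFC:", above_pair_unified)
```

An input method that wants the typewriter-like freedom described above therefore cannot rely on normalization alone for marks whose combining class is zero; it would need its own reordering rule.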
In this case of concern, the range of renderings is occasionally different, which is another reason that two different encodings for the same appearance must be tolerated.

Richard.

From asmusf at ix.netcom.com  Thu Jan 19 01:24:21 2017
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Wed, 18 Jan 2017 23:24:21 -0800
Subject: Misspelling or Miscoding?
In-Reply-To: <20170118133555.665a7a7059d7ee80bb4d670165c8327d.23bcccc767.wbe@email03.godaddy.com>
References: <20170118133555.665a7a7059d7ee80bb4d670165c8327d.23bcccc767.wbe@email03.godaddy.com>
Message-ID: <1775052c-27eb-43d9-da26-c8715ab83720@ix.netcom.com>

On 1/18/2017 12:35 PM, Doug Ewell wrote:
> Richard Wordingham wrote:
>
>> I think it is not a 'typographical error' if it renders as it should!
> What if it renders correctly on some systems but not on others?
>
> I do see your point, though. Writing systems that permit different
> spellings of the same glyph (cluster), only one of which is 'correct'
> even after normalization, can be tricky like this. I think this would
> still be a matter of 'misspelling' rather than 'miscoding' because a
> typist should not have to be concerned with character codes per se.
>

The sequence of character codes isn't necessarily determined by the typist's choice of keystrokes. For example, autocorrection and similar support can result in a substitution of character codes.

For scripts with this issue, it would be useful if such mechanisms were more widespread; effectively normalizing to a preferred input order.

Arguing over whether this is called mistyping or miscoding or misspelling is perhaps less helpful than trying to get the word out that some scripts could strongly benefit from that additional software layer.

A./

From gwalla at gmail.com  Thu Jan 19 01:52:12 2017
From: gwalla at gmail.com (Garth Wallace)
Date: Wed, 18 Jan 2017 23:52:12 -0800
Subject: Misspelling or Miscoding?
In-Reply-To: <20170117230405.44f7601e@JRWUBU2> References: <20170117230405.44f7601e@JRWUBU2> Message-ID: On Tue, Jan 17, 2017 at 3:04 PM, Richard Wordingham < richard.wordingham at ntlworld.com> wrote: > When someone enters text with the code points in the wrong order but > the text renders to give the appearance that should have been intended, > has the typist misspelt, miscoded or what? I am talking about > sequences that are *not* even compatibility equivalent to what should > have been entered. > > Richard. > You mean like when a font puts combining diacritics over the following base character, and people type it in that order so it looks right on their screen? From mark at macchiato.com Thu Jan 19 01:52:05 2017 From: mark at macchiato.com (Mark Davis ☕️) Date: Thu, 19 Jan 2017 08:52:05 +0100 Subject: Misspelling or Miscoding? In-Reply-To: <20170119011250.01bd6a96@JRWUBU2> References: <20170118133555.665a7a7059d7ee80bb4d670165c8327d.23bcccc767.wbe@email03.godaddy.com> <20170119011250.01bd6a96@JRWUBU2> Message-ID: We don't have any set terminology for what you're talking about. We've often just used 'misspelling' in a broad sense, which can include visually confusable or identical glyphs. For example, spelling 'of' with an omicron would be one, as well as a word in a complex script with swapped marks. And cases of the former occur surprisingly often in web pages: probably to do with people switching keyboards in mid-stride. They are on (say) a Greek keyboard, hit omicron and then the Greek character in the 'f' position, notice it is wrong, and backspace - but just over the character that 'looks' wrong - then type 'f'. The problem with using the term "miscoding" is that it is overloaded. It can be taken to refer to the character encoding level: for example, interpreting a string of UTF-8 bytes as Latin-1.
The sequence in question is a perfectly valid Unicode string, not - in that sense - miscoded. Mark From richard.wordingham at ntlworld.com Thu Jan 19 02:45:08 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 19 Jan 2017 08:45:08 +0000 Subject: Misspelling or Miscoding? In-Reply-To: <1775052c-27eb-43d9-da26-c8715ab83720@ix.netcom.com> References: <20170118133555.665a7a7059d7ee80bb4d670165c8327d.23bcccc767.wbe@email03.godaddy.com> <1775052c-27eb-43d9-da26-c8715ab83720@ix.netcom.com> Message-ID: <20170119084508.706f4774@JRWUBU2> On Wed, 18 Jan 2017 23:24:21 -0800 Asmus Freytag wrote: > The sequence of character codes isn't necessarily determined by the > typist's choice of keystrokes. Wow! ESP for input? > For example, autocorrection and similar support can result in a > substitution of character codes. For scripts with this issue, it > would be useful if such mechanisms were more widespread; effectively > normalizing to a preferred input order. That's not the problem I have in mind. Dotted circles can help, but for Northern Thai in the Lanna script, USE (the Universal Shaping Engine) has accidentally (I hope) banned 17% of the vocabulary and demanded that a further 37% be misspelt. It will be much the same for Tai Khuen. Once USE is fixed, the problem is that the encodings of */hi:m/ and /mi:/ may be different but render identically; it so happens that words like the former are rare. Are you aware of predictive input causing havoc with intellectual content? > Arguing over whether this is called mistyping or miscoding or > misspelling is perhaps less helpful than trying to get the word out > that some scripts could strongly benefit from that additional > software layer. Enabling that may require some tools to update to Unicode 5.1. (Hunspell, I'm looking at you.)
One thing that would be helpful is some way of showing the difference between distinctly encoded homographs if a spell-checker can help. (I fear it may not be quite the right tool - different suggestion logic is needed.) Coloured fonts may help once support for them has spread, but we're probably still looking at bespoke tools to switch such hints on and off. In the past I've used transliteration fonts to check what I've actually typed. One problem with getting the message out is choosing the right words. That's why I came here for advice on the terminology for such issues. Richard. From richard.wordingham at ntlworld.com Thu Jan 19 03:05:29 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 19 Jan 2017 09:05:29 +0000 Subject: Misspelling or Miscoding? In-Reply-To: References: <20170117230405.44f7601e@JRWUBU2> Message-ID: <20170119090529.045d2f86@JRWUBU2> On Wed, 18 Jan 2017 23:52:12 -0800 Garth Wallace wrote: > On Tue, Jan 17, 2017 at 3:04 PM, Richard Wordingham < > richard.wordingham at ntlworld.com> wrote: > > > When someone enters text with the code points in the wrong order but > > the text renders to give the appearance that should have been > > intended, has the typist misspelt, miscoded or what? I am talking > > about sequences that are *not* even compatibility equivalent to > > what should have been entered. > > > > Richard. > > > > You mean like when a font puts combining diacritics over the > following base character, and people type it in that order so it > looks right on their screen? No. I particularly have in mind Tai Tham script pairs like ???? /h??m/ (MFL p831) and ???? /m??/ (MFL p793), which look identical with a competent renderer but are sorted differently. The page references are to a Northern Thai to Thai dictionary. ?Richard. From asmusf at ix.netcom.com Thu Jan 19 16:25:14 2017 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Thu, 19 Jan 2017 14:25:14 -0800 Subject: Misspelling or Miscoding? 
In-Reply-To: <20170119084508.706f4774@JRWUBU2> References: <20170118133555.665a7a7059d7ee80bb4d670165c8327d.23bcccc767.wbe@email03.godaddy.com> <1775052c-27eb-43d9-da26-c8715ab83720@ix.netcom.com> <20170119084508.706f4774@JRWUBU2> Message-ID: OK, I was first thinking you had in mind something more like the ordering of (e.g. Lao?) tone marks that normally do not render exactly the same, but close, and where some font/rendering engine could go and make them identical in an effort to be helpful. In those cases one can presume a preferred ordering, and, in principle, that can be imposed upon a text, whether via autocorrect or spell check. Now I'm thinking your focus was more on cases like the two Khmer subjoined consonant sequences: U+17D2 U+178A ្ដ KHMER CONSONANT SIGN COENG DA U+17D2 U+178F ្ត KHMER CONSONANT SIGN COENG TA that apparently have identical appearance, even though one is a 'd' and the other a 't'. (That's the only example that I'm personally familiar with). Unless some fonts ever make a distinction, this seems to be a case where "miscoding" might be an appropriate term. As far as the user is concerned, the issue only arises because of the encoding scheme used. (A hypothetical different scheme that had one of these precomposed with a name containing something like DA OR TA would not have surfaced an invisible distinction). Are your examples likewise legitimate duplications or merely the case that one could type something else and have it look the same (accidentally)? The Khmer example would seem fairly resistant to automated correction if it is a free choice. If, instead, the immediately preceding consonant comes from two disjoint sets, for example if TA COENG TA was possible, but not TA COENG DA, then there's scope for spell check. In designing label generation rules for domain names, one clearly doesn't want two labels that cannot be distinguished other than on the encoding level.
For Khmer, the decision was to allow both, but not simultaneously: only one member of each minimal pair can be registered, and which one is decided by the order of application. A./ From khaledhosny at eglug.org Thu Jan 19 17:36:07 2017 From: khaledhosny at eglug.org (Khaled Hosny) Date: Fri, 20 Jan 2017 01:36:07 +0200 Subject: Superscript and Subscript Characters in General Use In-Reply-To: <850127614.20750.1484356681023.JavaMail.www@wwinf1p26> References: <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19> <572373065.27540.1483997980333.JavaMail.www@wwinf1p25> <1469162895.239.1484114453124.JavaMail.www@wwinf1h11> <20170111083212.476f492e@JRWUBU2> <93666313-a432-c3b0-44dc-6edd43c19eda@it.aoyama.ac.jp> <20170112063524.GF14923@macbook> <1634058514.13007.1484230938837.JavaMail.www@wwinf1p25> <20170112150141.GG14923@macbook> <850127614.20750.1484356681023.JavaMail.www@wwinf1p26> Message-ID: <20170119233607.GG30817@macbook> On Sat, Jan 14, 2017 at 02:18:01AM +0100, Marcel Schneider wrote: > On Thu, 12 Jan 2017 17:01:41 +0200, Khaled Hosny wrote: > > > > LibreOffice indeed did not use HarfBuzz on Windows before 5.3, which is > > not released yet. > > Is the integration of HarfBuzz limited to free software? HarfBuzz has a fairly liberal license, so in theory it can be used anywhere. > And what might be the reason of the delayed integration of HarfBuzz in the > Windows version of LibreOffice? Nothing specific. LibreOffice, OpenOffice.org before it, and most likely StarOffice before them just used whatever API the platform provides to do text layout, which is not an uncommon practice; it even seemed to be the best practice back then. The reasons it finally switched to HarfBuzz are outlined in: https://bugs.documentfoundation.org/show_bug.cgi?id=89870 Regards, Khaled From richard.wordingham at ntlworld.com Thu Jan 19 19:04:06 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 20 Jan 2017 01:04:06 +0000 Subject: Misspelling or Miscoding?
In-Reply-To: References: <20170118133555.665a7a7059d7ee80bb4d670165c8327d.23bcccc767.wbe@email03.godaddy.com> <1775052c-27eb-43d9-da26-c8715ab83720@ix.netcom.com> <20170119084508.706f4774@JRWUBU2> Message-ID: <20170120010406.03929f9e@JRWUBU2> On Thu, 19 Jan 2017 14:25:14 -0800 Asmus Freytag wrote: > Now I'm thinking your focus was more on cases the like two Khmer > subjoined consonant sequences: > U+17D2 U+178A ?? KHMER CONSONANT SIGN COENG DA > U+17D2 U+178F ?? KHMER CONSONANT SIGN COENG TA > that apparently have identical appearance, even though one is a 'd' > and the other a 't'. (That's the only example that I'm personally > familiar with). > Unless some fonts ever make a distinction, this seems to be a case > where "miscoding" might be an appropriate term. As far as the user is > concerned, the issue only arises because of the encoding scheme used. > (A hypothetical different scheme that had one of these precomposed > with a name containing something like DA OR TA would have not > surfaced an invisible distinction). Such a font might be KHOM2004 mentioned by Michel Antelme in his paper aefek.free.fr/iso_album/antelme_bis.pdf. On p25 he makes the point that a distinct COENG DA was still on its last legs in Cambodia in the 1920's; it's still distinct in the Khom variety of the script. This situation makes a good case for the Tibetan model. We might end up making the Khmer script a mixed system like Tai Tham by adding a character KHMER CONSONANT SIGN ARCHAIC COENG DA. There seem to be some Arabic script analogues, where only one or two forms differ between a pair of letters. This is not the situation I was interested in, but it's clearly related. > Are your examples likewise legitimate duplications or merely the case > that one could type something else and have it look the same > (accidentally). They're mostly legitimate duplications, though some may stretch phonological credulity. 
For example, in Tai Tham, is part of a common Pali verb inflection and is a valid Northern Thai word (apparently not a Pali loan, despite its spelling), but would probably be a miscoding of (an attested final syllable) if the language were Northern Thai. I suppose it's just conceivable that the former might be the name of a fruit, but I'm not aware of the syllabic nasal being written that way. A spell checker would pick up most such errors, though getting the underlying problem explained to the user might be difficult. > The Khmer example would seem fairly resistant to automated correction > if it is a free choice. If, instead, the immediately preceding > consonant comes from two disjoined sets, for example if TA COENG TA > was possible, but not TA COENG DA, then there's scope for spell check. It's supposed to be based on the phonetics, so a spell check could be used, but not a grammar rule. However, I can imagine someone writing in accordance with a rule restricting them to certain bases. Richard. From asmusf at ix.netcom.com Thu Jan 19 20:41:07 2017 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Thu, 19 Jan 2017 18:41:07 -0800 Subject: Misspelling or Miscoding? In-Reply-To: <20170120010406.03929f9e@JRWUBU2> References: <20170118133555.665a7a7059d7ee80bb4d670165c8327d.23bcccc767.wbe@email03.godaddy.com> <1775052c-27eb-43d9-da26-c8715ab83720@ix.netcom.com> <20170119084508.706f4774@JRWUBU2> <20170120010406.03929f9e@JRWUBU2> Message-ID: On 1/19/2017 5:04 PM, Richard Wordingham wrote: > On Thu, 19 Jan 2017 14:25:14 -0800 > Asmus Freytag wrote: > >> Now I'm thinking your focus was more on cases the like two Khmer >> subjoined consonant sequences: >> U+17D2 U+178A ?? KHMER CONSONANT SIGN COENG DA >> U+17D2 U+178F ?? KHMER CONSONANT SIGN COENG TA >> that apparently have identical appearance, even though one is a 'd' >> and the other a 't'. (That's the only example that I'm personally >> familiar with). 
>> Unless some fonts ever make a distinction, this seems to be a case >> where "miscoding" might be an appropriate term. As far as the user is >> concerned, the issue only arises because of the encoding scheme used. >> (A hypothetical different scheme that had one of these precomposed >> with a name containing something like DA OR TA would have not >> surfaced an invisible distinction). > Such a font might be KHOM2004 mentioned by Michel Antelme in his paper > aefek.free.fr/iso_album/antelme_bis.pdf. On p25 he makes the point > that a distinct COENG DA was still on its last legs in Cambodia in the > 1920's; it's still distinct in the Khom variety of the script. This > situation makes a good case for the Tibetan model. We might end up > making the Khmer script a mixed system like Tai Tham by adding a > character KHMER CONSONANT SIGN ARCHAIC COENG DA. > > There seem to be some Arabic script analogues, where only one or two > forms differ between a pair of letters. Yes, and these are treated similarly to the Khmer case in label generation rulesets for domain names. > > This is not the situation I was interested in, but it's clearly related. Funny thing is, not actually knowing Khmer, I hadn't thought of the COENG DA as a "form of DA", but had considered the sequence its own entity. In Latin you have two characters that look like a reversed e but have different upper cases, so that they have a distinct encoding. (You could argue that picking the wrong member of a disunified set is a miscoding, but I think "misspelling" works fine -- in another context we limit the term "misspelling" to phono-something or typo/grapho-something *possible* spellings, and try to not restrict them for that purpose. The "impossible" ones are ones that we expect some font or renderer to not support on the basis that they are not needed, and those we do restrict; I wouldn't use the name "miscoding" for those, just "invalid" does nicely for us in that context).
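The Khmer pair discussed in this thread can be checked mechanically. The following is a minimal Python sketch, assuming only the standard `unicodedata` module. It shows that no normalization form maps COENG DA onto COENG TA, so an "allow both, but not simultaneously" policy for domain labels has to be carried by an explicit variant mapping. The `fold` helper and the `Registry` class are hypothetical names invented for this illustration; they are not taken from any actual label generation ruleset or registry software, and the choice of KHMER LETTER NO as the base is arbitrary.

```python
import unicodedata

COENG = "\u17d2"   # KHMER SIGN COENG
DA = "\u178a"      # KHMER LETTER DA
TA = "\u178f"      # KHMER LETTER TA
NO = "\u1793"      # KHMER LETTER NO (arbitrary base for the example)

coeng_da = COENG + DA
coeng_ta = COENG + TA

# None of the four normalization forms unifies the two sequences.
distinct_everywhere = all(
    unicodedata.normalize(form, coeng_da) != unicodedata.normalize(form, coeng_ta)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
)


def fold(label):
    """Hypothetical variant key: treat COENG DA as COENG TA for comparison only."""
    return unicodedata.normalize("NFC", label).replace(coeng_da, coeng_ta)


class Registry:
    """Toy first-come registry in which a label blocks its look-alike variant."""

    def __init__(self):
        self._taken = {}  # variant key -> label actually registered

    def register(self, label):
        key = fold(label)
        if key in self._taken:
            return False  # an identical-looking label was registered first
        self._taken[key] = label
        return True


registry = Registry()
first = registry.register(NO + coeng_ta)   # accepted
second = registry.register(NO + coeng_da)  # refused: variant of the first
print(distinct_everywhere, first, second)
```

The stored label keeps its original spelling; only the comparison key is folded, which mirrors the idea that both spellings are legitimate but indistinguishable to the eye.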
The case where something (=member of or associated with an alphabet) is simply and fully identical in appearance in all contexts (and I regard script as a context) is fortunately quite rare in Unicode. Your examples may be the closest thing. > >> Are your examples likewise legitimate duplications or merely the case >> that one could type something else and have it look the same >> (accidentally). > They're mostly legitimate duplications, though some may stretch > phonological credulity. For example, in Tai Tham, SIGN I> is part of a common Pali verb inflection and HIGH TA> is a valid Northern Thai word (apparently not a Pali loan, > despite its spelling), but would probably > be a miscoding of (an attested final > syllable) if the language were Northern Thai. I suppose > it's just conceivable that the former might be the name of a fruit, but > I'm not aware of the syllabic nasal being written that way. > > A spell checker would pick up most such errors, though getting the > underlying problem explained to the user might be difficult. > >> The Khmer example would seem fairly resistant to automated correction >> if it is a free choice. If, instead, the immediately preceding >> consonant comes from two disjoined sets, for example if TA COENG TA >> was possible, but not TA COENG DA, then there's scope for spell check. > It's supposed to be based on the phonetics, so a spell check could be > used, but not a grammar rule. However, I can imagine someone writing > in accordance with a rule restricting them to certain bases. Your last sentence reads as if you might equally well meant "can't" instead of "can" (?) Having agreement in consonants or vowels across syllables or words isn't necessarily unheard of; spell checkers tend to go on the basis of existing lexical items, not necessarily purely productive rules. 
At least the ones I use for European languages have this annoying habit of not having a productive rule for compounds - even for languages that do allow arbitrary compound formation. Anyway, digressing from your point. A./ From charupdate at orange.fr Fri Jan 20 00:59:39 2017 From: charupdate at orange.fr (Marcel Schneider) Date: Fri, 20 Jan 2017 07:59:39 +0100 (CET) Subject: Superscript and Subscript Characters in General Use In-Reply-To: <20170119233607.GG30817@macbook> References: <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19> <572373065.27540.1483997980333.JavaMail.www@wwinf1p25> <1469162895.239.1484114453124.JavaMail.www@wwinf1h11> <20170111083212.476f492e@JRWUBU2> <93666313-a432-c3b0-44dc-6edd43c19eda@it.aoyama.ac.jp> <20170112063524.GF14923@macbook> <1634058514.13007.1484230938837.JavaMail.www@wwinf1p25> <20170112150141.GG14923@macbook> <850127614.20750.1484356681023.JavaMail.www@wwinf1p26> <20170119233607.GG30817@macbook> Message-ID: <905739506.715.1484895580643.JavaMail.www@wwinf1p07> On Fri, 20 Jan 2017 01:36:07 +0200, Khaled Hosny wrote: > > On Sat, Jan 14, 2017 at 02:18:01AM +0100, Marcel Schneider wrote: > > On Thu, 12 Jan 2017 17:01:41 +0200, Khaled Hosny wrote: > > > > > > LibreOffice indeed did not use HarfBuzz on Windows before 5.3, which is > > > not released yet. > > > > Is the integration of HarfBuzz limited to free software? > > HarfBuzz has a fairly liberal license, so in theory it can be used in > anywhere. > > > And what might be the reason of the delayed integration of HarfBuzz in the > > Windows version of LibreOffice? > > Nothing specific, LibreOffice and OpenOffice.org before it and most like > StarOffice before them just used what API the platform provides to do > text layout, which is not an uncommon practice, it even seemed to be the > best practice back in time. 
The reasons it finally switched to HarfBuzz > are outlined in: > > https://bugs.documentfoundation.org/show_bug.cgi?id=89870 Thank you for the great job you are doing for cross-platform text layout and Unicode implementation! Now we're just missing a good reason to give users for why Edge still doesn't support the text rendering specific to the Unicode fraction slash. Doubtlessly if it did, users would expect Word to do the same. Then if Word did, continuing this way would make it a clone of Publisher. Is that what we shall tell people when they wonder why the fraction slash (which may be in a prominent position on the keyboard, such as on Shift + AltGr/Option/0x10 + 7) doesn't work for them when they're on Word? Thus we'll end up recommending LibreOffice throughout. IMO that's fair. (Though we'll have to get the NNBSP displayed. Quite easy, but deliberately discarded.) On the other hand, Microsoft's way of writing good-looking vulgar fractions seems to be with super/sub scripts. That could expand to use superscripts for ordinals and abbreviations, too. Is that what we're supposed to do? The answer has been "no!" But in a user-centered approach, we have to provide both and let the user choose what's the most appropriate for the actual task. I think that not doing so is to overstate the separation between text encoding and typography, which has been questioned anyway. [1] Regards, Marcel [1] http://www.cairn.info/article.php?ID_REVUE=DN&ID_NUMPUBLIE=DN_063&ID_ARTICLE=DN_063_0089&FRM=B From richard.wordingham at ntlworld.com Fri Jan 20 02:37:01 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 20 Jan 2017 08:37:01 +0000 Subject: Misspelling or Miscoding?
In-Reply-To: References: <20170118133555.665a7a7059d7ee80bb4d670165c8327d.23bcccc767.wbe@email03.godaddy.com> <1775052c-27eb-43d9-da26-c8715ab83720@ix.netcom.com> <20170119084508.706f4774@JRWUBU2> <20170120010406.03929f9e@JRWUBU2> Message-ID: <20170120083701.56d86075@JRWUBU2> On Thu, 19 Jan 2017 18:41:07 -0800 Asmus Freytag wrote: > On 1/19/2017 5:04 PM, Richard Wordingham wrote: > > On Thu, 19 Jan 2017 14:25:14 -0800 > > Asmus Freytag wrote: > >> The Khmer example would seem fairly resistant to automated > >> correction if it is a free choice. If, instead, the immediately > >> preceding consonant comes from two disjoined sets, for example if > >> TA COENG TA was possible, but not TA COENG DA, then there's scope > >> for spell check. > > It's supposed to be based on the phonetics, so a spell check could > > be used, but not a grammar rule. However, I can imagine someone > > writing in accordance with a rule restricting them to certain > > bases. > Your last sentence reads as if you might equally well meant "can't" > instead of "can" (?) I meant 'can'. According to Huffman's 'Cambodian System of Writing', initial TA is to be read as /d/ in compounds formed by infixes. (The spelling may have changed since then.) Suffixed to ? NNO (which is in the retroflex series), the subscript is to be read as /d/, while subscripted to ? NO, it is usually /t/ but occasionally /d/. I would be tempted to apply the Pali & Sanskrit rule of place agreement and use COENG DA below ? NNO and COENG TA below ? NO. I would expect similar agreement with ? DA and ? TA. Interestingly, such a discordance in the use of the nasals also occurs in Northern Thai; DA (= Indic DDA) may be written subscript to NA whereas the Indic place agreement rule would dictate NNA. This increases the visual ambiguity of subscripts on the ligature NAA - both /-n da?/ and /na?t/ occur, but there are no anagrammatic homographs in the dictionary. The example ?????? 
of /-n da?/ shows every sign of having been borrowed via Khmer. Richard. From marc at keyman.com Sat Jan 21 04:18:18 2017 From: marc at keyman.com (Marc Durdin) Date: Sat, 21 Jan 2017 17:18:18 +0700 Subject: Misspelling or Miscoding? In-Reply-To: <20170120083701.56d86075@JRWUBU2> References: <20170118133555.665a7a7059d7ee80bb4d670165c8327d.23bcccc767.wbe@email03.godaddy.com> <1775052c-27eb-43d9-da26-c8715ab83720@ix.netcom.com> <20170119084508.706f4774@JRWUBU2> <20170120010406.03929f9e@JRWUBU2> <20170120083701.56d86075@JRWUBU2> Message-ID: On 20 January 2017 at 15:37, Richard Wordingham < richard.wordingham at ntlworld.com> wrote: > On Thu, 19 Jan 2017 18:41:07 -0800 > Asmus Freytag wrote: > > > On 1/19/2017 5:04 PM, Richard Wordingham wrote: > > > On Thu, 19 Jan 2017 14:25:14 -0800 > > > Asmus Freytag wrote: > > > >> The Khmer example would seem fairly resistant to automated > > >> correction if it is a free choice. If, instead, the immediately > > >> preceding consonant comes from two disjoined sets, for example if > > >> TA COENG TA was possible, but not TA COENG DA, then there's scope > > >> for spell check. > > > It's supposed to be based on the phonetics, so a spell check could > > > be used, but not a grammar rule. However, I can imagine someone > > > writing in accordance with a rule restricting them to certain > > > bases. > > Your last sentence reads as if you might equally well meant "can't" > > instead of "can" (?) > > I meant 'can'. According to Huffman's 'Cambodian System of Writing', > initial TA is to be read as /d/ in compounds formed by infixes. (The > spelling may have changed since then.) Suffixed to ? NNO (which is in > the retroflex series), the subscript is to be read as /d/, while > subscripted to ? NO, it is usually /t/ but occasionally /d/. I would be > tempted to apply the Pali & Sanskrit rule of place agreement and > use COENG DA below ? NNO and COENG TA below ? NO. I would expect > similar agreement with ? DA and ? TA. 
> Khmer spelling is inconsistent enough that attempts to leverage this kind of rule are in my opinion of limited utility. This kind of knowledge is better embedded in dictionaries, where it is accessible to readers, than in an encoding, where it just introduces ambiguity and confusion for the average user. Presentation is identical in modern Khmer. From what I've observed, most Khmer users type the subscript which is most obvious to them, that is, COENG + TA, as the major form is visually similar. The online dictionaries I've consulted are somewhat inconsistent in their use of COENG DA/TA (and do not normalise searches). The rule regarding suffixing to ណ NNO seems consistent as far as I can tell, but suffixed to other letters, the pronunciation is less consistent. In my current Khmer language learning, my tutors have suggested that the pronunciation is inconsistent and that in some cases a word can be pronounced either way. Some examples of words using COENG DA/TA: ?????? /b?ndaal/ giving rise to; ?????? /b?nt?c/ a little; ???? /pd?y/ husband; ????? /kattaa/ agent, factor; ?????? /cendaa/ thought, thinking; ???????? /vi?c?ntaa/ or /vi?c?ndaa/ daydreaming; ???? /staa/ arrogantly. Marc From opokupongkyekyeku at yahoo.com Sat Jan 21 12:56:39 2017 From: opokupongkyekyeku at yahoo.com (Kyekyeku Opoku-Pong) Date: Sat, 21 Jan 2017 18:56:39 +0000 (UTC) Subject: Encoding West African Adinkra sysmbols In-Reply-To: References: <20170118133555.665a7a7059d7ee80bb4d670165c8327d.23bcccc767.wbe@email03.godaddy.com> <1775052c-27eb-43d9-da26-c8715ab83720@ix.netcom.com> <20170119084508.706f4774@JRWUBU2> <20170120010406.03929f9e@JRWUBU2> <20170120083701.56d86075@JRWUBU2> Message-ID: <1028605960.1637145.1485024999338@mail.yahoo.com> Hello, I hope this is the right forum to seek help. I am looking for the possibility of encoding Adinkra symbols used extensively in Ghana and West Africa.
There is information on Adinkra symbols and their meanings at: http://www.adinkra.org/htmls/adinkra_index.htm How do I go about the process of getting the symbols approved for Unicode? Thank you, Kyekyeku From everson at evertype.com Sun Jan 22 11:21:33 2017 From: everson at evertype.com (Michael Everson) Date: Sun, 22 Jan 2017 17:21:33 +0000 Subject: Encoding West African Adinkra sysmbols In-Reply-To: <1028605960.1637145.1485024999338@mail.yahoo.com> References: <20170118133555.665a7a7059d7ee80bb4d670165c8327d.23bcccc767.wbe@email03.godaddy.com> <1775052c-27eb-43d9-da26-c8715ab83720@ix.netcom.com> <20170119084508.706f4774@JRWUBU2> <20170120010406.03929f9e@JRWUBU2> <20170120083701.56d86075@JRWUBU2> <1028605960.1637145.1485024999338@mail.yahoo.com> Message-ID: <9265FB90-FF41-4CF4-B46D-CA0E5CC1A0E8@evertype.com> Are they used in plain text? How? > On 21 Jan 2017, at 18:56, Kyekyeku Opoku-Pong wrote: > > Hello, > I hope this is the right forum to seek help. > I am looking for the possibility of encoding Adinkra symbols used extensively in Ghana and West Africa. There is information on Adinkra symbols and their meanings at: http://www.adinkra.org/htmls/adinkra_index.htm > > How do I go about the process of getting the symbols approved for Unicode.
> > Thank you, > Kyekyeku From verdy_p at wanadoo.fr Sun Jan 22 11:42:40 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 22 Jan 2017 18:42:40 +0100 Subject: Encoding West African Adinkra sysmbols In-Reply-To: <9265FB90-FF41-4CF4-B46D-CA0E5CC1A0E8@evertype.com> References: <20170118133555.665a7a7059d7ee80bb4d670165c8327d.23bcccc767.wbe@email03.godaddy.com> <1775052c-27eb-43d9-da26-c8715ab83720@ix.netcom.com> <20170119084508.706f4774@JRWUBU2> <20170120010406.03929f9e@JRWUBU2> <20170120083701.56d86075@JRWUBU2> <1028605960.1637145.1485024999338@mail.yahoo.com> <9265FB90-FF41-4CF4-B46D-CA0E5CC1A0E8@evertype.com> Message-ID: I think the book listed on the web site is a perfect example: https://www.amazon.com/exec/obidos/ASIN/B000059TIM/adinkrasymbol-20 If it locally has cultural meaning, it should be encoded for allowing interchange also in encoded texts, rather than just artistic creations in architecture, or handwritten books and displays. That information site is interesting as it is currently collecting the usages (including photos) and meanings. 2017-01-22 18:21 GMT+01:00 Michael Everson : > Are they used in plain text? How? > > > On 21 Jan 2017, at 18:56, Kyekyeku Opoku-Pong < > opokupongkyekyeku at yahoo.com> wrote: > > > > Hello, > > I hope this is the right forum to seek help. > > I am looking for the possibility of encoding Adinkra symbols used > extensively in Ghana and West Africa. There is information on Adinkra > symbols and their meanings at: http://www.adinkra.org/htmls/ > adinkra_index.htm > > > > How do I go about the process of getting the symbols approved for > Unicode. > > > > Thank you, > > Kyekyeku > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From haberg-1 at telia.com Sun Jan 22 13:47:11 2017 From: haberg-1 at telia.com (=?utf-8?Q?Hans_=C3=85berg?=) Date: Sun, 22 Jan 2017 20:47:11 +0100 Subject: Encoding West African Adinkra sysmbols In-Reply-To: <9265FB90-FF41-4CF4-B46D-CA0E5CC1A0E8@evertype.com> References: <20170118133555.665a7a7059d7ee80bb4d670165c8327d.23bcccc767.wbe@email03.godaddy.com> <1775052c-27eb-43d9-da26-c8715ab83720@ix.netcom.com> <20170119084508.706f4774@JRWUBU2> <20170120010406.03929f9e@JRWUBU2> <20170120083701.56d86075@JRWUBU2> <1028605960.1637145.1485024999338@mail.yahoo.com> <9265FB90-FF41-4CF4-B46D-CA0E5CC1A0E8@evertype.com> Message-ID: <84A82C14-238E-494D-B2A3-0D9CC49A5927@telia.com> > On 22 Jan 2017, at 18:21, Michael Everson wrote: > > Are they used in plain text? How? On textiles and walls in a similar fashion as emoji, it seems [1]. Known since the beginning of the 19th century. 1. https://en.wikipedia.org/wiki/Adinkra_symbols -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Sun Jan 22 15:31:50 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 22 Jan 2017 22:31:50 +0100 Subject: Encoding West African Adinkra sysmbols In-Reply-To: <84A82C14-238E-494D-B2A3-0D9CC49A5927@telia.com> References: <20170118133555.665a7a7059d7ee80bb4d670165c8327d.23bcccc767.wbe@email03.godaddy.com> <1775052c-27eb-43d9-da26-c8715ab83720@ix.netcom.com> <20170119084508.706f4774@JRWUBU2> <20170120010406.03929f9e@JRWUBU2> <20170120083701.56d86075@JRWUBU2> <1028605960.1637145.1485024999338@mail.yahoo.com> <9265FB90-FF41-4CF4-B46D-CA0E5CC1A0E8@evertype.com> <84A82C14-238E-494D-B2A3-0D9CC49A5927@telia.com> Message-ID: I read that there are similar sets of symbols in Polynesian/Melanesian cultures. There are possibly others in native Amerindian cultures, often related to religious features, nature. 
These symbols in fact look very similar to the early stages of the modern alphabets we all know, just a step behind ideograms such as those used in Mayan, Han, Egyptian and proto-Indo-European scripts, or runes in Europe, or today's very active creation of emojis and of the many icons and logograms created everywhere by industry and by various standards bodies: they encode more than just a letter or an identifiable word, namely a concept or idea that could be "spelled" orally by various sentences in modern languages. Their properties would be complex to design due to their complex meanings, associations and usage rules.

2017-01-22 20:47 GMT+01:00 Hans Åberg :

> > On 22 Jan 2017, at 18:21, Michael Everson wrote:
> > Are they used in plain text? How?
>
> On textiles and walls in a similar fashion as emoji, it seems [1]. Known
> since the beginning of the 19th century.
>
> 1. https://en.wikipedia.org/wiki/Adinkra_symbols

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From duerst at it.aoyama.ac.jp Sun Jan 22 20:37:32 2017
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Mon, 23 Jan 2017 11:37:32 +0900
Subject: Encoding West African Adinkra sysmbols
In-Reply-To: <1028605960.1637145.1485024999338@mail.yahoo.com>
References: <20170118133555.665a7a7059d7ee80bb4d670165c8327d.23bcccc767.wbe@email03.godaddy.com> <1775052c-27eb-43d9-da26-c8715ab83720@ix.netcom.com> <20170119084508.706f4774@JRWUBU2> <20170120010406.03929f9e@JRWUBU2> <20170120083701.56d86075@JRWUBU2> <1028605960.1637145.1485024999338@mail.yahoo.com>
Message-ID:

On 2017/01/22 03:56, Kyekyeku Opoku-Pong wrote:

> Hello, I hope this is the right forum to seek help. I am looking for the possibility of encoding Adinkra symbols used extensively in Ghana and West Africa. There is information on Adinkra symbols and their meanings at: http://www.adinkra.org/htmls/adinkra_index.htm
> How do I go about the process of getting the symbols approved for Unicode.
> Thank you, Kyekyeku

The two main things would be:

1) Write a proposal
2) Document usage (this is part of the proposal, but it is important to show that these symbols are actually used in running texts, not e.g. just as decorations on walls,...)

Regards, Martin.

From opokupongkyekyeku at yahoo.com Sun Jan 22 20:44:06 2017
From: opokupongkyekyeku at yahoo.com (Kyekyeku Opoku-Pong)
Date: Mon, 23 Jan 2017 02:44:06 +0000 (UTC)
Subject: Encoding West African Adinkra sysmbols
In-Reply-To:
References: <20170118133555.665a7a7059d7ee80bb4d670165c8327d.23bcccc767.wbe@email03.godaddy.com> <1775052c-27eb-43d9-da26-c8715ab83720@ix.netcom.com> <20170119084508.706f4774@JRWUBU2> <20170120010406.03929f9e@JRWUBU2> <20170120083701.56d86075@JRWUBU2> <1028605960.1637145.1485024999338@mail.yahoo.com> <9265FB90-FF41-4CF4-B46D-CA0E5CC1A0E8@evertype.com> <84A82C14-238E-494D-B2A3-0D9CC49A5927@telia.com>
Message-ID: <278493006.2506185.1485139446746@mail.yahoo.com>

Thank you all for the responses to my email.

Adinkra symbols are unique symbols of cultural expressions and proverbs that have been used in wax printing, royal symbols, carvings and jewelry in West Africa. They originate in the Ivory Coast and Ghana, but the symbols are popular across West Africa and with Africans in the diaspora. Some of the popular Adinkra symbols are "Sankofa" and "Gye Nyame". This link gives a good list of the symbols and their meanings:
http://www.adinkra.org/htmls/adinkra_index.htm

Encoding the symbols would make it easy to use them as emojis and icons and in printing of all forms.

Thank you,
Kyekyeku

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From charupdate at orange.fr Mon Jan 23 03:30:17 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Mon, 23 Jan 2017 10:30:17 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use / Re: French Superscript Abbreviations Fit Plain Text Requirements
In-Reply-To: <72c27940-ca8e-f616-48d5-5ee58835b354@ix.netcom.com>
References: <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi> <20161227000522.7bb95f3e@JRWUBU2> <230214984.10599.1482854625210.JavaMail.www@wwinf1p10> <1877734343.9096.1482938748638.JavaMail.www@wwinf1p14> <72c27940-ca8e-f616-48d5-5ee58835b354@ix.netcom.com>
Message-ID: <1621294794.4399.1485163818350.JavaMail.www@wwinf1p26>

Happily, this thread is now reaching a far better and very useful result. A set of Unicode superscripts and subscripts turns out to be already promoted by Microsoft in a fully validated way. From there we can go on to promote the use of a set of Latin superscript letters. Relatedly, Microsoft's choice not to support the OpenType rendering behavior of U+2044 FRACTION SLASH (at least in a Latin-script context in Edge) turns out to be a fairly user-friendly, practice-oriented option. That also helps to avoid holding people's feet to the fire about U+2044.

On Wed, 28 Dec 2016 13:47:00 -0800, Asmus Freytag wrote:
[…]
>
> Mathematical notation is a good example of such a mixed case: while
> ordinary variables can be expressed in plain text with the help of
> mathematical alphabets, the proper display of formulas requires markup.
> Even Murray Sargent's plain text math is markup, albeit a very clever one
> that re-uses conventions used for the inline presentation of mathematical
> expression. (Where that is insufficient, it introduces additional
> conventions, clearly extraneous to the content, and hence markup).
> http://www.unicode.org/mail-arch/unicode-ml/y2016-m12/0119.html

Murray Sargent's Nearly Plain-Text Encoding of Mathematics (UnicodeMath) is in my opinion a key gateway to understanding Unicode, and has thus become a key point in my communication about Unicode-supporting keyboard layouts. See version 3.1:
http://www.unicode.org/notes/tn28/UTN28-PlainTextMath-v3.1.pdf

Thanks to Asmus Freytag for drawing our attention to it!

What makes this notation so important to the issue of this thread is that it uses Unicode superscripts and subscripts as a valid and parseable alternative to the [La]TeX-style notation that uses markup ('^' and '_'), "since Unicode has a full set of decimal subscripts and superscripts. As a practical matter, numeric subscripts are typically entered using an underscore and the number followed by a space or an operator" (p. 7).

These Unicode superscript and subscript characters are parseable and are converted to formatted digits when the expression is built up. Hence they are unambiguous, not random characters as is sometimes alleged. They "should be rendered the same way that scripts of the corresponding script nesting level would be rendered." (p. 18)

Although fractions are ordinarily written with ASCII digits and a slash, U+2044 can be used to get skewed fractions (p. 5) built up in Microsoft Word (where fractions can also be formatted using the math features).
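
A minimal sketch of the plain-text scheme discussed here, assuming Python: it writes the numerator with Unicode superscript digits and the denominator with Unicode subscript digits around U+2044 FRACTION SLASH. The helper name plain_text_fraction is hypothetical and is not taken from UnicodeMath or from any Microsoft product.

```python
import unicodedata

# Unicode superscript digits 0-9 (U+2070, U+00B9, U+00B2, U+00B3, U+2074..U+2079).
SUPERSCRIPTS = "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079"
# Unicode subscript digits 0-9 (U+2080..U+2089).
SUBSCRIPTS = "\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089"
FRACTION_SLASH = "\u2044"

TO_SUPER = str.maketrans("0123456789", SUPERSCRIPTS)
TO_SUB = str.maketrans("0123456789", SUBSCRIPTS)


def plain_text_fraction(numerator: int, denominator: int) -> str:
    """Hypothetical helper: render a fraction of non-negative integers
    as superscript digits + U+2044 + subscript digits."""
    return (
        str(numerator).translate(TO_SUPER)
        + FRACTION_SLASH
        + str(denominator).translate(TO_SUB)
    )


half = plain_text_fraction(1, 2)
print(half)
# Compatibility normalization folds the digits back to ASCII around U+2044.
print(unicodedata.normalize("NFKC", half))
```

The NFKC step is only there to show that the superscript and subscript digits carry compatibility mappings to the ASCII digits, which is one way a parser could recover the plain numbers.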
Combining both schemes, the user may feel free to write fractions using superscripts and subscripts around U+2044, as suggested in the already cited wiki, which proposes adding a huge autocorrect list for quick input:
https://answers.microsoft.com/en-us/msoffice/wiki/msoffice_word-mso_other/styled-fractions-in-windows/4a07d5fa-2484-4e39-b1f3-70bb3eb0c332

This is practice-oriented and user-friendly, because relying only on the OpenType font feature specified for U+2044 would dramatically restrict the number of usable fonts, which for Latin script traditionally runs to several thousand, as opposed to complex scripts, for which HarfBuzz is primarily intended and where the number of available typefaces is much smaller, so that full conversion to OpenType is feasible. So I think that the correct rendering of U+2044 in HarfBuzz mainly targets these complex scripts. In other scripts like Latin, the feature would then be a nice side effect that potentially raises user expectations about professional (typographical) ligature rendering.

At the other end, for drafts and even "for simple documentation purposes", "plain-text linearly formatted mathematical expressions can be used 'as is'" (p. 29). That can be extended to vulgar fractions in running text, and to abbreviations.

This helps to understand that any font with inconsistent glyphs for the Unicode subscript and superscript digits is not Unicode conformant. The same applies to superscript i and n (as mentioned in: http://www.unicode.org/mail-arch/unicode-ml/y2017-m01/0093.html ). These inconsistent fonts don't conform to the Unicode Standard, which specifies that there is no functional difference between those characters that have the word SUPERSCRIPT in their name and those that don't:

TUS 9.0, §7.8, p. 327:
| The superscript forms of the i and n letters can be found in the
| Superscripts and Subscripts block (U+2070..U+209F). The fact that the latter
| two letters contain the word "superscript" in their names instead of "modifier
| letter" is an historical artifact of original sources for the characters, and
| is not intended to convey a functional distinction in the use of these
| characters in the Unicode Standard.
http://www.unicode.org/versions/Unicode9.0.0/ch07.pdf#G24762

Moreover, the Code Charts contain comment lines for these two characters, connecting them to the set of Unicode superscript Latin letters named "MODIFIER LETTER":

2071 SUPERSCRIPT LATIN SMALL LETTER I
     * functions as a modifier letter
     # <super> 0069
[…]
207F SUPERSCRIPT LATIN SMALL LETTER N
     * functions as a modifier letter
     # <super> 006E

Accordingly, the user can count on a whole small alphabet (except q, which was rejected on the strength of what I consider invented allegations made on behalf of the UTC) displaying in a consistent way in all complete, conformant fonts, with a layout like that of running text insofar as the fonts have proportional advance widths. To run a test, see the example in: http://www.unicode.org/mail-arch/unicode-ml/y2017-m01/0093.html (again).

Trying to conclude so far (please feel free to correct), I now believe and will spread the word that, following Microsoft (a user-friendly corporation eager to help everybody make the most of Unicode), the users of any word processor and text editor are welcome to use the Unicode repertoire as they need and like, while on the other hand the recommendations in TUS may be considered a mere official discourse for encoding process management purposes, with little to no real impact on actual practice. Hence, National Bodies and user communities as well as developers may issue usage recommendations of their own, to meet user expectations and to propose working methods in addition to, or as alternatives to, those provided by the Standard.

Regards,
Marcel

From eric.muller at efele.net Tue Jan 24 00:43:56 2017
From: eric.muller at efele.net (Eric Muller)
Date: Mon, 23 Jan 2017 22:43:56 -0800
Subject: how would you state requirements involving sorting?
Message-ID: <3231824a-2f9b-a6a3-6e1b-23cea04e9aec@efele.net>

An HTML attachment was scrubbed...
URL:

From Shawn.Steele at microsoft.com Tue Jan 24 02:52:38 2017
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Tue, 24 Jan 2017 08:52:38 +0000
Subject: how would you state requirements involving sorting?
In-Reply-To: <3231824a-2f9b-a6a3-6e1b-23cea04e9aec@efele.net>
References: <3231824a-2f9b-a6a3-6e1b-23cea04e9aec@efele.net>
Message-ID:

That requirement will probably really annoy speakers of some languages.

-Shawn

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Eric Muller
Sent: Monday, January 23, 2017 10:44 PM
To: unicode at unicode.org
Subject: how would you state requirements involving sorting?

Suppose you help somebody write requirements for a piece of software and you see an item:

    Sorting. Diacritic marks need to be stripped when sorting titles

You know that sorting is a lot more complicated than removing diacritics, and that giving the directive above to a naive developer is going to lead to trouble. You know you want to end up with an implementation involving the UCA with a tailoring based on the locale. How would you suggest to reword the requirement?

Thanks,
Eric.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mark at macchiato.com Tue Jan 24 06:38:11 2017
From: mark at macchiato.com (Mark Davis ☕️)
Date: Tue, 24 Jan 2017 13:38:11 +0100
Subject: how would you state requirements involving sorting?
In-Reply-To: <3231824a-2f9b-a6a3-6e1b-23cea04e9aec@efele.net>
References: <3231824a-2f9b-a6a3-6e1b-23cea04e9aec@efele.net>
Message-ID:

Perhaps suggest something along the following lines.

    Sorting. Unicode-conformant collation (http://unicode.org/reports/tr10/) must be used when sorting titles. The collation must follow the user's locale, such as using ICU APIs (http://site.icu-project.org/).
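
As a minimal sketch of why the original wording is risky, the following Python snippet implements the naive "strip diacritics, then sort" reading using only the standard library. The helper name strip_diacritics and the sample titles are made up for illustration. It is not a UCA implementation; a real one would presumably rely on a collation library such as ICU, as suggested above.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Naive approach: decompose to NFD and drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical sample titles, chosen to expose the weak spots.
titles = ["Zebra", "Äpple", "Apple", "Østen"]

# The requirement as literally stated: strip marks, then compare code points.
naive_sorted = sorted(titles, key=lambda t: strip_diacritics(t).casefold())

for title in naive_sorted:
    print(title, "->", strip_diacritics(title))
```

With this key, Äpple and Apple compare as equal and keep their input order, and Ø is left untouched because it has no canonical decomposition into a base letter plus a combining mark. A Swedish reader, who expects Ä after Z, cannot be served by this key at all; that is the kind of locale-dependent ordering a tailored collation is meant to handle.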
Mark On Tue, Jan 24, 2017 at 7:43 AM, Eric Muller wrote: > Suppose you help somebody write requirements for a piece of software and > you see an item: > > Sorting. Diacritic marks need to be stripped when sorting titles > > > You know that sorting is a lot more complicated than removing diacritics, > and that giving the directive above to a naive developer is going to lead > to trouble. You know you want to end up with an implementation involving > the UCA with a tailoring based on the locale. How would you suggest to > reword the requirement? > > Thanks, > Eric. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From andrea.giammarchi at gmail.com Tue Jan 24 10:39:11 2017 From: andrea.giammarchi at gmail.com (Andrea Giammarchi) Date: Tue, 24 Jan 2017 16:39:11 +0000 Subject: Curly Lips Code Point Proposal Message-ID: I'd like to bring to your attention a request, about a common emoticon, that has apparently no equivalent yet in the Emoji standard. This was a PR to the Twemoji project: https://github.com/twitter/twemoji/issues/199 The author also created a proper PDF explaining all the reasons: Proposal for CURLY LIPS Emoji.pdf I hope this can be considered in the near future as possible extra face. Thanks in advance for any sort of outcome. Best Regards -------------- next part -------------- An HTML attachment was scrubbed... URL: From gwalla at gmail.com Tue Jan 24 11:05:19 2017 From: gwalla at gmail.com (Garth Wallace) Date: Tue, 24 Jan 2017 09:05:19 -0800 Subject: Curly Lips Code Point Proposal In-Reply-To: References: Message-ID: AIUI that's a "catlike face" smiley. "Homer eating a donut" is not what I would associate with it at all, IME it's usually used to express something on the order of "mischievous", "playful", or "acting cute". The closest kaomoji equivalent, I think, is (???) or (???). 
On Tue, Jan 24, 2017 at 8:39 AM, Andrea Giammarchi < andrea.giammarchi at gmail.com> wrote: > I'd like to bring to your attention a request, about a common emoticon, > that has apparently no equivalent yet in the Emoji standard. > > This was a PR to the Twemoji project: > https://github.com/twitter/twemoji/issues/199 > > The author also created a proper PDF explaining all the reasons: > Proposal for CURLY LIPS Emoji.pdf > > > I hope this can be considered in the near future as possible extra face. > > Thanks in advance for any sort of outcome. > > Best Regards > -------------- next part -------------- An HTML attachment was scrubbed... URL: From leoboiko at namakajiri.net Tue Jan 24 11:12:03 2017 From: leoboiko at namakajiri.net (Leonardo Boiko) Date: Tue, 24 Jan 2017 15:12:03 -0200 Subject: Curly Lips Code Point Proposal In-Reply-To: References: Message-ID: I find it curious that this community defines the ":3" emoji as "mmmm" or "om nom nom". In my circles it's quite the frequent emoticon/emoji, but I've never seen it used this way. Instead, they usually employ it as "cat mouth" or "cat face", implying the mood of cuteness, perkiness or mischievousness. (This is distinct from U+1F431 CAT FACE in that it represents a human making a cat-like mouth, not an actual cat.) Here are a few images found through a web search for "cat face": ? ? ? ? Here's the relevant TVTropes article: http://tvtropes.org/pmwiki/pmwiki.php/Main/CatSmile (TVTropes, incidentally, is one of the many web forums which has a :3 textual emoji.) And the KnowYourMeme page: http://knowyourmeme.com/memes/3-cat-face 2017-01-24 14:39 GMT-02:00 Andrea Giammarchi : > I'd like to bring to your attention a request, about a common emoticon, > that has apparently no equivalent yet in the Emoji standard. 
> > This was a PR to the Twemoji project: > https://github.com/twitter/twemoji/issues/199 > > The author also created a proper PDF explaining all the reasons: > Proposal for CURLY LIPS Emoji.pdf > > > I hope this can be considered in the near future as possible extra face. > > Thanks in advance for any sort of outcome. > > Best Regards > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: :3-0.jpg Type: image/jpeg Size: 29144 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: :3-1.jpg Type: image/jpeg Size: 15782 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: :3-3.jpg Type: image/jpeg Size: 15171 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: :3-5.jpg Type: image/jpeg Size: 5179 bytes Desc: not available URL: From gwalla at gmail.com Tue Jan 24 11:37:10 2017 From: gwalla at gmail.com (Garth Wallace) Date: Tue, 24 Jan 2017 09:37:10 -0800 Subject: Curly Lips Code Point Proposal In-Reply-To: References: Message-ID: On Tue, Jan 24, 2017 at 9:12 AM, Leonardo Boiko wrote: > I find it curious that this community defines the ":3" emoji as "mmmm" or > "om nom nom". In my circles it's quite the frequent emoticon/emoji, but > I've never seen it used this way. > I can kind of see how someone might get that impression. For example, someone writing "om nom nom :3", and somebody who is unfamiliar with the smiley's usage interpreting the meanings as linked, when the intent was originally to express "I'm being silly". > Instead, they usually employ it as "cat mouth" or "cat face", implying > the mood of cuteness, perkiness or mischievousness. (This is distinct from > U+1F431 CAT FACE in that it represents a human making a cat-like mouth, not > an actual cat.) 
Here are a few images found through a web search for "cat > face": > > ? > > > I think the expression from manga and anime is probably the origin of the smiley. That's consistent with the communities and contexts where it's found most often. And the "catlike expression" I think is meant to be more metaphorical, rather than a depiction of an actual facial expression. Like how ?? is not meant to be an actual throbbing vein in the forehead. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: :3-5.jpg Type: image/jpeg Size: 5179 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: :3-1.jpg Type: image/jpeg Size: 15782 bytes Desc: not available URL: From andrea.giammarchi at gmail.com Tue Jan 24 11:39:36 2017 From: andrea.giammarchi at gmail.com (Andrea Giammarchi) Date: Tue, 24 Jan 2017 17:39:36 +0000 Subject: Curly Lips Code Point Proposal In-Reply-To: References: Message-ID: I wouldn't stereotype "this community" already, as it's a single person request and maybe a single person common use case. However, I have seen mostly on Twitter the usage of :3 to indicate "engagement" in the sense of "interest", or "I'm digging it" but if there's a meaning widely recognised already internationally, I guess there's no point in using the proposed name, yet there's no code point to represent :3 isn't it? Whatever it means, do we have a code point for it already? If we do, maybe that'd be already enough. There are indeed already many emoji misused here and there due different visual meaning in different cultures (the triumph face, as example, the one with steam from nose which is used as "furious face" in some culture) If there's no code point, being apparently this popular, should Unicode consider one? 
Regards On Tue, Jan 24, 2017 at 5:12 PM, Leonardo Boiko wrote: > I find it curious that this community defines the ":3" emoji as "mmmm" or > "om nom nom". In my circles it's quite the frequent emoticon/emoji, but > I've never seen it used this way. Instead, they usually employ it as "cat > mouth" or "cat face", implying the mood of cuteness, perkiness or > mischievousness. (This is distinct from U+1F431 CAT FACE in that it > represents a human making a cat-like mouth, not an actual cat.) Here are a > few images found through a web search for "cat face": > > > > ? > ? > > > > ? > ? > Here's the relevant TVTropes article: > http://tvtropes.org/pmwiki/pmwiki.php/Main/CatSmile > > (TVTropes, incidentally, is one of the many web forums which has a :3 > textual emoji.) > > And the KnowYourMeme page: > http://knowyourmeme.com/memes/3-cat-face > > > > > > > 2017-01-24 14:39 GMT-02:00 Andrea Giammarchi > : > >> I'd like to bring to your attention a request, about a common emoticon, >> that has apparently no equivalent yet in the Emoji standard. >> >> This was a PR to the Twemoji project: >> https://github.com/twitter/twemoji/issues/199 >> >> The author also created a proper PDF explaining all the reasons: >> Proposal for CURLY LIPS Emoji.pdf >> >> >> I hope this can be considered in the near future as possible extra face. >> >> Thanks in advance for any sort of outcome. >> >> Best Regards >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: :3-0.jpg Type: image/jpeg Size: 29144 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: :3-5.jpg Type: image/jpeg Size: 5179 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: :3-3.jpg Type: image/jpeg Size: 15171 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: :3-1.jpg Type: image/jpeg Size: 15782 bytes Desc: not available URL: From leoboiko at namakajiri.net Tue Jan 24 22:44:20 2017 From: leoboiko at namakajiri.net (Leonardo Boiko) Date: Wed, 25 Jan 2017 02:44:20 -0200 Subject: Curly Lips Code Point Proposal In-Reply-To: References: Message-ID: Undoubtedly so. That's why U+1F481 INFORMATION DESK PERSON ?? is listed with the keyword "sassy" in the Unicode emoji table (besides "tipping hand"). Which helps a lot, because the keywords are used by input methods to search characters; if no one bothered to keep track of how people are using emoji, then people would try looking for the "sassy" gesture and find nothing, and they'd have to learn that it's called "information desk person", even though no one uses it with this meaning. Precisely because language (and symbolic systems like emoji) are in flux, it's a good idea trying to document how it's used. 2017-01-25 2:35 GMT-02:00 Fritz Gheen : > "There are indeed already many emoji misused here and there..." > > I'd venture to say most emoji are divorced from their original intent. > Help Desk Lady is one of the most popular emoji...and I can't recall ever > seeing someone use it for that reason. I personally use Rocket emoji > mostly to mean, "I'm taking-off from home." And then there's aubergine =) > > I'd like to think no emoji is "misused." People employ emoji outside of > their original or intended meaning, and that's beautiful: language is > fluid; it evolves. > > > > > > On Wed, Jan 25, 2017 at 12:39 AM, Andrea Giammarchi < > andrea.giammarchi at gmail.com> wrote: > >> I wouldn't stereotype "this community" already, as it's a single person >> request and maybe a single person common use case. 
>> >> However, I have seen mostly on Twitter the usage of :3 to indicate >> "engagement" in the sense of "interest", or "I'm digging it" but if there's >> a meaning widely recognised already internationally, I guess there's no >> point in using the proposed name, yet there's no code point to represent :3 >> >> isn't it? >> >> Whatever it means, do we have a code point for it already? >> >> If we do, maybe that'd be already enough. >> >> There are indeed already many emoji misused here and there due different >> visual meaning in different cultures (the triumph face, as example, the one >> with steam from nose which is used as "furious face" in some culture) >> >> If there's no code point, being apparently this popular, should Unicode >> consider one? >> >> Regards >> >> >> >> >> >> >> >> On Tue, Jan 24, 2017 at 5:12 PM, Leonardo Boiko >> wrote: >> >>> I find it curious that this community defines the ":3" emoji as "mmmm" >>> or "om nom nom". In my circles it's quite the frequent emoticon/emoji, but >>> I've never seen it used this way. Instead, they usually employ it as "cat >>> mouth" or "cat face", implying the mood of cuteness, perkiness or >>> mischievousness. (This is distinct from U+1F431 CAT FACE in that it >>> represents a human making a cat-like mouth, not an actual cat.) Here are a >>> few images found through a web search for "cat face": >>> >>> >>> >>> ? >>> ? >>> >>> >>> >>> ? >>> ? >>> Here's the relevant TVTropes article: >>> http://tvtropes.org/pmwiki/pmwiki.php/Main/CatSmile >>> >>> (TVTropes, incidentally, is one of the many web forums which has a :3 >>> textual emoji.) >>> >>> And the KnowYourMeme page: >>> http://knowyourmeme.com/memes/3-cat-face >>> >>> >>> >>> >>> >>> >>> 2017-01-24 14:39 GMT-02:00 Andrea Giammarchi < >>> andrea.giammarchi at gmail.com>: >>> >>>> I'd like to bring to your attention a request, about a common emoticon, >>>> that has apparently no equivalent yet in the Emoji standard. 
>>>> >>>> This was a PR to the Twemoji project: >>>> https://github.com/twitter/twemoji/issues/199 >>>> >>>> The author also created a proper PDF explaining all the reasons: >>>> Proposal for CURLY LIPS Emoji.pdf >>>> >>>> >>>> I hope this can be considered in the near future as possible extra face. >>>> >>>> Thanks in advance for any sort of outcome. >>>> >>>> Best Regards >>>> >>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: :3-1.jpg Type: image/jpeg Size: 15782 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: :3-5.jpg Type: image/jpeg Size: 5179 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: :3-0.jpg Type: image/jpeg Size: 29144 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: :3-3.jpg Type: image/jpeg Size: 15171 bytes Desc: not available URL: From richard.wordingham at ntlworld.com Wed Jan 25 02:13:10 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 25 Jan 2017 08:13:10 +0000 Subject: Implications of Logical Order Exception Property Message-ID: <20170125081310.5bfe5ce5@JRWUBU2> What is the significance of logical_order_exception being true for a character? TUS 9.0 Section 4.3 appears to claim that such characters need to be rearranged 'logically' for searching and sorting. However, I cannot see how they need to be rearranged for searching. Is this property a general warning, or does it mean that swapping with the *next* character gives a less bad sorting experience? For example, why doesn't U+17CC KHMER SIGN ROBAT have the property? Consonant + ROBAT has to be rearranged to ROBAT + consonant for sorting - ROBAT is a repha stored after rather than before the *visual* base consonant. Richard. 
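
To make the question concrete, here is a minimal Python sketch of the "swap with the next character" reading. The code point set is an assumption: it hard-codes only the Thai and Lao prevowels (U+0E40..U+0E44 and U+0EC0..U+0EC4) instead of reading the Logical_Order_Exception property from the Unicode Character Database, and rearrange_for_sorting is a hypothetical helper, not an API of any collation library.

```python
# Assumed, partial subset of characters with Logical_Order_Exception=Yes:
# the Thai and Lao prevowels, which are stored before the consonant they follow
# in pronunciation.
LOGICAL_ORDER_EXCEPTION = set(range(0x0E40, 0x0E45)) | set(range(0x0EC0, 0x0EC5))


def rearrange_for_sorting(text: str) -> str:
    """Swap each listed prevowel with the character that follows it."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if ord(chars[i]) in LOGICAL_ORDER_EXCEPTION:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return "".join(chars)


# THAI CHARACTER SARA E + THAI CHARACTER KO KAI, in stored (visual) order.
word = "\u0e40\u0e01"
print([f"U+{ord(c):04X}" for c in rearrange_for_sorting(word)])
```

Khmer text passes through this sketch unchanged because U+17CC KHMER SIGN ROBAT is not in the set, so a consonant + ROBAT sequence would need a separate rule if it is to be reordered for sorting.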
From fgheen at gmail.com Tue Jan 24 22:35:29 2017 From: fgheen at gmail.com (Fritz Gheen) Date: Wed, 25 Jan 2017 11:35:29 +0700 Subject: Curly Lips Code Point Proposal In-Reply-To: References: Message-ID: "There are indeed already many emoji misused here and there..." I'd venture to say most emoji are divorced from their original intent. Help Desk Lady is one of the most popular emoji...and I can't recall ever seeing someone use it for that reason. I personally use Rocket emoji mostly to mean, "I'm taking-off from home." And then there's aubergine =) I'd like to think no emoji is "misused." People employ emoji outside of their original or intended meaning, and that's beautiful: language is fluid; it evolves. On Wed, Jan 25, 2017 at 12:39 AM, Andrea Giammarchi < andrea.giammarchi at gmail.com> wrote: > I wouldn't stereotype "this community" already, as it's a single person > request and maybe a single person common use case. > > However, I have seen mostly on Twitter the usage of :3 to indicate > "engagement" in the sense of "interest", or "I'm digging it" but if there's > a meaning widely recognised already internationally, I guess there's no > point in using the proposed name, yet there's no code point to represent :3 > > isn't it? > > Whatever it means, do we have a code point for it already? > > If we do, maybe that'd be already enough. > > There are indeed already many emoji misused here and there due different > visual meaning in different cultures (the triumph face, as example, the one > with steam from nose which is used as "furious face" in some culture) > > If there's no code point, being apparently this popular, should Unicode > consider one? > > Regards > > > > > > > > On Tue, Jan 24, 2017 at 5:12 PM, Leonardo Boiko > wrote: > >> I find it curious that this community defines the ":3" emoji as "mmmm" or >> "om nom nom". In my circles it's quite the frequent emoticon/emoji, but >> I've never seen it used this way. 
Instead, they usually employ it as "cat >> mouth" or "cat face", implying the mood of cuteness, perkiness or >> mischievousness. (This is distinct from U+1F431 CAT FACE in that it >> represents a human making a cat-like mouth, not an actual cat.) Here are a >> few images found through a web search for "cat face": >> >> >> >> ? >> ? >> >> >> >> ? >> ? >> Here's the relevant TVTropes article: >> http://tvtropes.org/pmwiki/pmwiki.php/Main/CatSmile >> >> (TVTropes, incidentally, is one of the many web forums which has a :3 >> textual emoji.) >> >> And the KnowYourMeme page: >> http://knowyourmeme.com/memes/3-cat-face >> >> >> >> >> >> >> 2017-01-24 14:39 GMT-02:00 Andrea Giammarchi > >: >> >>> I'd like to bring to your attention a request, about a common emoticon, >>> that has apparently no equivalent yet in the Emoji standard. >>> >>> This was a PR to the Twemoji project: >>> https://github.com/twitter/twemoji/issues/199 >>> >>> The author also created a proper PDF explaining all the reasons: >>> Proposal for CURLY LIPS Emoji.pdf >>> >>> >>> I hope this can be considered in the near future as possible extra face. >>> >>> Thanks in advance for any sort of outcome. >>> >>> Best Regards >>> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: :3-5.jpg Type: image/jpeg Size: 5179 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: :3-0.jpg Type: image/jpeg Size: 29144 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: :3-1.jpg Type: image/jpeg Size: 15782 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: :3-3.jpg Type: image/jpeg Size: 15171 bytes Desc: not available URL: From verdy_p at wanadoo.fr Wed Jan 25 10:51:13 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 25 Jan 2017 17:51:13 +0100 Subject: Curly Lips Code Point Proposal In-Reply-To: References: Message-ID: I'd say that the current icon shown by Google for that character does not mean anything to me; it looks like a ghost of some grotesque creature, possibly dancing, unrelated to any information desk, and not even a lady (maybe Google wanted it to be gender-neutral, as the character is not encoded and named to mean a woman). So I seriously doubt it will be used for its intended usage. 2017-01-25 5:44 GMT+01:00 Leonardo Boiko : > Undoubtedly so. That's why U+1F481 INFORMATION DESK PERSON 💁 is listed > with the keyword "sassy" in the Unicode emoji table (besides "tipping > hand"). Which helps a lot, because the keywords are used by input methods > to search characters; if no one bothered to keep track of how people are > using emoji, then people would try looking for the "sassy" gesture and find > nothing, and they'd have to learn that it's called "information desk > person", even though no one uses it with this meaning. > > Precisely because language (and symbolic systems like emoji) are in flux, > it's a good idea trying to document how it's used. > > > 2017-01-25 2:35 GMT-02:00 Fritz Gheen : > >> "There are indeed already many emoji misused here and there..." >> >> I'd venture to say most emoji are divorced from their original intent. >> Help Desk Lady is one of the most popular emoji...and I can't recall ever >> seeing someone use it for that reason. I personally use Rocket emoji >> mostly to mean, "I'm taking-off from home." And then there's aubergine >> =) >> >> I'd like to think no emoji is "misused." People employ emoji outside of >> their original or intended meaning, and that's beautiful: language is >> fluid; it evolves.
>> >> >> >> >> >> On Wed, Jan 25, 2017 at 12:39 AM, Andrea Giammarchi < >> andrea.giammarchi at gmail.com> wrote: >> >>> I wouldn't stereotype "this community" already, as it's a single person >>> request and maybe a single person common use case. >>> >>> However, I have seen mostly on Twitter the usage of :3 to indicate >>> "engagement" in the sense of "interest", or "I'm digging it" but if there's >>> a meaning widely recognised already internationally, I guess there's no >>> point in using the proposed name, yet there's no code point to represent :3 >>> >>> isn't it? >>> >>> Whatever it means, do we have a code point for it already? >>> >>> If we do, maybe that'd be already enough. >>> >>> There are indeed already many emoji misused here and there due different >>> visual meaning in different cultures (the triumph face, as example, the one >>> with steam from nose which is used as "furious face" in some culture) >>> >>> If there's no code point, being apparently this popular, should Unicode >>> consider one? >>> >>> Regards >>> >>> >>> >>> >>> >>> >>> >>> On Tue, Jan 24, 2017 at 5:12 PM, Leonardo Boiko >> > wrote: >>> >>>> I find it curious that this community defines the ":3" emoji as "mmmm" >>>> or "om nom nom". In my circles it's quite the frequent emoticon/emoji, but >>>> I've never seen it used this way. Instead, they usually employ it as "cat >>>> mouth" or "cat face", implying the mood of cuteness, perkiness or >>>> mischievousness. (This is distinct from U+1F431 CAT FACE in that it >>>> represents a human making a cat-like mouth, not an actual cat.) Here are a >>>> few images found through a web search for "cat face": >>>> >>>> >>>> >>>> ? >>>> ? >>>> >>>> >>>> >>>> ? >>>> ? >>>> Here's the relevant TVTropes article: >>>> http://tvtropes.org/pmwiki/pmwiki.php/Main/CatSmile >>>> >>>> (TVTropes, incidentally, is one of the many web forums which has a :3 >>>> textual emoji.) 
>>>> >>>> And the KnowYourMeme page: >>>> http://knowyourmeme.com/memes/3-cat-face >>>> >>>> >>>> >>>> >>>> >>>> >>>> 2017-01-24 14:39 GMT-02:00 Andrea Giammarchi < >>>> andrea.giammarchi at gmail.com>: >>>> >>>>> I'd like to bring to your attention a request, about a common emoticon, >>>>> that has apparently no equivalent yet in the Emoji standard. >>>>> >>>>> This was a PR to the Twemoji project: >>>>> https://github.com/twitter/twemoji/issues/199 >>>>> >>>>> The author also created a proper PDF explaining all the reasons: >>>>> Proposal for CURLY LIPS Emoji.pdf >>>>> >>>>> >>>>> I hope this can be considered in the near future as possible extra >>>>> face. >>>>> >>>>> Thanks in advance for any sort of outcome. >>>>> >>>>> Best Regards >>>>> >>>> >>>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: :3-1.jpg Type: image/jpeg Size: 15782 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: :3-3.jpg Type: image/jpeg Size: 15171 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: :3-0.jpg Type: image/jpeg Size: 29144 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: :3-5.jpg Type: image/jpeg Size: 5179 bytes Desc: not available URL: From richard.wordingham at ntlworld.com Wed Jan 25 13:10:15 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 25 Jan 2017 19:10:15 +0000 Subject: Implications of Logical Order Exception Property In-Reply-To: <20170125081310.5bfe5ce5@JRWUBU2> References: <20170125081310.5bfe5ce5@JRWUBU2> Message-ID: <20170125191015.4a1c4d59@JRWUBU2> On Wed, 25 Jan 2017 08:13:10 +0000 Richard Wordingham wrote: > What is the significance of logical_order_exception being true for a > character? 
> > TUS 9.0 Section 4.3 appears to claim that such characters need to be > rearranged 'logically' for searching and sorting. > > However, I cannot see how they need to be rearranged for searching. > > Is this property a general warning, or does it mean that swapping with > the *next* character gives a less bad sorting experience? For > example, why doesn't U+17CC KHMER SIGN ROBAT have the property? > Consonant + ROBAT has to be rearranged to ROBAT + consonant for > sorting - ROBAT is a repha stored after rather than before the > *visual* base consonant. After some further research, I see that it is relevant to Revisions 9 and 11 of the Unicode Collation Algorithm. For earlier and later revisions, its effects were defined by the collation element table rather than the Unicode Character Database. The reordering for ROBAT is the wrong way round for the property to be applicable in its original meaning. I can't find any other formal requirement for the property. I now have a clutch of errors to report on Unicode's use of the term 'logical order' and references to logical_order_exception: 1) Claims that Thai is not encoded in logical order in Technical Report 10 (Collation) UCD file IndicPositionalCategory.txt 2) Claims that logical_order_exception is relevant for searching (TUS, as above) Should I make this one report or three reports? Richard. 
From markus.icu at gmail.com Wed Jan 25 13:27:52 2017 From: markus.icu at gmail.com (Markus Scherer) Date: Wed, 25 Jan 2017 11:27:52 -0800 Subject: Implications of Logical Order Exception Property In-Reply-To: <20170125191015.4a1c4d59@JRWUBU2> References: <20170125081310.5bfe5ce5@JRWUBU2> <20170125191015.4a1c4d59@JRWUBU2> Message-ID: On Wed, Jan 25, 2017 at 11:10 AM, Richard Wordingham < richard.wordingham at ntlworld.com> wrote: > I now have a clutch of errors to report on Unicode's use of the term > 'logical order' and references to logical_order_exception: > > 1) Claims that Thai is not encoded in logical order in > Technical Report 10 (Collation) > UCD file IndicPositionalCategory.txt > 2) Claims that logical_order_exception is relevant for searching (TUS, > as above) > It informs the construction of the DUCET and could be used to suppress_contractions in a search tailoring (see CLDR root collation data file). Should I make this one report or three reports? > I think one report would be better. I would wait a few days to see if there is more feedback here on the list. markus From richard.wordingham at ntlworld.com Wed Jan 25 14:00:38 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 25 Jan 2017 20:00:38 +0000 Subject: Implications of Logical Order Exception Property In-Reply-To: References: <20170125081310.5bfe5ce5@JRWUBU2> <20170125191015.4a1c4d59@JRWUBU2> Message-ID: <20170125200038.1b2746d8@JRWUBU2> On Wed, 25 Jan 2017 11:27:52 -0800 Markus Scherer wrote: > On Wed, Jan 25, 2017 at 11:10 AM, Richard Wordingham < > richard.wordingham at ntlworld.com> wrote: > > 2) Claims that logical_order_exception is relevant for searching > > (TUS, as above) > It informs the construction of the DUCET and could be used to > suppress_contractions in a search tailoring (see CLDR root collation > data file). It is irrelevant for searching.
If one created a collation just for searching, one wouldn't have to remove the effects of this irrelevant property. Richard. From markus.icu at gmail.com Wed Jan 25 14:35:33 2017 From: markus.icu at gmail.com (Markus Scherer) Date: Wed, 25 Jan 2017 12:35:33 -0800 Subject: Implications of Logical Order Exception Property In-Reply-To: <20170125200038.1b2746d8@JRWUBU2> References: <20170125081310.5bfe5ce5@JRWUBU2> <20170125191015.4a1c4d59@JRWUBU2> <20170125200038.1b2746d8@JRWUBU2> Message-ID: On Wed, Jan 25, 2017 at 12:00 PM, Richard Wordingham < richard.wordingham at ntlworld.com> wrote: > > > 2) Claims that logical_order_exception is relevant for searching > > > (TUS, as above) > > > It informs the construction of the DUCET and could be used to > > suppress_contractions in a search tailoring (see CLDR root collation > > data file). > > It is irrelevant for searching. If one created a collation just for > searching, one wouldn't have to remove the effects of this irrelevant > property. > It narrows match boundaries and improves performance. markus From richard.wordingham at ntlworld.com Wed Jan 25 15:44:49 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 25 Jan 2017 21:44:49 +0000 Subject: Implications of Logical Order Exception Property In-Reply-To: References: <20170125081310.5bfe5ce5@JRWUBU2> <20170125191015.4a1c4d59@JRWUBU2> <20170125200038.1b2746d8@JRWUBU2> Message-ID: <20170125214449.53d3455b@JRWUBU2> On Wed, 25 Jan 2017 12:35:33 -0800 Markus Scherer wrote: > On Wed, Jan 25, 2017 at 12:00 PM, Richard Wordingham < > richard.wordingham at ntlworld.com> wrote: > > > > > 2) Claims that logical_order_exception is relevant for searching > > > > (TUS, as above) > > > > > It informs the construction of the DUCET and could be used to > > > suppress_contractions in a search tailoring (see CLDR root > > > collation data file).
> > > > It is irrelevant for searching. If one created a collation just for > > searching, one wouldn't have to remove the effects of this > > irrelevant property. > > > > It narrows match boundaries and improves performance. What is 'it'? If 'it' means 'removing the effects...', then it would be irrelevant in a collation created just for searching. The effects wouldn't be there to complicate matters. A search-only collation for normalised, correctly spelt Thai would be simple; it would have no contractions unless you wanted to claim that words containing did not contain .) Richard. From christoph.paeper at crissov.de Fri Jan 27 05:16:22 2017 From: christoph.paeper at crissov.de (Christoph Päper) Date: Fri, 27 Jan 2017 12:16:22 +0100 Subject: Curly Lips Code Point Proposal In-Reply-To: References: Message-ID: <3E7FB424-A62F-4782-B421-11686246F2BB@crissov.de> Leonardo Boiko : > > That's why U+1F481 INFORMATION DESK PERSON 💁 is listed with the keyword "sassy" in the Unicode emoji table (besides "tipping hand"). Which helps a lot, because the keywords are used by input methods to search characters; if no one bothered to keep track of how people are using emoji, then people would try looking for the "sassy" gesture and find nothing, and they'd have to learn that it's called "information desk person", even though no one uses it with this meaning. > > Precisely because language (and symbolic systems like emoji) are in flux, it's a good idea trying to document how it's used. ?? Maybe, but that's not what's actually done. s/(butt|boob|phall|penis|vulva|vagina|genital|orgasm|sex|69|fart)/ > 🍆 *eggplant | aubergine | vegetable ? > 🍑 *peach | fruit ? > 🌷 *tulip | flower ? > ♋ *Cancer | crab | zodiac ? > 💦 *sweat droplets | comic | splashing | sweat ? > 💨 *dashing away | comic | dash | running ?
From verdy_p at wanadoo.fr Sat Jan 28 11:24:19 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 28 Jan 2017 18:24:19 +0100 Subject: Pagus symbol Message-ID: See Sample [1] The symbol that is shown near some villages (Cuce, Cice, Bruts) on this old map is for "pagus" (plural "pagi"), which is an old territorial unit grouping several villages, and would more or less map to today's cantons in France (or "pays" in today's rural speech), or counties in England (however smaller than counties). [2] It looks like an ideogram used in Roman or medieval periods (in the example above it appears later on a map of the 17th century). I've seen it several times (not just on maps) with minor variations. It looks like two symbolized bell towers with a top platform holding a Christian cross, both surrounding the circle (locating the village). It gives higher importance to these places than to other surrounding villages that are administered from the pagus. Are there other examples of symbols used on maps or old judiciary acts that could be encoded? [1] https://commons.wikimedia.org/wiki/File:Tabula_ducatus_britanniae_gallis_-_Sud_Rennes.png [2] https://en.wikipedia.org/wiki/Pagus -------------- next part -------------- An HTML attachment was scrubbed...
URL: From verdy_p at wanadoo.fr Sat Jan 28 11:36:50 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 28 Jan 2017 18:36:50 +0100 Subject: Pagus symbol In-Reply-To: References: Message-ID: Other example, same period in Western Russia: the symbol is less "ideographic" and colored in red, it clearly shows a church bell tower and a dependent building: https://commons.wikimedia.org/wiki/File:Atlas_Van_der_Hagen-KW1049B10_032-RVSSIAE_Vulgo_MOSCOVIA_dictae,_Pars_Occidentalis.jpeg Same thing in England https://commons.wikimedia.org/wiki/File:Atlas_Van_der_Hagen-KW1049B11_004-A_NEW_MAP_OF_THE_KINGDOME_of_ENGLAND,_Representing_the_Princedome_of_WALES,_and_other_PROVINCES,_CITIES,_MARKET_TOWNS,_with_the_ROADS_from_TOWN_to_TOWN.jpeg Other variant (two towers or high houses): https://commons.wikimedia.org/wiki/File:Arae_Flaviae_tab_peut.jpg 2017-01-28 18:24 GMT+01:00 Philippe Verdy : > See Sample [1] > > The symbol that is shown near some villages (Cuce, Cice, Bruts) on this > old map is for "pagus" (plural "pagi") and is an old territorial unit > grouping several villages, and would more or less map to today's cantons > in France (or "pays" in today's rural speech), or counties in England > (however smaller than counties). [2] > > It looks like an ideogram used in Roman or medieval periods (in the > example above it appears later on a map of the 17th century). I've seen it > several times (not just on maps) with minor variations. It looks like two > symbolized bell towers with a top platform holding a Christian cross, both > surrounding the circle (locating the village). It gives higher importance > to these places than other surrounding villages that are administered from > the pagus. > > Are there other examples of symbols used on maps or old judiciary acts > that could be encoded?
> > [1] https://commons.wikimedia.org/wiki/File:Tabula_ducatus_ > britanniae_gallis_-_Sud_Rennes.png > [2] https://en.wikipedia.org/wiki/Pagus > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmusf at ix.netcom.com Sat Jan 28 14:39:58 2017 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sat, 28 Jan 2017 12:39:58 -0800 Subject: Pagus symbol In-Reply-To: References: Message-ID: <4021a0fb-44f6-53f1-b05d-d3806adae28b@ix.netcom.com> An HTML attachment was scrubbed... URL: